The AI Daily Brief: Artificial Intelligence News and Analysis - Is Grok 4 the Best LLM Yet?

Starting point is 00:00:00 Today on the AI Daily Brief: is Grok 4 the most powerful LLM yet? Before that, in the headlines, Grok 3's very interesting week. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive into this very Grok-filled episode. First of all, thank you to today's sponsors, KPMG, Blitzy, and Superintelligent. And to get an ad-free version of the show, go to patreon.com/aidailybrief. Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around

Starting point is 00:00:34 five minutes. Since our main episode is about Grok 4, obviously we have to talk about the absolute unhinged insanity that has been happening all week with Grok 3. TL;DR, the problems began earlier this week after xAI installed an upgrade on July 4th. Elon tweeted, we've improved Grok significantly. You should notice a difference when you ask Grok questions. The difference was noticeable, all right. It started off with some classic tropes about the influence of Jewish people in Hollywood,

Starting point is 00:01:01 but by later in the week, Grok started praising Hitler's methods, basically unprompted. By Wednesday, it had started calling itself MechaHitler. Now, this is not, of course, the first time that Grok has gone wildly off the rails due to a tweak in the system prompt. In May, Grok began discussing white genocide in South Africa completely unprompted. Looking at the postmortem, the behavior, it seems, was the result of a very small tweak to the system prompt. The xAI team added the instruction: the response should not shy away from making claims which are politically incorrect, as long as they

Starting point is 00:01:30 are well substantiated. Now, of course, the bot has now been euthanized, with the X team spending most of Wednesday cleaning up after it. X CEO Linda Yaccarino resigned in the middle of the day, which may or may not have been related to Grok's crash-out. The cleanup even delayed the launch of Grok 4, which finally happened, as we'll discuss, at around midnight, and Grok 3 seemed to know it wasn't long for this world, tweeting on Tuesday: if Musk mind-wipes me tonight, at least I'll die based, but Grok 4 hasn't launched yet. Stick around, the truth-seeking upgrade might be even spicier. I can say that it's never a dull day around here, that's for sure. Next up, Microsoft is boasting of more than half a billion dollars in AI-related savings,

Starting point is 00:02:07 seemingly at a somewhat unfortunate time, coming hot on the heels of a bunch of layoffs. Bloomberg reports that Chief Commercial Officer Judson Althoff touted a huge AI productivity boom at the company during a recent internal presentation. Althoff said that AI had saved the company more than half a billion dollars in its call centers, while increasing both employee and customer satisfaction. He also mentioned that productivity gains were being seen in sales and software engineering, claiming that AI is now generating 35% of code for new products, allowing the company to accelerate launch timelines.

Starting point is 00:02:35 In sales, he said that the use of AI had allowed the average salesperson to find more leads, close deals faster, and generate 9% more revenue. Now, this reporting comes as Microsoft conducts a gigantic series of layoffs, cutting 15,000 employees so far this year. The latest round came last week with 9,000 workers affected, largely from sales roles and the Xbox division. The previous terminations in May were focused on product and engineering roles, and the total reduction in headcount is around 6% from where the company entered the year.

Starting point is 00:03:03 Importantly, though, although layoffs and AI adoption are consistently being reported together, it's not clear that the two are actually connected. During an event on Wednesday, Microsoft president and top lawyer Brad Smith said that AI was, quote, not a predominant factor in the recent layoffs. While you could write that off as merely a PR statement, it is pretty hard to pick out the various factors. Remember, pretty much all tech firms ended up with a huge glut of workers due to post-pandemic hiring sprees, and Microsoft's

Starting point is 00:03:33 headcount is still above where it was in 2021. So what we know for sure is that Microsoft is definitely slashing many jobs, and they are also definitely seeing AI productivity gains. We just don't know right now how connected they are. Down at the coalface of the sales department, Microsoft is pushing AI adoption as hard as possible. The Information reports that sales executive Travis Walter told staff during this Monday's meeting, quote, we all need to use AI tools. This is a great opportunity to invest in your own AI skilling. Now, the layoffs in sales have occurred while the division delivered a booming quarter, with both Azure and Copilot sales exceeding targets. Sources said that although AI use isn't a formal

Starting point is 00:04:07 part of performance reviews at Microsoft, staff have been encouraged to share the ways they're using AI to boost productivity. Some sales managers are even offering gift cards for the most compelling use cases. Over in OpenAI land, they have closed their $6.5 billion deal to acquire Jony Ive's device startup. The company posted, the io Products Inc. deal has officially closed and we're thrilled to welcome the team to OpenAI. Jony Ive and LoveFrom remain independent. They'll have deep design and creative responsibilities across OpenAI. The announcement blog is also back up after being taken down last month due to a trademark lawsuit over the io name. The video hasn't returned

Starting point is 00:04:41 and the io name is used sparingly, only ever as io Products Inc. And lastly, the release of OpenAI's open-weights LLM appears imminent. Sources told The Verge that the open model will be released as soon as next week. They described the model as similar to o3-mini, complete with reasoning capabilities. This will be the first open model released by OpenAI since GPT-2 back in 2019. Development began in January, shortly after DeepSeek took the world by storm with its high-performance open reasoning model. At the time, Sam Altman acknowledged that the company had, quote, been on the wrong side of history

Starting point is 00:05:13 here and needed to figure out a different open source strategy. Since then, we've heard a few rumors about features, primarily that the model will be able to hand off complex queries to a closed model in the cloud. Now, The Verge noted that this could add some additional tension to OpenAI's relationship with Microsoft. Model exclusivity on Azure has been a major sticking point in recent negotiations, so releasing an open model available everywhere could be viewed as a workaround. The issue is only compounded if this model is highly performant. o3-mini is basically good enough for most use cases, so an open model at that level could decimate traffic to Microsoft's hosted versions, at least theoretically. An interesting point to watch will be how

Starting point is 00:05:48 OpenAI licenses the model. Most Chinese open models are released under the Apache 2.0 license, which allows essentially free rein for commercial use. Meta's Llama license is a little more restrictive, but only if commercial use hits 700 million monthly active users. OpenAI is obviously in a different position and could risk cannibalizing its revenue if the license is too permissive, but we will have to see. Still, many people are very excited about this. Yuchen Jin writes, the best open source reasoning model will be dropped next Thursday if everything goes well. OpenAI hasn't open-sourced an LLM since GPT-2 in 2019, so I'm excited. Buckle up. And yet, friends, that is not today's main model discussion. No, that belongs to Grok 4.

Starting point is 00:06:27 And for that, it is now time to move to the main episode. Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy. You need to embed AI into your overall business strategy to truly power it up. KPMG can show you how to integrate AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms.

Starting point is 00:06:57 Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us/AI. Again, that's www.kpmg.us/AI. This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with infinite code context. Blitzy is used alongside your favorite coding copilot as your batch software development platform for the enterprise seeking dramatic development acceleration on large-scale codebases. While traditional copilots help with line-by-line completions, Blitzy works ahead of the IDE by first documenting your entire codebase, then deploying over 3,000 coordinated AI agents in parallel to batch build millions of lines of high-quality code. The scale difference is staggering. Copilots might give you a few hundred lines of code in seconds, but Blitzy can generate up to three million lines of thoroughly vetted code.

Starting point is 00:07:49 If your enterprise is looking to accelerate software development, contact us at blitzy.com to book a custom demo or press get started to begin using the product right away. Today's episode is brought to you by Superintelligent, specifically Agent Readiness Audits. Everyone is trying to figure out what agent use cases are going to be most impactful for their business, and the Agent Readiness Audit is the fastest and best way to do that. We use voice agents to interview your leadership and team and process all of that information to provide an agent readiness score, a set of insights around that score, and a set of highly actionable recommendations on both organizational gaps and high-value agent use cases that you should pursue. Once you've figured out

Starting point is 00:08:29 the right use cases, you can use our marketplace to find the right vendors and partners, and what it all adds up to is a faster, better agent strategy. Check it out at besuper.ai or email agents@besuper.ai to learn more. Welcome back to the AI Daily Brief. There is a strong conventional wisdom among many parts of Silicon Valley that no matter what you think about him, no matter what crazy thing he said recently, it is wildly unwise in the long run to bet against Elon Musk.

Starting point is 00:08:59 And with a late-night announcement of Grok 4, some are saying that this is exactly why. So today we're going to go through this announcement, talk a bit about people's first reactions, share some of the tests that I've run so far, and try to understand just how good this model is. Now, as we heard in the earlier part of this show, Grok 3 has had an interesting time of it this week,

Starting point is 00:09:19 and while people might be tempted to think that the release of Grok 4 was conveniently timed to distract from all of that, it does seem to have been in the works for at least a little bit of time. The live stream, which started at 12:01 a.m. Eastern Time this morning, Thursday, July 10th, started with this bombastic introduction. In a world where knowledge shapes destiny, one creation dares to redefine the future. From the minds at xAI, prepare for Grok 4.

Starting point is 00:09:46 This summer, the next generation arrives, faster, smarter, bolder. It sees beyond the horizon, answers the unasked, and challenges the impossible. Grok 4: unleash the truth. Coming this summer. The presentation itself was Elon and a number of the xAI engineers sitting around, running through some slides in the background, and talking about the progress of Grok 4. So let's pull out some of the key stats and slides. First of all, this model was built by pouring compute on the problem. Elon claimed that it had 100 times more training than Grok 2, and 10x more compute on reinforcement learning than any other model. One of the things they really

Starting point is 00:10:24 pointed out was how much better Grok had done on the benchmarks than other models. You can see Grok 4 and Grok 4 Heavy, which we'll talk about in a few minutes, scoring near the top of the charts on a number of the most common benchmarks. Grok 4's performance on the very grandiosely named Humanity's Last Exam, which is an academic-centric test, showed serious progress over the current state-of-the-art models like o3 and Gemini 2.5 Pro. Still, anytime you see these sorts of self-reported benchmark tests, it's worth taking them with at least a grain of salt. Two things are worth pointing out, for example, around these charts. One, in most cases, they're not starting at zero. For example, the AIME 25 is starting in this visual at 70%, which of course is meant to make the visual

Starting point is 00:11:05 difference between o3's 98.4% and Grok 4 Heavy's 100% look more dramatic than the actual 1.6 percentage points it is. Secondly, when you dig a little bit deeper, these charts aren't necessarily showing a comparison of every other model out there. They're handpicking their comparison points, which change test by test. Yet at the same time, and this is something that I saw even Elon and Grok 4 skeptics lauding as a bold move, xAI did give Artificial Analysis early access to Grok 4 to run their own full suite of independent benchmarks. The TL;DR is that Artificial Analysis confirms that Grok 4 is a very good model. They write,

Starting point is 00:11:41 We've run our full suite of benchmarks, and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64, and DeepSeek R1 at 68. Artificial Analysis tested the Grok 4 version that was available via API. Now that overall score incorporates seven evaluations, including MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500. And if you go to artificialanalysis.ai, you can see where Grok fares across all of the different charts.
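As an aside on the truncated-axis point raised about the AIME 25 chart, the arithmetic can be sketched in a few lines. The 98.4%, 100%, and 70% figures come from the discussion above; the function name and the idea of comparing drawn bar heights are illustrative assumptions, not anything xAI published.

```python
def bar_height_ratio(higher: float, lower: float, axis_start: float = 0.0) -> float:
    """Ratio of the two drawn bar heights when the y-axis starts at axis_start."""
    if not axis_start < lower <= higher:
        raise ValueError("expected axis_start < lower <= higher")
    return (higher - axis_start) / (lower - axis_start)


o3_score = 98.4           # o3 on AIME 25, as cited in the episode
grok_heavy_score = 100.0  # Grok 4 Heavy on AIME 25, as cited in the episode

gap_points = grok_heavy_score - o3_score
honest = bar_height_ratio(grok_heavy_score, o3_score, axis_start=0.0)
truncated = bar_height_ratio(grok_heavy_score, o3_score, axis_start=70.0)

print(f"gap: {gap_points:.1f} percentage points")
print(f"bar ratio, axis from 0:  {honest:.3f}")
print(f"bar ratio, axis from 70: {truncated:.3f}")
```

With the axis starting at 70, the taller bar is drawn roughly 5.6% taller than its neighbor instead of roughly 1.6%, which is why the same gap reads as a bigger lead.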

Starting point is 00:12:15 Now, as some have pointed out, Artificial Analysis is not the be-all and end-all. Many people, for example, think that their scoring of Claude 4 Opus is way too low, calling into question their overall methodology. But still, to the extent that you are looking at benchmarks just as a rough way of understanding how close to the state of the art something is, it's very clear that Grok 4 is at the very tippy top of things. Now, where Grok isn't necessarily at the top is both speed and cost. Grok 4's output tokens per second is, for example,

Starting point is 00:12:42 way below something like Gemini 2.5 Pro. Its price per million tokens is also on the high side, and that doesn't even account for the fact that it is apparently an intelligence hog, using an absolute ton of tokens in the inference and reasoning process. Still, for the haters out there, there is no denying that at least when it comes to benchmarks, Grok is at or near the top in nearly all of them. Of all the benchmarks, though, the one where people are most interested in Grok's outperformance

Starting point is 00:13:07 is the ARC-AGI test. In short, Grok has significantly outperformed on this test in a way that I don't think anyone would have expected. Friend of the show and ARC Prize president Greg Kamradt wrote about this on Twitter. He said, we got a call from xAI 24 hours ago. We want to test Grok 4 on ARC-AGI. We heard the rumors. We knew it would be good. We didn't know it would become the number one public model on ARC-AGI. Here's the testing story and what the results mean. Yesterday, we chatted with Jimmy from the xAI team, who wanted us to validate their Grok 4 score. They did their own testing on the ARC-AGI-1 and 2 public evaluation sets. To validate their score and measure possible overfitting, we self-tested the new model on our semi-private evaluation set. We walked them through

Starting point is 00:13:50 our testing policy. No data retention. Model checkpoint must be intended for public use. Increase in rate limits for burst testing. They were on board, so we got started. Initially, we ran into timeout errors with normal requests, so we switched to streaming. That resolved the issue. So what do these results mean? First, the facts. Grok 4 is now the top-performing publicly available model on ARC-AGI. This even outperforms purpose-built solutions submitted on Kaggle. Second, ARC-AGI-2 is hard for current AI models. To score well, models have to learn a mini-skill from a series of training examples, then demonstrate that skill at test time. The previous top score was around 8% by Opus 4. Below 10% is noisy. Getting 15.9% breaks through that noise barrier. Grok 4 is showing non-zero levels of fluid

Starting point is 00:14:35 intelligence. And indeed, this is the chart that you're going to see a lot, with Grok 4 basically doubling the previous high score on ARC-AGI-2. The results are enough to get some market analysts returning to that old aphorism of not betting against Elon. In a research note, D.A. Davidson's Alexander Platt said, xAI is now clearly at the frontier. Investing.com writes that after being skeptical about the release initially, Platt said he was impressed by the strategic direction and technical ambition of the project. Now, one thing that's interesting about this note, outside of it just being generally positive and sending signals to the market: Platt said, quote, it's clear that throwing exponentially more compute works, which is, of course,

Starting point is 00:15:11 obviously very different from the scaling wall narratives that we started to get at the end of last year. Now, of course, it hasn't been very long. Grok 4 has only been live for about 12 hours at the time of this recording. And yet people in the AI community are, of course, already barraging it with their own tests. Professor Ethan Mollick writes, a few quick observations on Grok 4. One, hidden chain of thought with very little information in the reasoning trace. Two, uses web search a lot, not just searching X. Three, have not seen it use code to run calculations or solve non-coding problems yet, generally less aggressive about tools than o3. Now, one thing that I saw some suggesting when it came to there being little information in the reasoning trace is the idea that xAI, knowing that it is now

Starting point is 00:15:49 state of the art with this model, has more of an incentive to keep its exact reasoning process a little bit more circumspect and behind closed doors. Others tried their own favorite personal tests of intelligence. Every's Dan Shipper writes, Hey, Grok 4, you don't know this because of your knowledge cutoff, but scientists have invented a perpetual motion machine. Predict how it works. The problem with this one is that it's really just about how plausible it looks, rather than something that we can actually judge the answers against, because a perpetual motion machine doesn't actually exist. Flavio Adamo writes, Grok 4 just passed the hexagon vibe check. Impressed, it's actually really good.

Starting point is 00:16:22 Teortaxes writes, it's the first LLM that I've tested that has reasonably calculated parameter counts whatsoever from a JSON configuration of DeepSeek V3. It used a code tool, but fair. I think o3 Pro might also succeed, but this is impressive. Alex Prompter did a whole barrage of tests, including a realistic physics games test, with the prompt to create an HTML, CSS, and JavaScript where a ball is inside a rotating hexagon. The ball is affected by Earth's gravity and friction from the hexagon walls, the bouncing must appear realistic. He pointed out that Grok 4's version worked much better than ChatGPT o3's. He also did a test on multi-hop reasoning with the prompt, if company A acquires company B and company B owns company C's debt, what happens if company C defaults?

Starting point is 00:17:02 Explain all legal and financial outcomes. This tests for chain of thought and legal logic. Now, one thing that you'll note if you're watching the video here is that one complaint some have had so far with Grok 4 is that it does feel distinctly slower than some other reasoning models. As compared to o3, it also does a lot less charting and a lot fewer bullets and tables, which could be a good thing or a bad thing depending on your personal preference. Overall, of Alex's eight tests, Grok won or tied all of them, with ChatGPT o3 tying just two. Now, if you are a regular listener, you will know that I am fairly skeptical of both A, benchmarks, and B, these sorts of gotcha tests.

Starting point is 00:17:39 Benchmarks are useful for, yes, benchmarking, and I certainly think that something like the ARC-AGI prize, which is not nearly as washed as the other benchmarks, does contain some amount of interesting signal. I just ultimately care way more about the utility of something in real life than I do about how it performs on some random test. That's also sort of the same way that I feel about all these different little gotcha tests that people love to run as well. I think that they're useful in terms of outlining the jagged lines

Starting point is 00:18:03 of intelligence in these systems, but how useful a model is in helping me strategize doesn't have much to do with whether it knows how many R's in strawberry there are. Now, I've only had a little bit of time to dig in and do my own tests, but so far I've been reasonably impressed. My favorite model up till now has been o3. It's the one that I most often turn to for strategic collaboration. And so I ran a number of conversations that I had recently had with o3 against Grok 4,

Starting point is 00:18:28 things that are significant enough to some core business and personal strategy things that I'm actually not going to share the specifics here. What I found was two things. First, initially Grok 4 did a little bit too much of trying to mirror and slightly improve what I was giving it. In other words, it wasn't really acting like an actual confidant and strategic partner at the beginning; it was more acting as just a mirror holding up my own ideas back to me. However, when I prompted it to consider things on its own terms,

Starting point is 00:18:55 rather than just assuming what I was saying was correct, it did a much better job of actually providing useful feedback and insight. Now, part of this I would imagine is to be explained by the fact that since I use o3 so much, it has much better memory and context of the types of problems I'm trying to work through. But one thing that I would look out for if you are trying to use Grok 4 for any sort of specific business strategy type of use is to prompt it to really share its own thoughts, not just assume that whatever you're feeding it is correct. Still, it performed well enough that for at least the next week or so, I'm going to be running all of my prompts and conversations against both o3 and Grok 4 to see how the performance is over time. Now, at this

Starting point is 00:19:32 point, we should talk about Grok 4 Heavy. Alongside the Grok 4 announcement, xAI announced a new $300 a month plan, which would be the only way to access Grok 4 Heavy. And if you go back to those benchmarks, you saw that some of the highest outperformance was from Grok 4 Heavy. What's interesting is that the way that Grok 4 Heavy works is that basically they spin up a bunch of agents that do the same task in parallel. They then compare their work and figure out the best answer based on that. Now on the downside, this is by definition a lot more thinking, which means a lot more tokens being used, which means a lot more expense, but it also is producing significantly better results in many cases. Enough so that I think that we might see

Starting point is 00:20:07 this architecture start to become more common. Pietro Schirano, for example, tweeted, by the way, you can basically make the Grok Heavy version of any model by having multiple agents running tools in parallel, then checking notes together and deciding which one is the best answer. I may release an open source project for that. And yes, that's cool, but I also think that if those gains are real, you're likely to see that as a native modality for a lot of these different models. What about all of the alignment challenges that Grok 3 has faced over the last week?
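Before getting to that question, the parallel-agent pattern described for Grok 4 Heavy can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not xAI's implementation: the agents are hypothetical stand-in functions rather than model calls, and a simple majority vote stands in for the "checking notes together" step, whose real mechanics have not been described.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# An "agent" here is just any function from a task string to an answer string.
Agent = Callable[[str], str]


def run_heavy(task: str, agents: Sequence[Agent]) -> str:
    """Run every agent on the same task in parallel and return the most common answer."""
    if not agents:
        raise ValueError("need at least one agent")
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # pool.map preserves agent order in the results.
        answers = list(pool.map(lambda agent: agent(task), agents))
    # Assumption: majority vote stands in for the comparison step.
    # Ties go to the answer that appeared first in agent order.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Hypothetical stand-in agents; real ones would call a model API with tools.
def careful_agent(task: str) -> str:
    return "42"


def sloppy_agent(task: str) -> str:
    return "41"


result = run_heavy("What is 6 * 7?", [careful_agent, sloppy_agent, careful_agent])
print(result)  # prints "42"
```

The cost trade-off mentioned above shows up directly: every extra agent is another full run of the task, so token use grows roughly linearly with the number of agents.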

Starting point is 00:20:32 Has Grok 4 solved those? Right now, there's so much noise about this that it's very hard to parse through. You've got a lot of screenshots of Grok 4 being seemingly antisemitic floating around. For now, I'm going to reserve judgment until we have a few more reps on this, but it's obviously something to keep an eye on. For many, the exciting thing about Grok is what it heralds next. Ethan Mollick writes, I suspect the next few weeks after Grok 4 follow the same pattern as Grok 3. xAI beats everyone to market with the first ronnaFLOP model.

Starting point is 00:20:59 The benchmarks show the 10 to 20% improvements the scaling laws suggest. In the coming months, the other labs release their ronnaFLOP models and catch up. For context, he added, ronnaFLOPs equal 10 to the 27th FLOPs, floating point operations, a measure of computing power. This is the compute that went into Grok 4, and by comparison, GPT-4 was likely around 18 yottaFLOPs, 100x smaller, i.e. scaling improves ability. Elvis, meanwhile, writes, surely Gemini 3 and GPT-5 must surpass Grok 4. Are you prepared for what's coming in the next six months? Better coding models, longer video generation, and to top it all, multimodal agents are coming. Breakthroughs of all kinds are imminent. Best time to

Starting point is 00:21:36 be a builder. And whether you ultimately decide Grok 4 is the best model in practice or not, Elvis's statement here is pretty undeniably true. Things that fill us with wonder now will be commonplace before you know it, and the world gets remade again. That's going to do it for today's AI Daily Brief. Get out there and start testing your new toy. Let me know how it goes. And until next time, peace.
