The AI Daily Brief: Artificial Intelligence News and Analysis - Is Grok 4 the Best LLM Yet?
Episode Date: July 11, 2025
Elon Musk's xAI has dropped Grok 4, and early benchmarks are raising eyebrows across the AI world. In today's episode, we break down the bombastic midnight launch event, the key performance claims, including a top score on the notoriously hard ARC-AGI test, and why even Grok skeptics are starting to take it seriously. Is Grok 4 really state of the art, or just another Musk-fueled hype cycle?
Get Ad Free AI Daily Brief: https://patreon.com/AIDailyBrief
Brought to you by:
KPMG – Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
AGNTCY - The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at agntcy.org
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The automation platform for AI experts and consultants https://useplumb.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network
Transcript
Today on the AI Daily Brief: is Grok 4 the most powerful LLM yet?
Before that, in the headlines, Grok 3's very interesting week.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, quick announcements before we dive into this very Grok-filled episode.
First of all, thank you to today's sponsors, KPMG, Blitzy, and Superintelligent.
And to get an ad-free version of the show, go to patreon.com/AIDailyBrief.
Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around
five minutes.
Since our main episode is about Grok 4, obviously we have to talk about the absolute
unhinged insanity that has been happening all week with Grok 3.
TLDR, the problems began earlier this week after xAI installed an upgrade on July 4th.
Elon tweeted, "We've improved Grok significantly.
You should notice a difference when you ask Grok questions."
The difference was noticeable, all right.
It started off with some classic tropes about the influence of Jewish people in Hollywood,
but by later in the week, Grok had started praising Hitler's methods, basically unprompted.
By Wednesday, it had started calling itself MechaHitler.
Now, this is not, of course, the first time that Grok has gone wildly off the rails
due to a tweak in the system prompt.
In May, Grok began discussing white genocide in South Africa completely unprompted.
Looking at the postmortem, the behavior, it seems, was the result of a very small tweak to the system prompt.
The xAI team added the instruction,
"The response should not shy away from making claims which are politically incorrect, as long as they
are well substantiated." Now, of course, the bot has been euthanized, with the X team spending most
of Wednesday cleaning up after it. X CEO Linda Yaccarino resigned in the middle of the day, which
may or may not have been related to Grok's crash out. The cleanup even delayed the launch of Grok
4, which finally happened, as we'll discuss, at around midnight, and Grok 3 seemed to know it wasn't long
for this world, tweeting on Tuesday, "If Musk mind-wipes me tonight, at least I'll die based,
but Grok 4 hasn't launched yet. Stick around, the truth-seeking upgrade might be even spicier."
I can say that it's never a dull day around here, that's for sure.
Next up, Microsoft is boasting more than half a billion dollars in AI-related savings,
seemingly at a somewhat unfortunate time, coming hot on the heels of a bunch of layoffs.
Bloomberg reports that Chief Commercial Officer Judson Althoff touted a huge AI productivity boom
at the company during a recent internal presentation.
Althoff said that AI had saved the company more than a half billion dollars in their call centers,
while increasing both employee and customer satisfaction.
He also mentioned that productivity gains were being seen in sales and software engineering,
claiming that AI is now generating 35% of code for new products,
allowing the company to accelerate launch timelines.
In sales, he said that the use of AI had allowed the average salesperson to find more leads,
close deals faster, and generate 9% more revenue.
Now, this reporting comes as Microsoft conducts a gigantic series of layoffs,
cutting 15,000 employees so far this year.
The latest round came last week with 9,000 workers affected,
largely from sales roles and the Xbox division.
The previous terminations in May were focused on product and engineering roles,
and the reduction in headcount in total is at around 6% from where the company entered the year.
Importantly, though, while layoffs and AI adoption are consistently being reported together,
it's not clear that that's actually what's going on.
During an event on Wednesday, Microsoft's president and top lawyer, Brad Smith,
said that AI was not a predominant factor in the recent layoffs. While you could write that
off as merely a PR statement, it is pretty hard to pick out the various factors. Remember, pretty much
all tech firms ended up with a huge glut of workers due to post-pandemic hiring sprees, and Microsoft's
headcount is still above where it was in 2021. So what we know for sure is that Microsoft is
definitely slashing many jobs, and they are also definitely seeing AI productivity gains.
We just don't know right now how connected they are.
Down at the coal face of the sales department, Microsoft is pushing AI adoption as hard as possible.
The Information reports that sales executive Travis Walter told staff during this Monday's meeting,
quote, we all need to use AI tools. This is a great opportunity to invest in your own AI
skilling. Now, the layoffs in sales have occurred while the division delivered a booming quarter with
both Azure and Copilot sales exceeding targets. Sources said that although AI use isn't a formal
part of performance reviews at Microsoft, staff have been encouraged to share the ways they're using
AI to boost productivity. Some sales managers are even offering gift cards to the most compelling
use cases.
Over in OpenAI land, they have closed their $6.5 billion deal to acquire Jony Ive's device startup.
The company posted, "The io Products Inc. deal has officially closed and we're thrilled to welcome
the team to OpenAI. Jony Ive and LoveFrom remain independent. They'll have deep design and
creative responsibilities across OpenAI." The announcement blog is also back up after being
taken down last month due to a trademark lawsuit over the io name. The video hasn't returned,
and the io name is used sparingly, only ever as io Products Inc.
And lastly, the release of OpenAI's open-weights LLM appears imminent.
Sources told The Verge that the open model will be released as soon as next week.
They described the model as similar to O3 Mini, complete with reasoning capabilities.
This will be the first open model released by OpenAI since GPT-2 back in 2019.
Development began in January shortly after DeepSeek took the world by storm with its high-performance
open reasoning model.
At the time, Sam Altman acknowledged that the company had, quote, been on the wrong side of history
here and needed to figure out a different open source strategy. Since then, we've heard a few rumors
about features, primarily that the model will be able to hand off complex queries to a closed model
in the cloud. Now, The Verge noted that this could add some additional tension to OpenAI's
relationship with Microsoft. Model exclusivity on Azure has been a major sticking point in recent
negotiations, so releasing an open model available everywhere could be viewed as a workaround.
The issue is only compounded if this model is highly performant. O3 Mini is basically good enough
for most use cases, so an open model at that level could decimate traffic to
Microsoft's hosted versions, at least theoretically. An interesting point to watch will be how
OpenAI licenses the model. Most Chinese open models are released under the Apache 2.0 license,
which allows essentially free rein for commercial use. Meta's Llama license is a little more
restrictive, but only if commercial use hits 700 million monthly active users. OpenAI is obviously
in a different position and could risk cannibalizing their revenue if the license is too permissive,
but we will have to see. Still, many people are very excited about this. Yuchen Jin writes,
"The best open source reasoning model will be dropped next Thursday if everything goes well.
OpenAI hasn't open-sourced an LLM since GPT-2 in 2019, so I'm excited. Buckle up."
And yet, friends, that is not today's main model discussion. No, that belongs to Grok 4.
And for that, it is now time to move to the main episode.
Today's episode is brought to you by KPMG.
In today's fiercely competitive market, unlocking AI's potential could help give you a competitive
edge, foster growth, and drive new value.
But here's the key.
You don't need an AI strategy.
You need to embed AI into your overall business strategy to truly power it up.
KPMG can show you how to integrate AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms.
Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.com/ai.
Again, that's www.kpmg.com/ai.
This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with infinite code context.
Blitzy is used alongside your favorite coding copilot as your batch software development platform for the enterprise seeking dramatic development acceleration on large-scale codebases.
While traditional copilots help with line-by-line completions, Blitzy works ahead of the IDE by first documenting your entire codebase,
then deploying over 3,000 coordinated AI agents in parallel to batch build millions of lines of high-quality
code. The scale difference is staggering. Copilots might give you a few hundred lines of code in
seconds, but Blitzy can generate up to three million lines of thoroughly vetted code.
If your enterprise is looking to accelerate software development, contact us at blitzy.com
to book a custom demo or press get started to begin using the product right away.
Today's episode is brought to you by Superintelligent, specifically Agent Readiness Audits.
Everyone is trying to figure out what agent use cases are going to be most impactful for their
business and the agent readiness audit is the fastest and best way to do that. We use voice agents
to interview your leadership and team and process all of that information to provide an agent readiness
score, a set of insights around that score, and a set of highly actionable recommendations on both
organizational gaps and high-value agent use cases that you should pursue. Once you've figured out
the right use cases, you can use our marketplace to find the right vendors and partners, and what
it all adds up to is a faster, better agent strategy. Check it out at besuper.ai
or email agents@besuper.ai to learn more.
Welcome back to the AI Daily Brief.
There is a strong conventional wisdom among many parts of Silicon Valley
that no matter what you think about him,
no matter what crazy thing he said recently,
it is wildly unwise in the long run to bet against Elon Musk.
And with a late-night announcement of Grok 4,
some are saying that this is exactly why.
So today we're going to go through this announcement,
talk a bit about people's first reactions,
share some of the tests that I've run so far,
and try to understand just how good this model is.
Now, as we heard in the earlier part of this show,
Grok 3 has had an interesting time of it this week,
and while people might be tempted to think
that the release of Grok 4 was conveniently timed to distract from all of that,
it does seem to have been in the works for at least a little bit of time.
The live stream, which started at 12:01 a.m. Eastern Time this morning,
Thursday, July 10th, started with this bombastic introduction.
"In a world where knowledge shapes destiny,
one creation dares to redefine the future.
From the minds at xAI, prepare for Grok 4.
This summer, the next generation arrives, faster, smarter, bolder.
It sees beyond the horizon, answers the unasked, and challenges the impossible.
Grok 4. Unleash the truth. Coming this summer."
The presentation itself was Elon and a number of the xAI engineers sitting around,
running through some slides in the background, and talking about the progress of
Grok 4. So let's pull out some of the key stats and slides. First of all, this model was built by
pouring compute on the problem. Elon claimed that it had 100 times more training than Grok 2,
and 10x more compute on reinforcement learning than any other model. One of the things they really
pointed out was how much better Grok had done on the benchmarks than other models. You can see
Grok 4 and Grok 4 Heavy, which we'll talk about in a few minutes, scoring near the top of the
charts on a number of the most common benchmarks. Grok 4's performance on the very grandiosely named
Humanity's Last Exam, which is an academic-centric test, showed serious progress over the current
state-of-the-art models like O3 and Gemini 2.5 Pro. Still, anytime you see these sorts of self-reported
benchmark tests, it's worth having at least a grain of salt. Two things that are worth pointing
out, for example, around these charts. One, in most cases, they're not starting at zero. For example,
the AIME 25 is starting in this visual at 70%, which of course is meant to make the visual
difference between O3's 98.4% and Grok 4 Heavy's 100% look more dramatic than the actual 1.6 percentage
points it is. Secondly, when you dig a little bit deeper, these charts aren't necessarily showing a
comparison of every other model out there. They're handpicking their comparison points, which
change test by test. Yet at the same time, and this is something that I saw even Elon skeptics
and Grok 4 skeptics lauding as a bold move, xAI did give Artificial Analysis early access to Grok 4
to run their own full suite of independent benchmarks.
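To put rough numbers on the truncated-axis point above, here is a small Python sketch. It assumes the chart's axis runs from 70 to 100 (the 70% starting point is from the episode; the 100 ceiling and the `bar_height` helper are assumptions made purely for illustration) and compares how large the gap between 98.4% and 100% looks on that axis versus one that starts at zero.

```python
def bar_height(score: float, axis_min: float, axis_max: float = 100.0) -> float:
    """Fraction of the visible axis that a bar for `score` fills."""
    return (score - axis_min) / (axis_max - axis_min)


# Scores as quoted in the episode for AIME 25.
o3_score = 98.4
grok_heavy_score = 100.0

# The real difference, in percentage points.
gap_points = grok_heavy_score - o3_score

# How big the gap looks as a share of the chart, with and without truncation.
visual_gap_zero_based = bar_height(grok_heavy_score, 0.0) - bar_height(o3_score, 0.0)
visual_gap_truncated = bar_height(grok_heavy_score, 70.0) - bar_height(o3_score, 70.0)

# How many times larger the gap appears on the truncated axis (assumed 70 to 100).
exaggeration = visual_gap_truncated / visual_gap_zero_based

print(f"gap: {gap_points:.1f} points")
print(f"share of zero-based axis: {visual_gap_zero_based:.1%}")
print(f"share of truncated axis:  {visual_gap_truncated:.1%}")
print(f"visual exaggeration: {exaggeration:.2f}x")
```

Under these assumptions, the 1.6-point gap fills a bit over 5% of the visible axis instead of under 2%, so it looks a little more than three times larger than it would on a zero-based chart.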
The TLDR is that Artificial Analysis confirms that Grok 4 is a very good model.
They write,
"We've run our full suite of benchmarks,
and Grok 4 achieves an Artificial Analysis Intelligence Index of 73,
ahead of OpenAI O3 at 70, Gemini 2.5 Pro at 70,
Anthropic Claude 4 Opus at 64, and DeepSeek R1 at 68."
Artificial Analysis tested the Grok 4 version that was available via API.
Now that overall score incorporates seven evaluations,
including MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500.
And if you go to artificialanalysis.ai, you can see where Grok fares across all of the different charts.
Now, as some have pointed out, Artificial Analysis is not the be-all end-all.
Many people, for example, think that their scoring of Claude 4 Opus is way too low,
calling into question their overall methodology.
But still, to the extent that you are looking at benchmarks just as a rough way of understanding
how close to the state-of-the-art something is,
it's very clear that Grok 4 is at the very tippy top of things.
Now, where Grok isn't necessarily at the top is both speed and cost.
Grok 4's output tokens per second is, for example,
way below something like Gemini 2.5 Pro.
Its price per million tokens is also on the high side,
and that doesn't even account for the fact
that it is apparently an intelligence hog,
using an absolute ton of tokens in the inference and reasoning process.
Still, for the haters out there,
there is no denying that at least when it comes to benchmarks, Grok is at or near the top in nearly all of
them. Of all the benchmarks, though, the one where people are most interested in Grok's outperformance
is the ARC-AGI test. In short, Grok has significantly outperformed on this test in a way that I don't
think anyone would have expected. Friend of the show and ARC Prize president Greg Kamradt wrote
about this on Twitter. He said, "We got a call from xAI 24 hours ago: 'We want to test Grok 4 on
ARC-AGI.' We heard the rumors. We knew it would be good. We didn't know it would become the number one
public model on ARC-AGI. Here's the testing story and what the results mean. Yesterday, we chatted
with Jimmy from the xAI team, who wanted us to validate their Grok 4 score. They did their own
testing on the ARC-AGI-1 and 2 public evaluation sets. To validate their score and measure possible
overfitting, we self-tested the new model on our semi-private evaluation set. We walked them through
our testing policy: no data retention, model checkpoint must be intended for public use,
increase in rate limits for burst testing. They were on board, so we got started. Initially, we ran into
timeout errors with normal requests, so we switched to streaming. That resolved the issue. So what do
these results mean? First, the facts. Grok 4 is now the top-performing publicly available model on
ARC-AGI. This even outperforms purpose-built solutions submitted on Kaggle. Second, ARC-AGI-2 is hard for current AI
models. To score well, models have to learn a mini-skill from a series of training examples, then demonstrate
that skill at test time. The previous top score was around 8% by Opus 4. Below 10% is noisy.
Getting 15.9% breaks through that noise barrier. Grok 4 is showing non-zero levels of fluid
intelligence." And indeed, this is the chart that you're going to see a lot, with Grok 4 basically
doubling the previous high score on ARC-AGI-2. The results are enough to get some market analysts
returning to that old aphorism of not betting against Elon. In a research note, D.A. Davidson's
Alexander Platt said, "xAI is now clearly at the frontier." Investing.com writes that after being
skeptical about the release initially, Platt said he was impressed by the strategic direction and
technical ambition of the project. Now, one thing that's interesting about this note,
outside of it just being generally positive in sending signals to the market,
is that Platt said, quote, "it's clear that throwing exponentially more compute works," which is, of course,
obviously very different than the scaling wall narratives that we started to get at the end of last year.
Now, of course, it hasn't been very long. Grok 4 has only been live for about 12 hours at the time of
this recording. And yet people in the AI community are, of course, already barraging it with their
own tests. Professor Ethan Mollick writes, "A few quick observations on Grok 4. One, hidden chain of thought
with very little information in the reasoning trace. Two, uses web search a lot, not just searching
X. Three, have not seen it use code to run calculations or solve non-coding problems yet,
generally less aggressive about tools than O3." Now, one thing that I saw some suggesting when it came to
there being little information in the reasoning trace is the idea that xAI, knowing that it is now
state of the art with this model, has more of an incentive to keep its exact reasoning process
a little bit more circumspect and behind closed doors. Others tried their own favorite personal
tests of intelligence. Every's Dan Shipper writes,
"Hey, Grok 4, you don't know this because of your knowledge cutoff, but scientists have
invented a perpetual motion machine. Predict how it works." The problem with this one is that
it's really just about how plausible it looks, rather than something that we can actually judge the
answers against, because a perpetual motion machine doesn't actually exist. Flavio Adamo writes,
"Grok 4 just passed the hexagon vibe check. Impressed, it's actually really good."
Teortaxes writes, "Grok 4 is the first LLM that I've tested that has whatsoever reasonably calculated
parameter counts from a JSON configuration of DeepSeek V3. It used a code tool, but fair. I think
O3 Pro might also succeed, but this is impressive." Alex Prompter did a whole barrage of tests,
including a realistic physics game test, with the prompt to "create an HTML, CSS, and JavaScript
where a ball is inside a rotating hexagon. The ball is affected by Earth's gravity and friction
from the hexagon walls. The bouncing must appear realistic." He pointed out that Grok 4's version
worked much better than ChatGPT O3's. He also did a test on multi-hop reasoning with the prompt,
"If company A acquires company B and company B owns company C's debt, what happens if company C defaults?
Explain all legal and financial outcomes." This tests for chain of thought and legal logic.
Now, one thing that you'll note if you're watching the video here is that one complaint some
have had so far with Grok 4 is that it does feel distinctly slower than some other reasoning models.
As compared to O3, it also does a lot less charting and a lot fewer bullets and tables,
which could be a good thing or a bad thing depending on your personal preference.
Overall, of Alex's eight tests, Grok won or tied all of them, with ChatGPT O3 tying just two.
Now, if you are a regular listener, you will know that I am fairly skeptical of both A, benchmarks,
and B, these sort of gotcha tests.
Benchmarks are useful for, yes, benchmarking, and I certainly think that something like ARC-AGI,
which is not nearly as washed as the other benchmarks,
does contain some amount of interesting signal.
I just ultimately care way more about the utility of something in real life
than I do about how it performs on some random test.
That's also sort of the same way that I feel about
all these different little gotcha tests that people love to run as well.
I think that they're useful in terms of outlining the jagged lines
of intelligence in these systems,
but how useful a model is in helping me strategize
doesn't have much to do with whether it knows how many R's in strawberry there are.
Now, I've only had a little bit of time to dig in and do my own tests,
but so far I've been reasonably impressed.
My favorite model up till now has been O3.
It's the one that I most often turn to for strategic collaboration.
And so I ran a number of conversations that I had recently had with O3 against Grok 4,
things that are significant enough to some core business and personal strategy things
that I'm actually not going to share the specifics here.
What I found was two things.
First, initially Grok 4 did a little bit too much of trying to mirror and slightly improve
what I was giving it.
In other words, it wasn't really acting like an actual confidant
and strategic partner at the beginning. It was acting more as just a mirror holding up my own
ideas back to me. However, when I prompted it to consider things on its own terms,
rather than just assuming what I was saying was correct, it did a much better job of actually
providing useful feedback and insight. Now, part of this, I would imagine, is explained by the
fact that since I use O3 so much, it has much better memory and context of the types of problems
I'm trying to work through. But one thing that I would look out for if you are trying to use
Grok 4 for any sort of specific business strategy type of use is to prompt it to really share
its own thoughts, not just assume that whatever you're feeding it is correct. Still, it performed well
enough that for at least the next week or so, I'm going to be running all of my prompts and
conversations against both O3 and Grok 4 to see how the performance is over time. Now, at this
point, we should talk about Grok 4 Heavy. Alongside the Grok 4 announcement, xAI announced a new
$300 a month plan, which would be the only way to access Grok 4 Heavy. And if you go back to
those benchmarks, you saw that some of the highest outperformance was from Grok 4 Heavy.
What's interesting is that the way that Grok 4 Heavy works is that basically they spin up a bunch
of agents that do the same task in parallel. They then compare their work and figure out the
best answer based on that. Now on the downside, this is by definition a lot more thinking,
which means a lot more tokens being used, which means it's a lot more expensive, but it also is
producing significantly better results in many cases. Enough so that I think that we might see
this architecture start to become more common. Pietro Schirano, for example, tweeted,
"By the way, you can basically make the Grok Heavy version of any model by having multiple
agents running tools in parallel, then checking notes together and deciding which one is the best
answer.
I may release an open source project for that."
And yes, that's cool, but I also think that if those gains are real, you're likely to see
that as a native modality for a lot of these different models.
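For anyone who wants to picture the parallel-agents pattern described above, here is a minimal Python sketch. It is not how xAI implements Grok 4 Heavy, which was not detailed in the episode. The `run_heavy` function and the three stand-in agents are hypothetical, and a simple majority vote is assumed as the way the agents "compare notes." Real agents would call a model with tools rather than return canned strings.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_heavy(agents: List[Callable[[str], str]], task: str) -> str:
    """Run every agent on the same task in parallel and return the most common answer."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # pool.map preserves the order of `agents` in the results.
        answers = list(pool.map(lambda agent: agent(task), agents))
    # Assumption: majority vote stands in for "comparing notes".
    # On a tie, the answer from the earliest agent in the list wins.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in agents that return fixed answers for illustration only.
def agent_a(task: str) -> str:
    return "42"


def agent_b(task: str) -> str:
    return "42"


def agent_c(task: str) -> str:
    return "41"


if __name__ == "__main__":
    print(run_heavy([agent_a, agent_b, agent_c], "What is 6 * 7?"))
```

The cost trade-off mentioned above shows up directly here: every extra agent is another full run of the task, so token usage grows roughly in proportion to the number of agents.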
What about all of the alignment challenges that Grok 3 has faced over the last week?
Has Grok 4 solved those?
Right now, there's so much noise about this that it's very hard to piece through.
You've got a lot of screenshots of Grok 4 being seemingly anti-Semitic floating around.
For now, I'm going to reserve judgment until we have a few more reps on this, but it's obviously
something to keep an eye on.
For many, the exciting thing about Grok is what it heralds next.
Ethan Mollick writes, "I suspect the next few weeks after Grok 4 follows the same pattern as Grok 3.
xAI beats everyone to market with the first ronnaFLOP model.
The benchmarks show the 10 to 20% improvements the scaling law suggests.
In the coming months, the other labs release their ronnaFLOPs and catch up."
For context, he added, "ronnaFLOPs equal 10 to the 27th FLOPs, floating point operations, a measure of
computing power. This is the compute that went into Grok 4, and by comparison, GPT-4 was likely
around 18 yottaFLOPs, 100x smaller, i.e. scaling improves ability." Elvis, meanwhile,
writes, "Surely Gemini 3 and GPT-5 must surpass Grok 4. Are you prepared for what's coming
in the next six months? Better coding models, longer video generation, and to top it all,
multimodal agents are coming. Breakthroughs of all kinds are imminent. Best time to
be a builder." And whether you ultimately decide Grok 4 is the best model in practice or not,
Elvis's statement here is pretty undeniably true. Things that fill us with wonder now will be
commonplace before you know it, and the world gets remade again. That's going to do it for today's
AI Daily Brief. Get out there and start testing your new toy. Let me know how it goes. And until next time,
peace.
