The AI Daily Brief: Artificial Intelligence News and Analysis - Just How Good is Grok-3?

Episode Date: February 18, 2025

xAI’s Grok-3 has arrived, but does it meet expectations? Elon Musk’s AI venture asserts that its newest model competes with GPT-4 and DeepSeek, introducing features such as Big Brain Mode and Deep... Search. While benchmarks indicate encouraging performance, is this truly revolutionary or merely incremental? Additionally, OpenAI has turned down Musk’s $97 billion acquisition offer, while xAI looks to raise $10 billion in fresh capital.

Brought to you by:
KPMG – Go to www.kpmg.us/ai to learn more about how KPMG can help you drive value with our AI solutions.
Vanta - Simplify compliance - https://vanta.com/nlw
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, Grok 3 kicks off what appears to be the beginning of model update season. Before that in the headlines, Perplexity launches their own version of deep research. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. OpenAI's Deep Research is one of the more exciting products that many people have gotten access to recently. In fact, if you go on Twitter or X, you can find people saying it's the most impressive product they've seen in years. It is, however, behind an extremely expensive paywall. Right now, the only people who have access to Deep Research are those who are paying for OpenAI's $200 Pro tier.
Starting point is 00:00:44 And then in comes Perplexity with their own version of deep research, with the same name, in fact, suggesting that they're trying to make this just a category of AI usage like chatbot, and it absolutely obliterates the OpenAI price point. Free users have five queries per day, while Pro users get up to 500 daily queries and have access to faster speeds. When asked how the company could offer this tool at this price, CEO Aravind Srinivas said, thankful for open source, we're going to keep making this faster and cheaper. Knowledge should be universally accessible and useful,
Starting point is 00:01:14 not kept behind obscenely expensive subscription plans that benefit the corporates, but not in the interest of humanity. So yes, if you're wondering, Sam Altman is now being shanked from below and from above, given the aggressiveness of that particular positioning. Perplexity's deep research works very similarly to how rival tools work, using a combination of agentic web search and iterative reasoning to generate in-depth research reports. They share a bunch of benchmarks, but honestly, I think for this type of product, everything is about how it actually performs. And for that, you're just going to have to go check it out yourself,
Starting point is 00:01:43 which, thankfully, you can, given that they offer even free users some number of queries each day. One user asked Perplexity to compare itself to rival deep research features, ultimately producing a multi-page analysis that summarized, Perplexity AI excels in speed and accessibility for casual researchers. OpenAI dominates in analytical depth for enterprise applications. Google integrates most seamlessly with existing productivity ecosystems, which honestly seems like a fairly decent write-up and summarization. Now, if you go cruise around the internet,
Starting point is 00:02:11 you can find people who are saying that Perplexity's version of the tool is every bit as good or even better than OpenAI's, but you also have a lot of sentiment like this one from Siqi Chen, who writes, until you have access to full o3 or Claude 4 or something, you simply are not going to build a better deep research than OpenAI. This is a use case where the raw model reasoning capability matters a lot. Still, from a consumer perspective, obviously more options is a good thing, and so I'm glad to see some competition in this space. Next up, an update on Ilya Sutskever, the former OpenAI co-founder,
Starting point is 00:02:41 who is back out fundraising for his new company Safe Superintelligence. Previous reports had Ilya raising about a billion dollars at a $20 billion valuation, and it seems like that is now up to a $30 billion plus valuation. Bloomberg reports that Greenoaks Capital Partners will lead the round and plans to invest about half of it. And we still have no idea whether the valuation update from the original $5 billion reflects something new that Ilya has shown investors or is just the premium that the market feels it has to pay for any Ilya product. Now, while the startups like Perplexity race ahead, don't expect the next generation of AI-enabled home assistants anytime soon, as the big tech companies are struggling.
Starting point is 00:03:19 Both Alexa and Siri have hit another round of delays. People got excited recently when it was reported that there was an Alexa AI event, but at a last-minute go/no-go meeting last week, apparently Amazon's executives decided that no-go was the answer, and the Washington Post is now reporting that AI Alexa won't be ready until March 31st or later. The delay is reportedly due to Alexa giving inaccurate answers, which has been the scourge of this development cycle. Apple's AI Siri upgrade is also facing delays after plans were first unveiled all the way back last June at WWDC. Bloomberg reports that the project is facing engineering problems and software bugs, and that while Apple is, quote, racing to the finish line, some features planned for an
Starting point is 00:03:57 April rollout may be delayed until May or even later. One of the things that this highlights is that the margin of error and consumer forgiveness for AI hallucinations and incorrect answers when it comes to these sorts of smart home devices is basically zero. And the risk of finding yourself on the wrong end of some viral clip on social media is really high, making these particular product rollouts a real challenge. Lastly today, Meta is apparently planning a big investment in humanoid robots. The company will establish a new team within their Reality Labs hardware division, which is the group that has released the Meta Ray-Bans and the Meta Quest. The new plan is to develop Meta's hardware for humanoid robots designed to
Starting point is 00:04:34 complete household tasks, initially focusing on developing sensors to be used by third-party startups. In an internal leaked memo, Meta's CTO Andrew Bosworth said, the core technologies we've already invested in and built across Reality Labs and AI are complementary to developing the advancements needed for robotics. We believe that expanding our portfolio to invest in this field will only accrue value to Meta AI and our mixed and augmented reality programs. I think we're still a little premature, but you are going to see a lot more of the intersection of robotics and AI this year and in the years to come. For now, though, that is going to do it for today's AI Daily Brief Headlines edition. Next up, the main episode. Today's episode is brought to you
Starting point is 00:05:12 by Vanta. Trust isn't just earned, it's demanded. Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in. Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001. Centralize security workflows, complete questionnaires up to 5X faster, and proactively manage vendor risk. Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back, so you can
Starting point is 00:05:55 focus on building your company. Join over 9,000 global companies like Atlassian, Quora, and Factory, who use Vanta to manage risk and prove security in real time. For a limited time, this audience gets $1,000 off Vanta at vanta.com/nlw. That's V-A-N-T-A dot com slash NLW for $1,000 off. If there is one thing that's clear about AI in 2025, it's that the agents are coming. Vertical agents by industry, horizontal agent platforms, agents per function. If you are running a large enterprise, you will be experimenting with agents next year. And given how new this is, all of us are going to be back in pilot mode. That's why Superintelligent is offering a new product for the beginning of this year.
Starting point is 00:06:42 It's an agent readiness and opportunity audit. Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready, and to ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business. If you are interested in the agent readiness and opportunity audit, reach out directly to me, nlw@besuper.ai. Put the word agent in the subject line so I know what you're talking about, and let's have you be a leader in the most dynamic part of the AI market.
Starting point is 00:07:13 Hey listeners, want to supercharge your business with AI? In our fast-paced world, having a solid AI plan can make all the difference. Enabling organizations to create new value, grow, and stay ahead of the competition is what it's all about. KPMG is here to help you create an AI strategy that really works. Don't wait, now is the time to get ahead. Check out real stories from KPMG of how AI is driving success with its clients at www.kpmg.us/ai. Again, that's
Starting point is 00:07:41 www.kpmg.us/ai. Now, back to the show. Welcome back to the AI Daily Brief. Today we are digging into the always juicy topic of model competition. Specifically, Elon Musk's xAI has released their long-awaited
Starting point is 00:07:59 flagship model Grok 3. In fact, the launch unveiled a family of models built around the Grok 3 architecture. The flagship model competes against OpenAI's GPT-4o. But there's also a mini version that's designed for speed. The company will also release reasoning versions of the model in each size shortly. Users, for example, will be able to engage something called Big Brain Mode to add more
Starting point is 00:08:19 reasoning time for more difficult queries. And xAI also introduced a mode called DeepSearch. DeepSearch uses a form of rudimentary agent to search the web and Twitter/X posts to compile long-form reports, obviously now in a similar way to how Deep Research works with OpenAI. There's also a forthcoming voice mode, which will be rolled out in about a week, according to the announcements. Grok 3 is first available to Premium+ subscribers on X,
Starting point is 00:08:43 but M1 Astra and Apple Insider also claims that xAI will launch a Grok Pro tier at $30 a month or $300 per year. It seems like that subscription might be required to use advanced features like DeepSearch, voice mode, and Big Brain Mode. Now, as these new models come online, Elon announced that Grok 2 would be open-sourced in the coming months. He said, our general approach is that we will open-source the last version when the next version is fully out. When Grok 3 is mature and stable, which is probably within a few months,
Starting point is 00:09:10 then we'll open source Grok 2. Sam Altman has flagged that he's considered doing the same with OpenAI's older models as well, so maybe that becomes the new norm. Now, one of the reasons that Grok 3 has been highly anticipated is that it's the first model that's trained on a larger-scale data center. Last month, Elon claimed that the model was trained using 10 times the compute of Grok 2, which was achieved, of course, with the Colossus supercluster, the first training cluster capable of networking 100,000 Nvidia H100s. Grok 3 was therefore viewed as the first real test of whether pre-training scaling had hit a wall with the last generation of models. Now, of course, as is the case with every launch, people are poring over the benchmarks.
Starting point is 00:09:45 When it comes to math, science, and coding benchmarks, Grok 3 Mini achieved parity with Gemini 2.0 Pro and DeepSeek V3, and the full-size Grok model, and of course this is according to xAI itself, outperformed on each test by a noticeable margin. Important to note, this was only comparing leading non-reasoning models, with Grok 3 not putting up the same performance as OpenAI's o3-mini on these tests. For the reasoning models, both sizes of Grok 3 seem fairly competitive with o1 on low-inference settings and outperform o3-mini on high-inference settings. This would imply that the reasoning version of Grok 3 isn't on the same level as the full-size
Starting point is 00:10:18 o3. Given that we don't have access to either model at this stage, we don't know for sure. xAI noted that Grok 3 reasoning is still in beta and will have further post-training before its full release. There wasn't a huge boost from scaling pre-training, but the gains were there. Professor Ethan Mollick writes, based on the early stats, looks like Grok 3 base is going to be a very solid frontier model, suggesting pre-training scaling law continues with linear improvements to 10x compute. In essence, Grok 3 doesn't invalidate those scaling laws,
Starting point is 00:10:45 but it could also suggest that much, much larger training clusters are needed to see paradigm-changing improvements. One benchmark that many people took note of was Chatbot Arena, where users vote on which AI output they prefer. While the metric is inherently subjective, it gives a sense of how the models will perform in the market. Investor Gavin Baker writes, Grok 3 is the first model ever to score over 1,400 on Chatbot Arena and outperforms the best publicly available reasoning models from OpenAI and Google. xAI was founded 13 years after DeepMind and 8 years after OpenAI and is now ahead of both, the SR-71 Blackbird of AI labs. Baker did, of course, then note that he is a little biased as an xAI investor. AI Breakfast wrote, for everyday users, the Chatbot Arena is the only benchmark that matters.
Starting point is 00:11:26 Grok 3 is officially the best LLM. Given the speed at which xAI achieved this, they will only widen the gap over time. A more complete review comes from Andrej Karpathy. And although Karpathy was a co-founder at OpenAI, most people view his take as inherently unbiased given his lack of affiliation today and the general credibility that he has. He wrote a long review on X saying,
Starting point is 00:11:48 I was given early access to Grok 3 earlier today, making me, I think, one of the first few who could run a quick vibe check. He goes into a long review, sharing some of his tests around thinking, exploring the DeepSearch feature, trying a bunch of random LLM gotchas, and ultimately here's the conclusion he came to. He writes,
Starting point is 00:12:06 Grok 3 plus thinking feels somewhere around the state-of-the-art territory of OpenAI's strongest models, o1-pro at $200 a month, and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking, which is quite incredible, considering that the team started from scratch around a year ago. This time scale to state-of-the-art territory is unprecedented. Do also keep in mind the caveats. The models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days to weeks.
Starting point is 00:12:34 The early LM Arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum. Now, the larger context around the Grok 3 launch is the ongoing feud between Elon and Sam Altman, and indeed, it's very difficult to cover this. Elon especially is more divisive than he's ever been, and it is enormously difficult to find people who can separate whatever they think about Elon in general from their reviews of anything that he touches. Here's how Gary Marcus summed up where this leaves the competition, which I think is fairly reflective of what others think as well. He writes, one, Sam Altman can breathe easy for now. Two, no game changers, no major leap forward here. Hallucinations haven't been magically solved,
Starting point is 00:13:11 etc. Three, that said, OpenAI's moat keeps diminishing, so price wars will continue and profits will continue to be elusive for everyone except Nvidia. Four, pure pre-training scaling has clearly failed to produce AGI. OpenAI leaker Jimmy Apples writes, strong model, the main thing is the speed with which they caught up. I think it lives up to expectation, strong offering, good dollar value. He then prodded Sam Altman to release 4.5, which we know is coming soon. Earlier in the day, when someone had told him to release 4.5 the same day to steal the show, Altman wrote, that wouldn't be very nice...
Starting point is 00:13:43 To me, one of the things that really stands out is just how absolutely saturated these benchmarks are and how little I find myself compelled by them when a new model comes out. Ethan Mollick again got at this, writing, another thing Grok 3 highlights is the urgent need for better batteries of tests and independent testing authorities. Public benchmarks are both meh and saturated, leaving a lot of AI testing to be like food reviews based on taste. If AI is critical to work, we need more.
Starting point is 00:14:09 He continues, GPQA Diamond, MMLU, and ARC-AGI look nothing like actual work. He also adds, and this is something I completely agree with, I'm surprised no large IT consulting or even national standards agency hasn't stepped in with large-scale batteries of private tests, especially given the hundreds of billions of dollars being invested. This is a hugely salient point. Ultimately, it doesn't matter for the vast majority of users how the models do on these benchmarks. It matters how they perform in real work settings. And speaking of OpenAI and Elon's fight, the OpenAI board has now formally rejected Elon's $97 billion bid to take over the nonprofit. In a unanimous vote,
Starting point is 00:14:44 the board decided that the takeover was, quote, not in the best interest of OpenAI's mission. A statement from Chairman Bret Taylor said, OpenAI is not for sale and the board has unanimously rejected Mr. Musk's latest attempt to disrupt his competition. Any potential reorganization of OpenAI will strengthen our nonprofit and its mission to ensure AGI benefits all of humanity. OpenAI lawyers have insisted that Musk's bid doesn't set the price for the nonprofit, which will need to be paid during the conversion to a for-profit company. Separately, the Financial Times reports that the company is considering granting special voting rights to the nonprofit board in an attempt to ensure that they aren't a target for a hostile takeover from Musk following the for-profit conversion.
Starting point is 00:15:20 Meanwhile, xAI itself is heading back to the well for another funding round. Bloomberg reports that the company is seeking to raise $10 billion at a $75 billion valuation, with sources saying that existing investors, including Sequoia, Andreessen Horowitz, and Valor Equity Partners, are all participating in the talks, which are still at an early stage. A significant portion of the new funding seems as though it would pay for upgraded chips at xAI's data centers. On Friday, Bloomberg reported that the company was close to closing a $5 billion deal with Dell to provide servers powered by Nvidia's Blackwell GB200 chips.
Starting point is 00:15:50 So ultimately, friends, where we are is that the proof is going to be in the pudding. A lot of folks over the next few weeks will be testing Grok 3 and seeing how it compares to the latest ChatGPT and Claude models. But it also feels to me like this is the beginning of model update season, not the end, with both Anthropic and OpenAI promising new models coming soon. So we could have a lot of new developments in the near future, which is obviously nothing but good for all of us users. For now though, that is going to do it for today's AI Daily Brief. Until next time, peace.
