The AI Daily Brief: Artificial Intelligence News and Analysis - A Promising Alternative Way to Improve LLM Performance

Starting point is 00:00:00 Today on the AI Daily Brief, a new MIT paper about test time training, which has been a big part of the recent discussion of the limitations of LLM scaling. And before that in the headlines, ChatGPT can now read from apps on Mac. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. We kick off today with an update from OpenAI where ChatGPT can now read

Starting point is 00:00:33 directly from certain other apps. ChatGPT's desktop app for macOS can now read from a handful of leading developer-focused coding applications, including VS Code, Xcode, TextEdit, Terminal, and iTerm2. What this means practically is that developers no longer need to copy and paste their code into ChatGPT while using it as a coding copilot. Instead, when this new Work with Apps feature is enabled, the section of code you're working on will be automatically sent as context alongside your prompt. ChatGPT won't be able to write code directly into a developer app like Cursor or

Starting point is 00:01:03 GitHub Copilot can, but the feature is more about building a test case for more general applications of this capability. OpenAI said the ability to understand other apps is a key building block towards creating agentic systems. The feature is drawing from Apple's accessibility API to read and translate the screen. This means the technique will only work with text-based apps. However, it also avoids using visual-based inputs, which are prohibitively expensive for heavy use. Still, the feature sends up to 200 lines of code as context, so it's going to be using a lot of tokens. It's unclear at this stage how OpenAI plans to make the feature compatible with apps that don't work with Apple's screen reader. We've seen Anthropic go all the way in the other

Starting point is 00:01:37 direction using an approach that takes constant screenshots for context rather than relying on APIs. OpenAI desktop product lead Alexander Embiricos said, this isn't meant to be an agent. It's a way to collaborate with coding tools to start, and there will be more tools coming soon. On the side of agents, I think this is a really key building block. This idea that ChatGPT understands or can work with all the content that you have, so it can help with it. The feature is already available to Plus and Team users and rolling out to Enterprise and Education tiers in the next few weeks. Next up, the latest from Elon and xAI. We had heard previously that they were raising up to

Starting point is 00:02:10 $6 billion, but we've gotten some new details. The latest report suggests that that's happening at a $50 billion valuation and could close as early as next week. CNBC suggests that it's going to be a combination of $5 billion from sovereign funds in the Middle East and $1 billion from other investors. Now, of course, most of this is going to end up in Jensen Huang's pocket, because the money is going to be used to acquire 100,000 Nvidia chips, according to CNBC's sources. We'll keep an eye out to see if this deal actually closes. Speaking of data, consulting firm Gartner has warned that the AI

Starting point is 00:02:41 energy crisis could arrive as soon as 2027. In a new report, the firm said that power shortages would restrict 40% of AI data centers within a few years. Bob Johnson, VP analyst at Gartner, said, the explosive growth of new hyperscale data centers to implement GenAI is creating an insatiable demand for power that will exceed the ability of utility providers to expand their capacity fast enough. In turn, this threatens to disrupt energy availability and lead to shortages, which will limit the growth of new data centers for GenAI and other uses from 2026. Gartner said that new servers last year required 195 terawatt hours of electricity, which is as much as 18 million U.S. households. By 2027, they believe that just the new facilities will demand 500 terawatt hours. And this, my friends,

Starting point is 00:03:21 is why, of course, all of the big AI labs are so focused on energy and energy solutions. Over in the world of staffing moves, prominent AI developer François Chollet is leaving Google after close to a decade at the company. He is credited with creating Keras, a high-level open-source API for creating AI models for tackling machine learning tasks. The platform boasts over 2 million users and underpins several high-profile products, including Waymo's self-driving algorithm, as well as the recommendation engines for YouTube, Netflix, and Spotify.

Starting point is 00:03:49 In a post on X, he said, I'm very grateful for my decade at Google. In that time span, deep learning went from a niche academic topic to a massive industry employing millions. Keras went from a small library used by a few thousand enthusiasts to a state-of-the-art framework used by two million developers. Chollet says he plans to, quote, go start a new company with a friend, but didn't give any further details. Aside from Keras, Chollet published the Abstraction and Reasoning Corpus for AGI in 2019. The ARC-AGI benchmark, which by the way features prominently in today's main episode, measures the ability of AI systems to solve novel reasoning problems and is viewed as one of the most recognizable signposts that a model has achieved

Starting point is 00:04:23 true AGI. This year, in collaboration with others, he launched the ARC Prize, awarding $1 million to the first team to achieve 85% on this benchmark. The prize remains unwon with the closest score coming in at 42%. Chollet has also taken a firm view on the scaling issue that has recently returned to prominence. He has often argued that the current approach of feeding ever more data and compute resources to train models is unlikely to achieve an AI that's as smart as humans. Instead, he believes that methods that involve reasoning are more likely to yield results. Now that, I think, is a perfect segue into what our main discussion is going to be today, which is a new paper out of MIT that puts a little more juice around this idea of new strategies like test time compute. That's going to do it for today's AI Daily Brief Headlines edition. Next up, the main episode.

Starting point is 00:05:09 Today's episode is brought to you by Vanta. Whether you're starting or scaling your company's security program, demonstrating top-notch security practices and establishing trust is more important than ever. Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and NIST AI Risk Management Framework, saving you time and money while helping you build customer trust. Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center, all powered by Vanta AI. Over 8,000 global companies like LangChain, Lila AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time. Learn more at Vanta.com slash NLW.

Starting point is 00:05:49 That's Vanta.com slash NLW. Today's episode is brought to you, as always, by Superintelligent. Have you ever wanted an AI Daily Brief, but totally focused on how AI relates to your company? Is your company struggling with AI adoption, either because you're getting stalled figuring out what use cases will drive value, or because the AI transformation that is happening is siloed at individual teams, departments, and employees, and not able to change the company as a whole? Superintelligent has developed a new custom internal podcast product that inspires your teams by sharing the best AI use cases from inside and outside your

Starting point is 00:06:26 company. Think of it as an AI Daily Brief, but just for your company's AI use cases. If you'd like to learn more, go to besuper.ai slash partner and fill out the information request form. I am really excited about this product, so I will personally get right back to you. Again, that's besuper.ai slash partner. Welcome back to the AI Daily Brief. Today we're doing something a little bit different, and I think it's going to be pretty fun. One of the big points of conversation for the last week or two has been this question of whether labs are hitting some limits with their previous approach to scaling

Starting point is 00:06:57 LLMs. Basically, reports are coming out that the next version of Gemini, as well as the next version of ChatGPT, as well as OpenAI's next GPT model, just don't reflect as big a jump as previous state-of-the-art leaps represented. If you want more background on that, you can go listen to a couple shows from earlier this week that are all about those limitations. However, the important thing to note here is that it's not that AI can't get any more performant from here. It's about what techniques and strategies are needed to actually

Starting point is 00:07:22 make the next leaps. One of the strategies that seems really promising is something called test time compute. This is a part of what's been built into the o1 reasoning model that OpenAI has released, and a new paper this week from MIT is called The Surprising Effectiveness of Test-Time Training for Abstract Reasoning. Now, the challenge when any of these papers are released is that they are extremely dense and extremely technical. And yet, we have ourselves some assets to better understand this. And so from here, I'm going to turn the AI Daily Brief over to Google's NotebookLM, where I've worked with it to create a conversational podcast about this paper. I'm not looking to put myself out of a job here, but I think you'll agree that this is a

Starting point is 00:08:00 really powerful use of this new Google capability that makes taking really abstract and dense information like what's in this paper a lot more accessible. So with that, I'm going to turn it over to NotebookLM. Appreciate you guys listening as always. And here once again is a discussion of The Surprising Effectiveness of Test-Time Training. You know, it feels like for a while there, AI was just advancing so fast, especially with those large language models from companies like OpenAI and Google.

Starting point is 00:08:26 It seemed like every few months, bam. Some new mind-blowing capability would just drop. But lately, I don't know, there's this sense that maybe that initial rush of progress is kind of starting to slow down. Like maybe just making these models bigger isn't the whole story. Yeah, what's interesting is there are signs

Starting point is 00:08:43 that this bigger is better approach to AI might be hitting some limits. We saw that with OpenAI's releases, you know, GPT-4 being the big recent one. And even with Google's Gemini. Right. And you'd think that with each new version, these jumps in capability would just get bigger and bigger. But I read something really interesting that said that while GPT-4 was definitely a huge leap

Starting point is 00:09:01 over GPT-3, the improvements from GPT-4 to its successor Orion are actually much smaller. It's like maybe they're bumping up against some kind of fundamental barrier. Yeah, that's the question a lot of people are asking, right? If just throwing more data and computing power at the problem isn't the answer, then what is? And that's why I thought this new research coming out of MIT was so interesting. They're exploring a completely different approach to making AI smarter, something they call test time training. Okay, test time training. That sounds kind of counterintuitive.

Starting point is 00:09:32 Isn't the whole point of training in AI to get it ready to, you know, perform, to actually use it. So what's the idea here? So think of it this way. Imagine you have a big exam coming up. You've studied the material. You've got a good foundation. But then you do a few practice problems right before the test. Just to really sharpen your skills and focus on the specific types of questions you're likely to see.

Starting point is 00:09:53 That's kind of what test time training or TTT does for AI. So instead of just relying on that initial training, they're giving the AI a little extra boost right before it has to tackle the specific task. Exactly. And the MIT researchers applied this idea to a particularly challenging set of problems. It's called ARC, the Abstraction and Reasoning Corpus. It's basically a collection of visual puzzles that are meant to test an AI's ability to solve problems it's never seen before, to really stretch its capacity for abstract reasoning. Right.

Starting point is 00:10:25 I took a look at some of those ARC puzzles and wow, they are not messing around. Talk about some mental gymnastics. Right. They involve things like pattern recognition and applying logical rules, even some spatial reasoning. They're designed to be tough, even for us humans. So how did the AI do with this test time training? Did those practice problems actually help?

Starting point is 00:10:42 Oh, you bet they did. The researchers found that by using TTT with a fairly modest-sized language model, one with about 8 billion parameters, they actually achieved a 25% improvement over the previous best results on ARC, which is a significant jump. But here's what's even more remarkable. By combining TTT with a hybrid approach that uses both neural networks and symbolic reasoning,

Starting point is 00:11:05 they actually managed to match average human performance on these puzzles. Hold on. They got an AI to perform as well as an average person on these really complex visual reasoning tasks, just by giving it like a little cram session beforehand. That's incredible. It is pretty mind-blowing, isn't it? And it really challenges some of our assumptions about how AI learns and adapts. But the real question is, what's the secret sauce? What makes test time training so effective, especially in this case? Well, the MIT researchers actually identified three key ingredients that seem to make TTT particularly potent. First, they found that it really helps if the AI model is initially trained on tasks that share some underlying structure with the target task. In this case, the ARC puzzles.

Starting point is 00:11:48 Oh, so it's like giving the AI a head start, a foundation of knowledge that it can then build on during the test time training. Exactly. That initial training kind of primes the model to learn quickly and efficiently during the test time phase. The second ingredient is what they call augmented task format and data. Basically, they very cleverly create new training data from the test input itself. So instead of just throwing the AI into the deep end with a totally novel problem, you're giving it a few practice laps in the pool first. Yeah, that's a great way to put it.

Starting point is 00:12:16 It's like providing the AI with a set of really tailored exercises that help it zero in on the specific patterns and strategies that are relevant to the task at hand. And then the third key ingredient is that instead of using one generic adapter for all the tasks, they actually train individual adapters for each specific puzzle. So they're essentially customizing the AI's thinking for each challenge, making it hyper-focused on that specific problem at hand. No wonder it performs so well.
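To make the second ingredient more concrete, here is a minimal sketch of one way new training data could be built from a puzzle's own example pairs. The discussion above only says that training data is created from the test input itself. The leave-one-out scheme, the flip and rotation transforms, and every name here (Grid, leave_one_out_tasks, and so on) are illustrative assumptions, not the paper's confirmed implementation. Training the per-puzzle adapter on these tasks is not shown.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]       # a puzzle grid of color codes (assumed representation)
Pair = Tuple[Grid, Grid]     # (input grid, output grid) demonstration pair


def identity(grid: Grid) -> Grid:
    """Return a copy of the grid unchanged."""
    return [list(row) for row in grid]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def rotate_90(grid: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def leave_one_out_tasks(
    demos: List[Pair],
    transforms: List[Callable[[Grid], Grid]],
) -> List[Dict[str, object]]:
    """Build practice tasks from one puzzle's demonstration pairs.

    For every transform, apply it to all pairs, then hold out each pair
    in turn as the question, keeping the remaining pairs as context.
    """
    tasks: List[Dict[str, object]] = []
    for transform in transforms:
        transformed = [(transform(x), transform(y)) for x, y in demos]
        for i, (query, target) in enumerate(transformed):
            context = transformed[:i] + transformed[i + 1:]
            tasks.append({"context": context, "query": query, "target": target})
    return tasks


# Hypothetical puzzle with three demonstration pairs (output = input flipped).
demos: List[Pair] = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[0, 0], [2, 0]], [[0, 0], [0, 2]]),
    ([[3, 0], [3, 0]], [[0, 3], [0, 3]]),
]
practice = leave_one_out_tasks(demos, [identity, flip_horizontal, rotate_90])
```

With three demonstration pairs and three transforms, this yields nine small practice tasks for a single puzzle. In the setup described above, a separate lightweight adapter would then be fine-tuned on each puzzle's own practice set before the model answers the real test question.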

Starting point is 00:12:43 Right. And what's even more interesting is that they found that TTT was most effective when used with models that hadn't been trained on synthetic data that was generated by another AI, which, you know, makes you wonder if there's something about that AI-generated data that's missing, some element of real-world complexity that's crucial for truly robust learning. It's like those AI-generated images you see. They're technically very impressive, but they often have these little glitches or inconsistencies that kind of give them away. Maybe there's a similar issue with AI-generated training data,

Starting point is 00:13:13 like it's not quite capturing the full richness and messiness of the real world. That's a really astute observation, and it definitely points to a potential area for future research. But let's zoom out for a second and think about the bigger picture here. What these findings suggest is that TTT could actually be a more efficient path to improving AI than just blindly scaling up models. Because it's not just about making AI bigger. It's about making it smarter and more adaptable.

Starting point is 00:13:39 And the potential applications of this are huge. Imagine if we could apply TTT to fields like scientific research, where AI could help scientists analyze really complex data sets and generate hypotheses in real time. Or in software development, where AI could help programmers debug code or design more efficient algorithms, adapting to the specific challenges of each project. Or even in education,

Starting point is 00:14:01 where personalized AI tutors could tailor their teaching methods to each student's individual learning style and pace, adapting on the fly to their strengths and weaknesses. Now you're talking. That's the kind of AI that could really transform lives. But I also have to wonder, if AI can learn and adapt this quickly, doesn't that raise some concerns about control and predictability? What's to stop it from learning things we don't want it to learn, or behaving in ways that

Starting point is 00:14:27 we didn't anticipate? That's a totally valid concern, and it's one that researchers are very actively exploring. One of the key challenges with any really powerful technology is ensuring that it's used safely and responsibly. With TTT, it's about striking a balance between adaptability and robustness, making sure that the AI remains aligned with our goals and values, even as it learns and evolves. It's a delicate dance, for sure. But the potential benefits are so compelling that it seems worth exploring further. What excites you most about the possibilities of TTT?

Starting point is 00:14:58 You know, for me, it's the idea that we might be on the verge of a truly transformative shift in how we design and use AI. We've been so focused on building these massive, all-knowing models, but maybe that's not the only way, or even the best way. Perhaps the future of AI lies in smaller, more specialized systems that can learn and adapt to specific tasks and contexts, systems that are more like partners or collaborators than just tools. I like that. AI as a partner working alongside us to solve these complex problems and unlock new possibilities. But I'm also curious, what are some of the limitations of TTT? Is it a silver bullet? Or are there certain types of problems or situations where it's not as effective?

Starting point is 00:15:37 It's fascinating to think about all the ways this could change how we interact with AI. You know, for a while, it felt like AI was this distant force, something that was happening in research labs or powering big tech platforms, but TTT brings it closer, makes it more personal, more tailored to our individual needs. Yeah, I totally agree. If you think about the evolution of technology, it often starts with these big centralized systems, and then it gradually becomes more distributed, more accessible, more integrated into our daily lives. And I think TTT could be a catalyst for that kind of shift in the world of

Starting point is 00:16:06 AI. Like moving from mainframe computers to personal computers. And now to smartphones and wearable devices. What was once so exclusive and specialized becomes ubiquitous and personalized. Exactly. And as AI becomes more integrated into our lives, the ability for it to learn and adapt on the fly is going to be essential. We're going to need AI systems that can understand the nuances of our individual preferences, our work styles, our learning patterns. Can you give me a concrete example of what that might look like? Sure, imagine a world where your smartphone isn't just a device. It's a true AI companion that learns from your interactions and anticipates your needs, helps you navigate your day with incredible efficiency. You're working on a complex

Starting point is 00:16:45 project and your AI assistant proactively gathers relevant information, suggests potential solutions, and even helps you draft emails or presentations, all tailored to your specific style and the context of the project. That's a pretty compelling vision. It's not just about automation, it's about augmentation, about AI amplifying our capabilities and helping us reach our full potential. And TTT could be a really crucial part of making that vision a reality. By allowing AI systems to specialize on the fly, to adapt to the specific challenges and opportunities of each moment, we can create a future where AI is not just powerful, but also truly useful and truly human-centered. So as we wrap up this deep dive into test time training, I want to leave you the listener with

Starting point is 00:17:27 this thought. What areas of your life could benefit from this kind of AI? What tasks or challenges could you delegate or collaborate on with an AI partner that can learn and adapt as quickly as you can? The future of AI is being written right now, and technologies like TTT are giving us a glimpse of what's possible. It's up to all of us to imagine and shape that future, to ensure that AI is used to empower and uplift humanity and not to replace or diminish us. Thanks for joining us on this exploration of test time training. We hope this deep dive has given you a new perspective on the evolving landscape of AI and sparked your curiosity about the incredible possibilities that lie ahead.

Starting point is 00:18:06 Until next time, keep learning, keep questioning, and keep pushing the boundaries of what's possible.
