The AI Daily Brief: Artificial Intelligence News and Analysis - A Promising Alternative Way to Improve LLM Performance
Episode Date: November 16, 2024
Researchers at MIT share encouraging results around test-time training. NLW is joined by Google's NotebookLM to tell the story. Brought to you by: Vanta - Simplify compliance - vanta.com/nlw
The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614 Subscribe to the newsletter: https://aidailybrief.beehiiv.com/ Join our Discord: https://bit.ly/aibreakdown
Transcript
Today on the AI Daily Brief, a new MIT paper about test time training,
which has been a big part of the recent discussion of the limitations of LLM scaling.
And before that in the headlines, ChatGPT can now read from apps on Mac.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
To join the conversation, follow the Discord link in our show notes.
Welcome back to the AI Daily Brief Headlines edition,
all the daily AI news you need in around five minutes.
We kick off today with an update from OpenAI where ChatGPT can now read
directly from certain other apps.
ChatGPT's desktop app for macOS can now read from a handful of leading developer-focused
coding applications, including VS Code, Xcode, TextEdit, Terminal, and iTerm2.
What this means practically is that developers no longer need to copy and paste their code into
ChatGPT while using it as a coding copilot.
Instead, when this new Work with Apps feature is enabled, the section of code you're working on
will be automatically sent as context alongside your prompt.
ChatGPT won't be able to write code directly into a developer app like Cursor or
GitHub Copilot can, but the feature is more about building a test case for more general applications
of this capability. OpenAI said the ability to understand other apps is a key building block
towards creating agentic systems. The feature is drawing from Apple's accessibility API to read
and translate the screen. This means the technique will only work with text-based apps.
However, it also avoids using visual-based inputs, which are prohibitively expensive for heavy
use. Still, the feature sends up to 200 lines of code as context, so it's going to be using a lot
of tokens. It's unclear at this stage how OpenAI plans to make the feature compatible with
apps that don't work with Apple's screen reader. We've seen Anthropic go all the way in the other
direction using an approach that takes constant screenshots for context rather than relying on APIs.
OpenAI desktop product lead Alexander Embiricos said, this isn't meant to be an agent. It's a way
to collaborate with coding tools to start, and there will be more tools coming soon. On the side
of agents, I think this is a really key building block. This idea that ChatGPT understands or can
work with all the content that you have, so it can help with it. The feature is
already available to Plus and Teams users and rolling out to Enterprise and Education tiers in the
next few weeks.
Next up, the latest from Elon and xAI. We had heard previously that they were raising up to
$6 billion, but we've gotten some new details.
The latest report suggests that that's happening at a $50 billion valuation and could
close as early as next week.
CNBC suggests that it's going to be a combination of $5 billion from sovereign funds in the
Middle East and $1 billion from other investors.
Now, of course, most of this is going to end up in Jensen Huang's pocket, because the money
is going to be used to acquire 100,000 Nvidia chips, according to CNBC's sources. We'll keep an eye out
to see if this deal actually closes. Speaking of data, consulting firm Gartner has warned that the AI
energy crisis could arrive as soon as 2027. In a new report, the firm said that power shortages
would restrict 40% of AI data centers within a few years. Bob Johnson, VP analyst at Gartner,
said, the explosive growth of new hyperscale data centers to implement GenAI is creating an insatiable
demand for power that will exceed the ability of utility providers to expand their capacity fast
enough. In turn, this threatens to disrupt energy availability and lead to shortages, which will limit
the growth of new data centers for Gen AI and other uses from 2026. Gartner said that new servers
last year required 195 terawatt hours of electricity, which is as much as 18 million U.S. households.
By 2027, they believe that just the new facilities will demand 500 terawatt hours. And this, my friends,
is why, of course, all of the big AI labs are so focused on energy and energy solutions.
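As a quick sanity check on the figures quoted above, the arithmetic can be worked through in a few lines. This is only a sketch that takes the episode's numbers at face value (195 terawatt hours, 18 million households, 500 terawatt hours); the variable names are illustrative, and the household equivalent for 2027 is derived here rather than stated by Gartner.

```python
# Figures as quoted in the episode (attributed to Gartner).
TWH_LAST_YEAR = 195            # terawatt hours for new servers last year
HOUSEHOLDS_EQUIVALENT = 18e6   # U.S. households said to use the same amount
TWH_2027 = 500                 # projected demand from new facilities by 2027

KWH_PER_TWH = 1e9              # 1 TWh = 1,000,000,000 kWh

# Annual consumption per household implied by the first comparison.
kwh_per_household = TWH_LAST_YEAR * KWH_PER_TWH / HOUSEHOLDS_EQUIVALENT

# Derived, not quoted: households the 2027 figure would match at that rate.
households_2027 = TWH_2027 * KWH_PER_TWH / kwh_per_household

# Growth implied between the two quoted figures.
growth_factor = TWH_2027 / TWH_LAST_YEAR

print(f"Implied use per household: {kwh_per_household:,.0f} kWh per year")
print(f"2027 figure in household terms: {households_2027 / 1e6:.1f} million")
print(f"Growth factor: {growth_factor:.2f}x")
```

The implied per-household figure lands near 10,800 kWh per year, so the two quoted numbers are at least consistent with each other.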
Over in the world of staffing moves, prominent AI developer François Chollet
is leaving Google after close to a decade at the company.
He is credited for creating Keras, a high-level open-source API for creating AI models
for tackling machine learning tasks.
The platform boasts over 2 million users and underpins several high-profile products,
including Waymo's self-driving algorithm, as well as the recommendation engines for YouTube,
Netflix, and Spotify.
In a post on X, he said, I'm very grateful for my decade at Google.
In that time span, deep learning went from a niche academic topic to a massive industry
employing millions. Keras went from a small library used by a few thousand enthusiasts to a state-of-the-art
framework used by two million developers. Chollet says he plans to, quote, go start a new company
with a friend, but didn't give any further details. Aside from Keras, Chollet published the
abstraction and reasoning corpus for AGI in 2019. The ARC AGI benchmark, which by the way
features prominently in today's main episode, measures the ability of AI systems to solve novel
reasoning problems and is viewed as one of the most recognizable signposts that a model has achieved
true AGI. This year in collaboration with others, he launched the ARC Prize, awarding $1 million
to the first team to achieve 85% on this benchmark. The prize remains unwon with the closest
score coming in at 42%. Chollet has also taken a firm view on the scaling issue that has recently
returned to prominence. He has often argued that the current approach to feeding ever more data
and compute resources to train models is unlikely to achieve an AI that's as smart as humans.
Instead, he believes that methods that involve reasoning are more likely to yield results.
Now that, I think, is a perfect segue into what our main discussion is going to be today, which is a new paper out of MIT, that puts a little more juice around this idea of new strategies like test time compute.
That's going to do it for today's AI Daily Brief Headlines edition. Next up, the main episode.
Today's episode is brought to you by Vanta. Whether you're starting or scaling your company's security program, demonstrating top-notch security practices and establishing trust is more important than ever. Vanta automates compliance for ISO 27001,
SOC 2, GDPR, and leading AI frameworks like ISO 42001 and NIST AI Risk Management
Framework, saving you time and money while helping you build customer trust.
Plus, you can streamline security reviews by automating questionnaires and demonstrating
your security posture with a customer-facing trust center, all powered by Vanta AI.
Over 8,000 global companies like LangChain, Leela AI, and Factory AI use Vanta to demonstrate
AI trust and prove security in real time.
Learn more at Vanta.com slash NLW.
That's Vanta.com slash NLW.
Today's episode is brought to you, as always, by Superintelligent.
Have you ever wanted an AI Daily Brief, but totally focused on how AI relates to your
company? Is your company struggling with AI adoption, either because you're getting
stalled, figuring out what use cases will drive value, or because the AI transformation
that is happening is siloed at individual teams, departments, and employees, and not able to
change the company as a whole? Superintelligent has developed a new custom internal podcast
product that inspires your teams by sharing the best AI use cases from inside and outside your
company. Think of it as an AI Daily Brief, but just for your company's AI use cases.
If you'd like to learn more, go to besuper.ai slash partner and fill out the information request form.
I am really excited about this product, so I will personally get right back to you. Again, that's
besuper.ai slash partner.
Welcome back to the AI Daily Brief. Today we're doing something a little bit different,
and I think it's going to be pretty fun.
One of the big points of conversation for the last week or two has been
this question of whether labs are hitting some limits with their previous approach to scaling
LLMs.
Basically, reports are coming out that the next version of Gemini, as well as the next
version of ChatGPT, as well as OpenAI's next GPT model, just don't reflect as big a jump
as previous state-of-the-art leaps represented.
If you want more background on that, you can go listen to a couple shows from earlier
this week that are all about those limitations.
However, the important thing to note here is that it's not that AI can't
get any more performant from here, it's about what techniques and strategies are needed to actually
make the next leaps. One of the strategies that seems really promising is something called
test time compute. This is a part of what's been built into the o1 reasoning model that
OpenAI has released, and a new paper this week from MIT is called The Surprising Effectiveness
of Test-Time Training for Abstract Reasoning. Now, the challenge when any of these papers are
released is that they are extremely dense and extremely technical. And yet, we have ourselves
some assets to better understand this. And so from here, I'm going to turn the AI Daily Brief
over to Google's NotebookLM, where I've worked with it to create a conversational podcast about this
paper. I'm not looking to put myself out of a job here, but I think you'll agree that this is a
really powerful use of this new Google capability that makes taking really abstract and dense
information like what's in this paper a lot more accessible. So with that, I'm going to turn it over
to NotebookLM. Appreciate you guys listening as always. And here once again is a discussion
of the surprising effectiveness of test time training.
You know, it feels like for a while there,
AI was just advancing so fast,
especially with those large language models
from companies like OpenAI and Google.
It seemed like every few months, bam.
Some new mind-blowing capability would just drop.
But lately, I don't know,
there's this sense that maybe that initial rush of progress
is kind of starting to slow down.
Like maybe just making these models bigger
isn't the whole story.
Yeah, what's interesting is there are signs
that this bigger is better approach to AI
might be hitting some limits.
We saw that with OpenAI's releases, you know, GPT-4 being the big recent one.
And even with Google's Gemini.
Right.
And you'd think that with each new version, these jumps in capability would just get
bigger and bigger.
But I read something really interesting that said that while GPT-4 was definitely a huge leap
over GPT-3, the improvements from GPT-4 to its successor Orion are actually much smaller.
It's like maybe they're bumping up against some kind of fundamental barrier.
Yeah, that's the question a lot of people are asking, right?
If just throwing more data and computing power at the problem isn't the answer, then what is?
And that's why I thought this new research coming out of MIT was so interesting.
They're exploring a completely different approach to making AI smarter, something they call test time training.
Okay, test time training.
That sounds kind of counterintuitive.
Isn't the whole point of training in AI to get it ready to, you know, perform, to actually use it?
So what's the idea here?
So think of it this way.
Imagine you have a big exam coming up.
You've studied the material.
You've got a good foundation.
But then you do a few practice problems right before the test.
Just to really sharpen your skills and focus on the specific types of questions you're likely to see.
That's kind of what test time training or TTT does for AI.
So instead of just relying on that initial training, they're giving the AI a little extra boost right before it has to tackle the specific task.
Exactly.
And the MIT researchers applied this idea to a particularly challenging set of problems.
It's called the ARC, the abstraction and reasoning corpus.
It's basically a collection of visual puzzles that are meant to test an AI's ability to solve problems it's never seen before,
to really stretch its capacity for abstract reasoning.
Right.
I took a look at some of those ARC puzzles and wow, they are not messing around.
Talk about some mental gymnastics.
Right.
They involve things like pattern recognition and applying logical rules, even some spatial reasoning.
They're designed to be tough, even for us humans.
So how did the AI do
with this test time training?
Did those practice problems actually help?
Oh, you bet they did.
The researchers found that by using TTT with a fairly modest-sized language model,
one with about 8 billion parameters,
they actually achieved a 25% improvement over the previous best results on ARC,
which is a significant jump.
But here's what's even more remarkable.
By combining TTT with a hybrid approach that uses both neural networks
and symbolic reasoning,
they actually managed to match average human performance on these puzzles.
Hold on. They got an AI to perform as well as an average person on these really complex visual reasoning tasks,
just by giving it like a little cram session beforehand. That's incredible.
It is pretty mind-blowing, isn't it? And it really challenges some of our assumptions about how AI learns and adapts.
But the real question is, what's the secret sauce? What makes test time training so effective?
Especially in this case. Well, the MIT researchers actually identified three key ingredients that seem to make TTT particularly potent.
First, they found that it really helps if the AI model is initially trained on tasks that share some underlying structure with the target task.
In this case, the ARC puzzles.
Oh, so it's like giving the AI a head start, a foundation of knowledge that it can then build on during the test time training.
Exactly.
That initial training kind of primes the model to learn quickly and efficiently during the test time phase.
The second ingredient is what they call augmented task format and data.
Basically, they very cleverly create new training data from the test input itself.
So instead of just throwing the AI into the deep end with a totally novel problem,
you're giving it a few practice laps in the pool first.
Yeah, that's a great way to put it.
It's like providing the AI with a set of really tailored exercises
that help it zero in on the specific patterns and strategies
that are relevant to the task at hand.
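The second ingredient, building practice data out of the puzzle itself, can be sketched in a few lines. The episode does not spell out the mechanics, so the details here are assumptions for illustration: each demonstration pair takes a turn as a held-out practice question (leave-one-out), and every grid is also rotated or mirrored to multiply the number of practice tasks. The names `rotate90`, `flip_horizontal`, `leave_one_out`, and `build_practice_tasks` are hypothetical.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]                 # (input grid, output grid)
PracticeTask = Tuple[List[Pair], Pair]   # (context pairs, held-out query pair)


def identity(g: Grid) -> Grid:
    """Return an unchanged copy of the grid."""
    return [row[:] for row in g]


def rotate90(g: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]


def flip_horizontal(g: Grid) -> Grid:
    """Mirror a grid left to right."""
    return [row[::-1] for row in g]


# Assumed augmentation set; a real system could use many more transforms.
TRANSFORMS: List[Callable[[Grid], Grid]] = [identity, rotate90, flip_horizontal]


def leave_one_out(pairs: List[Pair]) -> List[PracticeTask]:
    """Each pair takes a turn as the query; the remaining pairs are context."""
    return [(pairs[:i] + pairs[i + 1:], pairs[i]) for i in range(len(pairs))]


def build_practice_tasks(pairs: List[Pair]) -> List[PracticeTask]:
    """Apply each transform to every grid, then split leave-one-out."""
    tasks: List[PracticeTask] = []
    for transform in TRANSFORMS:
        transformed = [(transform(x), transform(y)) for x, y in pairs]
        tasks.extend(leave_one_out(transformed))
    return tasks


# Three toy demonstration pairs standing in for one puzzle's examples.
demos: List[Pair] = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[0, 0], [2, 0]], [[0, 0], [0, 2]]),
    ([[3, 0], [3, 0]], [[0, 3], [0, 3]]),
]
practice = build_practice_tasks(demos)
print(f"{len(demos)} demonstrations became {len(practice)} practice tasks")
```

The point of the sketch is that no outside data is needed: a handful of demonstrations is recombined into a larger set of practice tasks for the model to train on before it answers the real question.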
And then the third key ingredient is that instead of using one generic adapter for all the tasks,
they actually train individual adapters for each specific puzzle.
So they're essentially customizing the AI's thinking
for each challenge, making it hyper-focused on that specific problem at hand.
No wonder it performs so well.
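The third ingredient, one adapter per puzzle, can be illustrated with a deliberately tiny stand-in. This is a toy sketch under stated assumptions, not the researchers' setup: the "model" is a single frozen number, each "adapter" is one extra number fitted by gradient descent, and the puzzle names and data are made up. The episode does not say what kind of adapter was used; a real system would attach small trainable modules to a language model instead.

```python
from typing import Dict, List, Tuple

Example = Tuple[float, float]   # (input, expected output)

BASE_WEIGHT = 1.0               # stands in for the frozen pretrained model


def predict(x: float, adapter: float) -> float:
    """Frozen base weight plus a task-specific correction."""
    return (BASE_WEIGHT + adapter) * x


def train_adapter(examples: List[Example], steps: int = 200, lr: float = 0.05) -> float:
    """Fit one adapter by gradient descent on mean squared error."""
    adapter = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the adapter.
        grad = sum(2 * (predict(x, adapter) - y) * x for x, y in examples) / len(examples)
        adapter -= lr * grad
    return adapter


# Hypothetical puzzles, each with its own hidden rule.
puzzles: Dict[str, List[Example]] = {
    "puzzle_a": [(1.0, 3.0), (2.0, 6.0)],   # hidden rule: y = 3x
    "puzzle_b": [(1.0, 0.5), (4.0, 2.0)],   # hidden rule: y = 0.5x
}

# One adapter per puzzle; the base weight is never modified.
adapters = {name: train_adapter(examples) for name, examples in puzzles.items()}

for name, adapter in adapters.items():
    print(f"{name}: adapter={adapter:.3f}, prediction for 5.0 = {predict(5.0, adapter):.3f}")
```

A single shared adapter would have to compromise between the two rules, while separate adapters let each puzzle pull the same base model in its own direction.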
Right. And what's even more interesting is that they found that TTT was most effective
when used with models that hadn't been trained on synthetic data that was generated by another AI,
which, you know, makes you wonder if there's something about that AI-generated data that's missing,
some element of real-world complexity that's crucial for truly robust learning?
It's like those AI-generated images you see.
They're technically very impressive, but they often have these little glitches,
or inconsistencies that kind of give them away.
Maybe there's a similar issue with AI-generated training data,
like it's not quite capturing the full richness and messiness of the real world.
That's a really astute observation,
and it definitely points to a potential area for future research.
But let's zoom out for a second and think about the bigger picture here.
What these findings suggest is that TTT could actually be a more efficient path
to improving AI than just blindly scaling up models.
Because it's not just about making AI bigger,
It's about making it smarter and more adaptable.
And the potential applications of this are huge.
Imagine if we could apply TTT to fields like scientific research,
where AI could help scientists analyze really complex data sets
and generate hypotheses in real time.
Or in software development,
where AI could help programmers debug code or design more efficient algorithms,
adapting to the specific challenges of each project.
Or even in education,
where personalized AI tutors could tailor their teaching methods
to each student's individual learning style and pace, adapting on the fly to their strengths
and weaknesses.
Now you're talking.
That's the kind of AI that could really transform lives.
But I also have to wonder, if AI can learn and adapt this quickly, doesn't that raise some
concerns about control and predictability?
What's to stop it from learning things we don't want it to learn, or behaving in ways that
we didn't anticipate?
That's a totally valid concern, and it's one that researchers are very actively exploring.
One of the key challenges with any really powerful technology is ensuring that it's used safely and responsibly.
With TTT, it's about striking a balance between adaptability and robustness,
making sure that the AI remains aligned with our goals and values, even as it learns and evolves.
It's a delicate dance, for sure.
But the potential benefits are so compelling that it seems worth exploring further.
What excites you most about the possibilities of TTT?
You know, for me, it's the idea that we might be on the verge of a truly transformative
shift in how we design and use AI. We've been so focused on building these massive, all-knowing
models, but maybe that's not the only way, or even the best way. Perhaps the future of AI lies
in smaller, more specialized systems that can learn and adapt to specific tasks and contexts,
systems that are more like partners or collaborators than just tools. I like that. AI as a partner
working alongside us to solve these complex problems and unlock new possibilities. But I'm also
curious, what are some of the limitations of TTT? Is it a silver bullet? Or
are there certain types of problems or situations where it's not as effective?
It's fascinating to think about all the ways this could change how we interact with AI.
You know, for a while, it felt like AI was this distant force, something that was happening
in research labs or powering big tech platforms, but TTT brings it closer, makes it more personal,
more tailored to our individual needs.
Yeah, I totally agree.
If you think about the evolution of technology, it often starts with these big centralized
systems, and then it gradually becomes more distributed, more accessible, more integrated
into our daily lives. And I think TTT could be a catalyst for that kind of shift in the world of
AI. Like moving from mainframe computers to personal computers. And now to smartphones and wearable
devices. What was once so exclusive and specialized becomes ubiquitous and personalized.
Exactly. And as AI becomes more integrated into our lives, the ability for it to learn
and adapt on the fly is going to be essential. We're going to need AI systems that can understand
the nuances of our individual preferences, our work styles, our learning patterns. Can you give me a
concrete example of what that might look like. Sure, imagine a world where your smartphone isn't
just a device. It's a true AI companion that learns from your interactions and anticipates your
needs, helps you navigate your day with incredible efficiency. You're working on a complex
project and your AI assistant proactively gathers relevant information, suggests potential solutions,
and even helps you draft emails or presentations, all tailored to your specific style and the context
of the project. That's a pretty compelling vision. It's not just about automation, it's about
augmentation, about AI amplifying our capabilities and helping us reach our full potential.
And TTT could be a really crucial part of making that vision a reality by allowing AI systems
to specialize on the fly to adapt to the specific challenges and opportunities of each moment.
We can create a future where AI is not just powerful, but also truly useful and truly human-centered.
So as we wrap up this deep dive into test time training, I want to leave you the listener with
this thought. What areas of your life could benefit
from this kind of AI? What tasks or challenges could you delegate or collaborate on with an AI
partner that can learn and adapt as quickly as you can? The future of AI is being written right now,
and technologies like TTT are giving us a glimpse of what's possible. It's up to all of us to
imagine and shape that future, to ensure that AI is used to empower and uplift humanity
and not to replace or diminish us. Thanks for joining us on this exploration of test time training.
We hope this deep dive has given you a new perspective on the evolving landscape
of AI and sparked your curiosity about the incredible possibilities that lie ahead.
Until next time, keep learning, keep questioning, and keep pushing the boundaries of what's possible.
