The AI Daily Brief: Artificial Intelligence News and Analysis - Google Launches Gemini 1.5 with ONE MILLION Token Context Window
Episode Date: February 15, 2024
Google surprises everyone with a massive context window on their new Gemini 1.5 model. Speaking of huge context windows, Magic.dev raises $100M+ for a model that apparently can handle multiple million... token context windows.
ABOUT THE AI BREAKDOWN
The AI Breakdown helps you understand the most important news and discussions in AI.
Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe
Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown
Join the community: bit.ly/aibreakdown
Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, Google has announced Gemini 1.5 Pro with an unbelievable million-token context window.
Before that on the brief, another very long context window project, specifically focused on coding, has raised a fresh $100 million.
The AI Breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to breakdown.network for more information about our YouTube, our Discord, and our newsletter.
Welcome back to the AI Breakdown Brief.
All the AI headline news you need in around five minutes.
One of the big pursuits for AI developers right now is building software that can build software.
This is important on a number of different levels. As we will discuss, it is not only about
building a new type of tool, but also about giving AIs broader reasoning power. Well, the
specific context that we're starting with today is a new nine-figure funding round for Magic.
Former GitHub CEO and now AI investor extraordinaire, also the guy behind the Vesuvius
Challenge that we talked about last week, Nat Friedman tweeted, Magic.dev has trained a groundbreaking
model with many millions of tokens of context that performed far better in our evals than anything
we've tried before. They're using it to build an advanced AI programmer that can reason over
your entire codebase and the transitive closure of your dependency tree. If this sounds like magic,
well, you get it. Daniel and I were so impressed (that's Daniel Gross, his frequent investing partner),
we are investing $100 million into the company today. The team is intensely smart and hardworking.
Building an AI programmer is both self-evidently valuable and intrinsically self-improving.
If this sounds interesting to you, consider joining them.
Now, one thing I want to note is this many millions of tokens of context,
which is something that we're going to get deeper into in the main part of the episode.
But here, obviously, part of the big innovation is that with that big of a context window,
as Nat puts it, the AI programmer can reason over the entire codebase.
The ability to ingest an entire codebase all at once obviously seems like it could give
Magic a very different set of capabilities.
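To make the phrase "transitive closure of your dependency tree" concrete, here is a minimal sketch in Python. It is not Magic's method, which has not been published; the package names and the `transitive_dependencies` helper are hypothetical and exist only for illustration. The idea is that the closure is every package you can reach by following dependency edges, not just the ones your project lists directly.

```python
from collections import deque


def transitive_dependencies(graph, root):
    """Return every package reachable from `root` by following dependency edges.

    `graph` maps a package name to the list of packages it depends on directly.
    """
    seen = set()
    queue = deque(graph.get(root, ()))
    while queue:
        pkg = queue.popleft()
        if pkg in seen:
            continue  # already visited, which also guards against cycles
        seen.add(pkg)
        queue.extend(graph.get(pkg, ()))
    return seen


# Hypothetical dependency graph, for illustration only.
deps = {
    "my_app": ["web_framework", "db_client"],
    "web_framework": ["http_lib", "templating"],
    "db_client": ["http_lib"],
    "http_lib": [],
    "templating": [],
}

closure = transitive_dependencies(deps, "my_app")
print(sorted(closure))
```

Here `my_app` lists only two direct dependencies, but the closure contains four packages. On a real project that gap is usually much larger, which is why reasoning over the whole closure needs such a long context window.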
Now, in their announcement post, Magic writes that they're working on frontier-scale code models
to build a coworker, not just a co-pilot. They write, Things we believe: Code generation is both a product
and a path to AGI. AGI safety matters and is solvable. To build a great AI product, we need to
train our own frontier scale model. Transformers aren't the final architecture. We have something with a
multi-million token context window. So there is a lot in just a few bullets there. Code generation as
both a product and a path to AGI. We're about to discuss Mark Zuckerberg's comments recently,
where Meta has come around to this thinking, that code generation is indeed a central capability
for the state of the art, not just from a feature parity standpoint, but because it improves general
reasoning. The other bombastic note here is this idea that Transformers, which is of course what
GPT is built on, the T stands for Transformer, aren't the final architecture, and that they have a
multi-million token context window. Now, given all this, it's not surprising that Magic is very
buzzy. However, they're not the only project thinking in similar ways. Anton Osika, who has been building
GPT Engineer, which we've covered on the show before, yesterday tweeted, Happy Valentine's Day,
everyone, launching a new AI startup out of Europe today, Lovable. We're building software that builds
software. How we are approaching it at a high level: Step one, a tool for anyone to prototype web apps
and collaborate via natural language on the same codebase as developers. Step two, make fully autonomous
code generation work on one single set of technology choices. Now, if you go to their website, lovable.dev,
the claim they make is that this is the last piece of software. They write, we're building the software
that builds other software. Why? The world, they write, is full of ambitious people who want to solve
important problems. For three decades, software has been the most significant tool to unleash the
world's ambition. Still less than 1% of the world has the skills required to create software. If we
succeed, they write, everyone will have the same capabilities that entire product development
teams at stellar tech companies have today right at their fingertips. We will unlock a new era of
innovation, empowering dreamers everywhere to shape the world. We're reducing the barriers to build
and staying committed to one goal to unleash human creativity on an unprecedented scale.
Now, about a month ago, Mark Zuckerberg talked a bit about what was coming with Llama 3,
and some of these same themes were on display. He discussed how the company had changed its position
on coding in AI, saying, one hypothesis was that coding isn't that important because it's not
like a lot of people are going to ask coding questions in WhatsApp, which is, of course, a big part
of the Meta portfolio. Zuckerberg went on, though, it turns out that coding is actually really
important structurally for having the LLMs be able to understand the rigor and hierarchical
structure of knowledge and just generally have more of an intuitive sense of logic.
You remember there were a lot of headlines around this time last month about Zuckerberg's goal
to build an open source AGI. Now the consequence of this shift to focus on coding inside Llama
3 is not just to compete on that open source axis, but to compete at the state of the art in general.
Zuckerberg said, Llama 2 wasn't an industry leading model, but it was the best open source model.
With Llama 3 and beyond, our ambition is to build
things that are at the state of the art, and eventually the leading models in the industry.
Now, of course, there are big debates around what the net impact of this type of self-generating
software can do. Holding aside all the arguments about how this is where AI goes off the rails
and all the safety risks, which is worthy of an entire episode on its own, even the debate around
whether AI is going to replace programmers has a lot of interesting dimensions. Immad, the CEO at Mercury,
tweeted about this yesterday, writing, Why AI is not going to replace programmers. When I was in college
studying computer science in 2005, I was told that outsourcing to India will remove the demand for
programmers. It was a real fear. AI replacing coders, I think, is based on a similar misconception.
He goes on, this misconception is rooted in a misunderstanding of what programmers do that non-programmers
and even many engineers have. Programming is sometimes seen as a science where you have a very
specific spec and you convert that through a series of repetitive work to code. People perceive it
similarly to high school level math. You get an equation and you solve it and get an answer. And to be
fair, that is what learning to program and entry-level programming is like. So it's easy to see why people
have this misconception. In reality, a lot of programming is, one, understanding customers, internal or
external, two, converting that understanding into potential implementation, three, thinking about
integrating seamlessly into existing code, four, thinking about building in a scalable way for future
reqs, five, etc. A lot of it involves having deep taste and applying that taste against the customer
need. This part is much more art than science even. So what he says AI will actually do: one, make
programmers more efficient at the repetitive part. Two, enable more people to be programmers.
Three, turn more of the world into bespoke software. Four, increase the demand for great programming
artists elevating the craft even more. Ultimately, he concludes, if you're thinking of learning to
code, don't be put off by the AI will replace programmers meme, and just do it. Another take on this
that we've often heard from Sam Altman is that AI won't replace programmers because the demand for
things that programmers build is just going to go up consequently with our capacity to actually
deliver against that demand.
Anyways, it's a super interesting discussion, and honestly could have been a full episode
rather than just a brief.
But before we get to the main part of this episode, let's actually cover a few additional
stories.
Slack becomes the latest Web 2.0 software to deeply integrate AI, announcing that Slack AI has
arrived.
There are a bunch of different parts of this.
Personalized search results, channel recaps and thread summaries.
So, for example, if you've been gone for a while, Slack AI can summarize what you've missed,
which for anyone working with me on Slack would be a very useful thing given how many
messages I send. And of course, Slack and its owner Salesforce are promising that users will
control their data, that they'll have more granular ability to implement tools, et cetera, et cetera.
Interesting news out of Amazon. Researchers at Amazon have apparently trained the largest ever
text-to-speech model, which of course matters given that the company is so invested in things like
Alexa, and they claim that they're seeing what they call emergent qualities, which are improving
that model's ability to speak naturally, even in the context of complex sentences. The new model is called
Big Adaptive Streamable TTS with Emergent abilities, or BASE TTS.
TechCrunch writes, at 980 million parameters, BASE large appears to be the biggest model in this
category. The largest version was trained on 100,000 hours of public domain speech, 90% of which
was in English with the remainder in German, Dutch, and Spanish. Areas where this large model
improved from previous models, they said, include compound nouns, emotions, foreign words,
paralinguistics, aka readable non-words like sh, punctuation, questions, and syntactic complexity.
The authors of a paper wrote,
So, once again, we have an advanced model that is working in ways we just don't quite understand.
Over on Wall Street, a number of companies saw their stock rise after Nvidia disclosed that it was an investor.
Those investments included Recursion Pharmaceuticals, conversational voice assistant developer SoundHound AI,
and fellow AI chip designer Arm.
So, friends, AI's white-hot streak continues, but for us, that will end the AI Breakdown Brief.
Thanks for listening or watching as always, and up next, the main AI Breakdown.
Welcome back to the AI Breakdown.
Today we had some unexpected and exciting news out of Google.
CEO Sundar Pichai tweets,
In December, we launched Gemini 1.0 Pro.
Today, we're introducing Gemini 1.5 Pro.
This next-gen model uses a Mixture-of-Experts (MoE) approach
for more efficient training and higher quality responses.
Gemini 1.5 Pro, our mid-sized model, will soon come standard
with a 128K token context window.
But starting today,
developers and customers can sign up for the limited private preview to try out 1.5 Pro
with a groundbreaking and experimental 1 million token context window.
The 1 million tokens feature unlocks huge possibilities for devs.
Upload hundreds of pages of text, entire code repos, and long videos, and let Gemini reason
across them.
It's still experimental and early, and we'd love your feedback.
Now, context windows are something that people have been talking about ever since ChatGPT launched.
The term refers to the number of tokens that any given model can
engage with at a particular time. The larger that window, the more coherently an LLM can reason
across a bigger volume of text. To get a sense of how much of a difference it is to be talking
about 128K and million-token context windows, people were incredibly excited to see ChatGPT move from
8K up to 32K. So obviously we're talking about significantly longer than that. Anthropic's Claude has also
used longer context windows to try to compete, although of course one of the things that people
watch out for is whether performance starts to degrade when you're actually using longer
windows. Still, the initial response has been incredibly excited. Lior at Alpha Signal writes,
Just in, Google releases Gemini 1.5, a powerful MoE model. It's a huge breakthrough. The model
has the longest context window ever seen, 1 million tokens. It can process one hour of video,
11 hours of audio, 30,000 lines of code, or 700,000 words in a single prompt. When tested on
text, code, image, audio, and video evaluations, 1.5 Pro outperforms 1.0 Pro on 87% of the
benchmarks used for developing LLMs.
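As a rough way to compare the window sizes mentioned so far, here is a back-of-the-envelope sketch in Python. It assumes the ratio implied by the figures just quoted (about 700,000 words per 1 million tokens) holds across all window sizes, which is only a rule of thumb; real tokenizers vary by language and content, and the helper names here are made up for illustration.

```python
# Ratio taken from the quoted figures: roughly 700,000 words per 1,000,000 tokens.
# This is an approximation, not a property of any particular tokenizer.
REFERENCE_TOKENS = 1_000_000
REFERENCE_WORDS = 700_000


def words_that_fit(window_tokens):
    """Approximate number of English words that fit in a window of this size."""
    return window_tokens * REFERENCE_WORDS // REFERENCE_TOKENS


def estimate_tokens(word_count):
    """Approximate token count for a text of this many words (rounded up)."""
    return (word_count * REFERENCE_TOKENS + REFERENCE_WORDS - 1) // REFERENCE_WORDS


def fits(word_count, window_tokens):
    """True if a text of `word_count` words should fit in the window."""
    return estimate_tokens(word_count) <= window_tokens


for window in (8_000, 32_000, 128_000, 1_000_000):
    print(f"{window:>9,} tokens ~ {words_that_fit(window):>7,} words")

# A hypothetical 100,000-word manuscript against two window sizes.
print(fits(100_000, 128_000), fits(100_000, 1_000_000))
```

Under this estimate, a 100,000-word manuscript overflows a 128K window but fits comfortably in a 1 million token window, which is the practical difference people are reacting to.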
Jeff Dean, the chief scientist at Google DeepMind and Google Research, says,
one of the key differentiators of this model is its incredibly long context capabilities,
supporting millions of tokens of multimodal input.
The multimodal capabilities of the model mean you can interact in sophisticated ways with
entire books, very long document collections, codebases of hundreds of thousands of lines
across hundreds of files, full movies, entire podcast series, and more.
Now, given that Google has kind of a history of announcing things before they make them available,
another thing that people were very excited about is that early testers were
actually allowed to start using this 1 million token context window at no cost during the testing period.
Writing about the technical approach to this model on their announcement post, Google says,
Gemini 1.5 is built upon our leading research on Transformer and MoE architecture.
While a traditional Transformer functions as one large neural network, MoE models are divided into
smaller expert neural networks. Depending on the type of input given, MoE models learn to selectively
activate only the most relevant expert pathways in its neural network. This specialization massively
enhances the model's efficiency. Google has been an early adopter and pioneer of the MoE technique for
deep learning. Our latest innovations in model architecture allow Gemini 1.5 to learn complex tasks more
quickly and maintain quality while being more efficient to train and serve. These efficiencies
are helping our teams iterate, train, and deliver more advanced versions of Gemini faster than ever
before. Another performance piece that many people noticed was this one. Quote,
Gemini 1.5 Pro maintains high levels of performance even as its context window increases. In the needle
in a haystack evaluation, where a small piece of
text containing a particular fact or statement is purposely placed within a long block of text,
1.5 Pro found the embedded text 99% of the time, in blocks of data as long as 1 million tokens.
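For anyone who wants to picture what a needle in a haystack evaluation looks like mechanically, here is a toy sketch in Python. It is not Google's evaluation harness. The `toy_model` function is a stand-in that simply "sees" only the last `max_chars` characters of the prompt, and the filler text, needle, and function names are all assumptions made for illustration. A real test would send the haystack and a question to an actual model.

```python
FILLER = "The quick brown fox jumps over the lazy dog."
NEEDLE = "The secret passphrase is 'blue-giraffe-42'."
ANSWER = "blue-giraffe-42"


def build_haystack(num_sentences, depth):
    """Bury NEEDLE in filler text at a fractional depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * num_sentences
    sentences.insert(int(depth * num_sentences), NEEDLE)
    return " ".join(sentences)


def toy_model(prompt, max_chars):
    """Stand-in for a real model: it only 'sees' the last max_chars characters."""
    visible = prompt[-max_chars:]
    return ANSWER if NEEDLE in visible else "I don't know."


def recall(num_sentences, depths, max_chars):
    """Fraction of needle placements for which the model returns the answer."""
    hits = 0
    for depth in depths:
        reply = toy_model(build_haystack(num_sentences, depth), max_chars)
        hits += ANSWER in reply
    return hits / len(depths)


depths = [i / 10 for i in range(11)]
full_recall = recall(1_000, depths, max_chars=10_000_000)  # window covers everything
short_recall = recall(1_000, depths, max_chars=20_000)     # window cuts off early text
print(full_recall, short_recall)
```

The stand-in with the short window misses needles buried early in the text, which is exactly the kind of degradation this test is meant to expose. The reported 99% figure says that 1.5 Pro mostly avoids it, even at 1 million tokens.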
Jeff Dean again tweeted about this a little bit, saying,
Needle in a haystack tests out to 10 million tokens. First, let's take a quick glance at a needle
in a haystack test across many different modalities to exercise Gemini 1.5 Pro's ability to retrieve
information from its very long contexts. He points out that the results are almost entirely good,
meaning 99.7% recall even out to 10 million tokens. That's in audio, video, and text.
Avi Schiffman, who's working on the AI wearable Tab, writes,
Perfect recall with 10 million tokens? The Messiah has arrived and its name is Google.
Developer Nick Dobos responded, what? That's crazy. Okay, Google, you might have a chance.
And indeed, this is the sentiment that I am seeing all over Twitter slash X today.
In many ways, this Gemini 1.5 Pro announcement has developers and people who are deep in the
technical part of this space more excited than Gemini Advanced even did. There is so much sentiment
like, wow, Google is really shipping, which is, of course, a complete switch from the discussion
around them last year. The Verge's piece about this is Gemini 1.5 is Google's next-gen
AI model, and it's already almost ready. The next version of Google's model is better and faster,
sure, but it also has one pretty remarkable new party trick. Now, The Verge does a good job of contextualizing
what one million tokens means in this multimodal model. Apparently, Sundar Pichai told The Verge,
That means you can fit the entire Lord of the Rings film trilogy into that context window.
As you might imagine, this has a lot of people saying, is OpenAI starting to fall behind?
Where is GPT-5? Is this going to put increased pressure on them?
Well, interestingly, we did get some new leaks about some things going on inside OpenAI
that do in some ways relate to Google as well.
Last week, we heard that they were working on AI agent startups, but now, according to people
with knowledge of the project, OpenAI is also working to develop a web search product
that would bring them into direct competition with Google and, of course, other AI search tools like
Perplexity.
Writes The Information, OpenAI has been developing a web search product that would bring the
Microsoft-backed startup into more direct competition with Google.
Now, details are scant right now.
The Information's source said that the search service would be partly powered by Bing, but it
wasn't clear whether it would be a separate product from ChatGPT or in some ways embedded
into ChatGPT.
Now, this is a story that I could see going in a bunch of different ways.
It could be as simple as a slightly different interface and a set of expanded features around what they already have,
which is effectively what Perplexity has done, although of course, done in such a valuable way that many people have shifted their research behavior entirely to Perplexity.
Or it could be a fundamentally different and expanded product.
In any case, it feels to me like there is a lot brewing and bubbling behind the scenes at OpenAI,
and it'll be interesting to see what actually pans out.
Now, staying on the topic of OpenAI for just a minute,
they just published a piece called No, Sam Altman isn't raising trillions of dollars for chips.
And basically what they're saying is that while the Wall Street Journal report from last week
implied that he was actively seeking that much capital for some sort of joint venture,
instead, that's Altman's calculation of the total cost of everything from real estate to power for
data centers that it would take to actually achieve the type of objective that he wanted to see.
So it's not just some specific company that he's out trying to raise money for.
To me, it's pretty interesting to see Google putting the pedal to the metal so hard
and finally starting to reclaim some ground against OpenAI,
which had been so lost throughout the course of 2023.
Will this mean that we'll see an increased urgency from OpenAI
to release more advanced models or new types of products,
or are they comfortable just continuing to go at their own pace?
That is certainly what I will be watching,
and I will share it with you as I get any clarity on it.
For now, though, that is going to do it for today's AI breakdown.
Until next time, peace.
