The AI Daily Brief: Artificial Intelligence News and Analysis - Google Launches Gemini 1.5 with ONE MILLION Token Context Window

Starting point is 00:00:00 Today on the AI Breakdown, Google has announced Gemini 1.5 Pro with an unbelievable million-token context window. Before that on the brief, another very long context window project, specifically focused on coding, has raised a fresh $100 million. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to Breakdown.network for more information about our YouTube, our Discord, and our newsletter. Welcome back to the AI Breakdown Brief. All the AI headline news you need in around five minutes. One of the big pursuits for AI developers right now is building software that can build software. This is important on a number of different levels. As we will discuss, it is not only about

Starting point is 00:00:43 building a new type of tool, but also about giving AIs broader reasoning power. Well, the specific context that we're starting with today is a new nine-figure funding round for Magic. Nat Friedman, former GitHub CEO and now AI investor extraordinaire, also the guy behind the Vesuvius Challenge that we talked about last week, tweeted, Magic.dev has trained a groundbreaking model with many millions of tokens of context that performed far better in our evals than anything we've tried before. They're using it to build an advanced AI programmer that can reason over your entire codebase and the transitive closure of your dependency tree. If this sounds like magic, well, you get it. Daniel and I were so impressed, that's Daniel Gross, his frequent investing partner,

Starting point is 00:01:21 we are investing $100 million into the company today. The team is intensely smart and hardworking. Building an AI programmer is both self-evidently valuable and intrinsically self-improving. If this sounds interesting to you, consider joining them. Now, one thing I want to note is this many millions of tokens of context, which is something that we're going to get deeper into in the main part of the episode. But here, obviously, part of the big innovation is that with that big of a context window, as Nat puts it, the AI programmer can reason over the entire codebase. The ability to ingest an entire codebase all at once obviously seems like it could give

Starting point is 00:01:52 Magic a very different set of capabilities. Now, in their announcement post, Magic writes that they're working on frontier-scale code models to build a coworker, not just a co-pilot. They write, Things we believe: Code generation is both a product and a path to AGI. AGI safety matters and is solvable. To build a great AI product, we need to train our own frontier-scale model. Transformers aren't the final architecture. We have something with a multi-million token context window. So there is a lot in just a few bullets there. Code generation is both a product and a path to AGI. We're about to discuss Mark Zuckerberg's comments recently, where Meta has come around to this thinking, that code generation is indeed a central capability

Starting point is 00:02:29 for the state of the art, not just from a feature parity standpoint, but because it improves general reasoning. The other bombastic note here is this idea that Transformers, which is of course what GPT is built on, the T stands for Transformer, aren't the final architecture, and that they have a multi-million token context window. Now, given all this, it's not surprising that Magic is very buzzy. However, they're not the only project thinking in similar ways. Anton Osika, who has been building GPT Engineer, which we've covered on the show before, yesterday tweeted, Happy Valentine's Day, everyone. Launching a new AI startup out of Europe today, Lovable. We're building software that builds software. How are we approaching it at a high level? Step one, a tool for anyone to prototype web apps

Starting point is 00:03:06 and collaborate via natural language on the same codebase as developers. Step two, make fully autonomous code generation work on one single set of technology choices. Now, if you go to their website lovable.dev, the claim they make is that this is the last piece of software. They write, we're building the software that builds other software. Why? The world, they write, is full of ambitious people who want to solve important problems. For three decades, software has been the most significant tool to unleash the world's ambition. Still, less than 1% of the world has the skills required to create software. If we succeed, they write, everyone will have the same capabilities that entire product development teams at stellar tech companies have today right at their fingertips. We will unlock a new era of

Starting point is 00:03:43 innovation, empowering dreamers everywhere to shape the world. We're reducing the barriers to build and staying committed to one goal: to unleash human creativity on an unprecedented scale. Now, about a month ago, Mark Zuckerberg talked a bit about what was coming with Llama 3, and some of these same themes were on display. He discussed how the company had changed its position on coding in AI, saying, one hypothesis was that coding isn't that important because it's not like a lot of people are going to ask coding questions in WhatsApp, which is, of course, a big part of the Meta portfolio. Zuckerberg went on, though, it turns out that coding is actually really important structurally for having the LLMs be able to understand the rigor and hierarchical

Starting point is 00:04:19 structure of knowledge and just generally have more of an intuitive sense of logic. You'll remember there were a lot of headlines around this time last month about Zuckerberg's goal to build an open source AGI. Now, the consequence of this shift to focus on coding inside Llama 3 is not just to compete on open source access, but to compete at the state of the art in general. Zuckerberg said, Llama 2 wasn't an industry-leading model, but it was the best open source model. With Llama 3 and beyond, our ambition is to build things that are at the state of the art, and eventually the leading models in the industry. Now, of course, there are big debates around what the net impact of this type of self-generating

Starting point is 00:04:54 software will be. Holding aside all the arguments about how this is where AI goes off the rails and all the safety risks, which are worthy of an entire episode on their own, even the debate around whether AI is going to replace programmers has a lot of interesting dimensions. Immad, the CEO at Mercury, tweeted about this yesterday, writing, why AI is not going to replace programmers. When I was in college studying computer science in 2005, I was told that outsourcing to India will remove the demand for programmers. It was a real fear. AI replacing coders, I think, is based on a similar misconception. He goes on, this misconception is rooted in a misunderstanding of what programmers do that non-programmers and even many engineers have. Programming is sometimes seen as a science where you have a very

Starting point is 00:05:34 specific spec and you convert that through a series of repetitive work to code. People perceive it similarly to high school level math. You get an equation and you solve it and get an answer. And to be fair, that is what learning to program and entry-level programming is like. So it's easy to see why people have this misconception. In reality, a lot of programming is, one, understanding customers, internal or external, two, converting that understanding into potential implementation, three, thinking about integrating seamlessly into existing code, four, thinking about building in a scalable way for future reqs, five, etc. A lot of it involves having deep taste and applying that taste against the customer need. This part is much more art than science even. So what he says AI will actually do: one, make

Starting point is 00:06:13 programmers more efficient at the repetitive part. Two, enable more people to be programmers. Three, turn more of the world into bespoke software. Four, increase the demand for great programming artists, elevating the craft even more. Ultimately, he concludes, if you're thinking of learning to code, don't be put off by the AI will replace programmers meme, and just do it. Another take on this that we've often heard from Sam Altman is that AI won't replace programmers because the demand for things that programmers build is just going to go up concurrently with our capacity to actually deliver against that demand. Anyways, it's a super interesting discussion, and honestly could have been a full episode

Starting point is 00:06:45 rather than just a brief. But before we get to the main part of this episode, let's actually cover a few additional stories. Slack becomes the latest Web 2.0 software to deeply integrate AI, announcing that Slack AI has arrived. There are a bunch of different parts of this. Personalized search results, channel recaps and thread summaries. So, for example, if you've been gone for a while, Slack AI can summarize what you've missed,

Starting point is 00:07:05 which for anyone working with me on Slack would be a very useful thing given how many messages I send. And of course, Slack and its owner Salesforce are promising that users will control their data, that they'll have more granular ability to implement tools, et cetera, et cetera. Interesting news out of Amazon. Researchers at Amazon have apparently trained the largest ever text-to-speech model, which of course matters given that the company is so invested in things like Alexa, and they claim that they're seeing what they call emergent qualities, which are improving that model's ability to speak naturally, even in the context of complex sentences. The new model is called Big Adaptive Streamable TTS with Emergent abilities, or BASE TTS.

Starting point is 00:07:41 TechCrunch writes, at 980 million parameters, BASE-large appears to be the biggest model in this category. The largest version was trained on 100,000 hours of public domain speech, 90% of which was in English with the remainder in German, Dutch, and Spanish. Areas where this large model improved from previous models, they said, include compound nouns, emotions, foreign words, paralinguistics, aka readable non-words like shh, punctuation, questions, and syntactic complexity. The authors of the paper wrote, So, once again, we have an advanced model that is working in ways we just don't quite understand. Over on Wall Street, a number of companies saw their stock rise after Nvidia disclosed that it was an investor.

Starting point is 00:08:47 Those investments included Recursion Pharmaceuticals, conversational voice assistant developer SoundHound AI, and fellow chip designer Arm. So, friends, AI's white-hot streak continues, but for us, that will end the AI Breakdown Brief. Thanks for listening or watching as always, and up next, the main AI Breakdown. Welcome back to the AI Breakdown. Today we had some unexpected and exciting news out of Google. CEO Sundar Pichai tweets, In December, we launched Gemini 1.0 Pro.

Starting point is 00:09:18 Today, we're introducing Gemini 1.5 Pro. This next-gen model uses a mixture-of-experts (MOE) approach for more efficient training and higher quality responses. Gemini 1.5 Pro, our mid-sized model, will soon come standard with a 128K token context window. But starting today, developers and customers can sign up for the limited private preview to try out 1.5 Pro with a groundbreaking and experimental 1 million token context window.

Starting point is 00:09:47 The 1 million tokens feature unlocks huge possibilities for devs. Upload hundreds of pages of text, entire code repos, and long videos, and let Gemini reason across them. It's still experimental and early, and we'd love your feedback. Now, context windows are something that people have been talking about ever since ChatGPT launched. The term refers to the number of tokens that any given model can engage with at a particular time. The larger that window, the more coherently an LLM can reason

Starting point is 00:10:12 across a bigger volume of text. To get a sense of how much of a difference it is to be talking about 128K and million-token context windows, people were incredibly excited to see ChatGPT move from 8K up to 32K. So obviously we're talking about significantly longer than that. Anthropic's Claude has also used longer context windows to try to compete, although of course one of the things that people watch out for is whether performance starts to degrade when you're actually using longer windows. Still, the initial response has been incredibly excited. Lior at Alpha Signal writes, Just in, Google releases Gemini 1.5, a powerful MOE model. It's a huge breakthrough. The model has the longest context window ever seen, 1 million tokens. It can process one hour of video,

Starting point is 00:10:52 11 hours of audio, 30,000 lines of code, or 700,000 words in a single prompt. When tested on text, code, image, audio, and video evaluations, 1.5 Pro outperforms 1.0 Pro on 87% of the benchmarks used for developing LLMs. Jeff Dean, the chief scientist at Google DeepMind and Google Research, says, one of the key differentiators of this model is its incredibly long context capabilities, supporting millions of tokens of multimodal input. The multimodal capabilities of the model mean you can interact in sophisticated ways with entire books, very long document collections, codebases of hundreds of thousands of lines

Starting point is 00:11:24 across hundreds of files, full movies, entire podcast series, and more. Now, given that Google has kind of a history of announcing things before they make them available, another thing that people were very excited about is that early testers were actually allowed to start using this 1 million token context window at no cost during the testing period. Writing about the technical approach to this model on their announcement post, Google says, Gemini 1.5 is built upon our leading research on transformer and MOE architecture. While a traditional transformer functions as one large neural network, MOE models are divided into smaller expert neural networks. Depending on the type of input given, MOE models learn to selectively

Starting point is 00:12:00 activate only the most relevant expert pathways in its neural network. This specialization massively enhances the model's efficiency. Google has been an early adopter and pioneer of the MOE technique for deep learning. Our latest innovations in model architecture allow Gemini 1.5 to learn complex tasks more quickly and maintain quality while being more efficient to train and serve. These efficiencies are helping our teams iterate, train, and deliver more advanced versions of Gemini faster than ever before. Another performance piece that many people noticed was this one. Quote, Gemini 1.5 Pro maintains high levels of performance even as its context window increases. In the needle in a haystack evaluation where a small piece of

Starting point is 00:12:35 text containing a particular fact or statement is purposely placed within a long block of text, 1.5 Pro found the embedded text 99% of the time, in blocks of data as long as 1 million tokens. Jeff Dean again tweeted about this a little bit, saying, Needle in a haystack tests out to 10 million tokens. First, let's take a quick glance at a needle in a haystack test across many different modalities to exercise Gemini 1.5 Pro's ability to retrieve information from its very long contexts. He points out that the results are almost entirely good, meaning 99.7% recall even out to 10 million tokens. That's in audio, video, and text. Avi Schiffman, who's working on the AI wearable Tab, writes,

Starting point is 00:13:11 Perfect recall with 10 million tokens? The Messiah has arrived and its name is Google. Developer Nick Dobos responded, what? That's crazy. Okay, Google, you might have a chance. And indeed, this is the sentiment that I am seeing all over Twitter slash X today. In many ways, this Gemini 1.5 Pro announcement has developers and people who are deep in the technical part of this space more excited than even Gemini Advanced did. There is so much sentiment like, wow, Google is really shipping, which is, of course, a complete switch from the discussion around them last year. The Verge's piece about this is titled, Gemini 1.5 is Google's next-gen AI model, and it's already almost ready. The next version of Google's model is better and faster,

Starting point is 00:13:49 sure, but it also has one pretty remarkable new party trick. Now, The Verge does a good job of contextualizing what one million tokens means in this multimodal model. Apparently, Sundar Pichai told The Verge, That means you can fit the entire Lord of the Rings film trilogy into that context window. As you might imagine, this has a lot of people saying, is OpenAI starting to fall behind? Where is GPT-5? Is this going to put increased pressure on them? Well, interestingly, we did get some new leaks about some things going on inside OpenAI that do in some ways relate to Google as well. Last week, we heard that they were working on AI agent startups, but now, according to people

Starting point is 00:14:24 with knowledge of the project, OpenAI is also working to develop a web search product that would bring them into direct competition with Google and, of course, other AI search tools like Perplexity. The Information writes, OpenAI has been developing a web search product that would bring the Microsoft-backed startup into more direct competition with Google. Now, details are scant right now. The Information's source said that the search service would be partly powered by Bing, but it wasn't clear whether it would be a separate product from ChatGPT or in some ways embedded

Starting point is 00:14:51 into ChatGPT. Now, this is a story that I could see going in a bunch of different ways. It could be as simple as a slightly different interface and a set of expanded features around what they already have, which is effectively what Perplexity has done, although of course, done in such a valuable way that many people have shifted their research behavior entirely to Perplexity. Or it could be a fundamentally different and expanded product. In any case, it feels to me like there is a lot brewing and bubbling behind the scenes at OpenAI, and it'll be interesting to see what actually pans out. Now, staying on the topic of OpenAI for just a minute,

Starting point is 00:15:21 they just published a piece called No, Sam Altman isn't raising trillions of dollars for chips. And basically what they're saying is that while the Wall Street Journal report from last week implied that he was actively seeking that much capital for some sort of joint venture, instead, that's Altman's calculation of the total cost of everything from real estate to power for data centers that it would take to actually achieve the type of objective that he wanted to see. So it's not just some specific company that he's out trying to raise money for. To me, it's pretty interesting to see Google putting the pedal to the metal so hard and finally starting to reclaim some ground against OpenAI,

Starting point is 00:15:54 which had been so lost throughout the course of 2023. Will this mean that we'll see an increased urgency from OpenAI to release more advanced models or new types of products, or are they comfortable just continuing to go at their own pace? That is certainly what I will be watching, and I will share it with you as I get any clarity on it. For now, though, that is going to do it for today's AI breakdown. Until next time, peace.
