The AI Daily Brief: Artificial Intelligence News and Analysis - ChatGPT Can Now See and Hear

Starting point is 00:00:00 Today on the AI Breakdown, we're looking at some massive new multimodal features from ChatGPT. Before that on the brief, Amazon makes a major investment in OpenAI competitor Anthropic. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to breakdown.network for more information about our Discord, our newsletter, and our YouTube channel. Welcome back to the AI Breakdown Brief, all the AI headline news you need in around five minutes. Very, very late last night slash early this morning, news broke that Amazon was making a huge, up to $4 billion investment in Anthropic. Anthropic is, of course, the company best known for their chatbot Claude, and has tried

Starting point is 00:00:43 to differentiate itself from OpenAI in a couple of different ways. One is from a feature standpoint. While ChatGPT remains at a much lower context window, Claude rolled out a 100K context window earlier this year, and Anthropic has also tried to differentiate itself based on its approach to AI safety. Rather than focus on reinforcement learning through human feedback, Anthropic has invested in something that they call constitutional AI. The idea is to train the AI on a set of underlying principles, drawn from a variety of different sources, and effectively help it reason around what it should

Starting point is 00:01:13 or shouldn't do in any given situation based on those underlying principles. That idea of safety does make it into the company's press release. In fact, the title of the announcement on Anthropic's webpage is Expanding Access to Safer AI with Amazon. Now, in terms of financial details, they didn't reveal much. Anthropic writes, as part of the investment, Amazon will take a minority stake in Anthropic. Our corporate governance structure remains unchanged, with the Long-Term Benefit Trust continuing to guide Anthropic in accordance with our Responsible Scaling Policy. As outlined in this policy, we will conduct pre-deployment tests of new models to help us manage the risks of increasingly capable AI systems. So no valuation given here, we just know presumably that if Amazon deployed

Starting point is 00:01:53 the entirety of that amount, they would still own less than 50% of the company. Now, a lot of emphasis in the announcement is around how the companies will be working together beyond just capital. They write, the agreement is part of a broader collaboration to develop the most reliable and high-performing foundation models in the industry. Our frontier safety research and products, together with Amazon Web Services' expertise in running secure, reliable infrastructure, will make Anthropic's safe and steerable AI widely accessible to AWS customers. So specifically, AWS is becoming Anthropic's primary cloud provider, and in addition, they are committing to train future models on Amazon's AWS Trainium and Inferentia chips, with the idea being that in addition to just using these new chips,

Starting point is 00:02:32 they will help develop the future versions of them as well. As part of the announcement, Anthropic also said that they're expanding support of Amazon's Bedrock platform. Bedrock is Amazon's approach to giving enterprises access to multiple models from a single space, and Anthropic writes that their increased support includes secure model customization and fine-tuning on the service to enable enterprises to optimize Claude's performance with their expert knowledge while limiting the potential for harmful outcomes. Now, obviously everyone understands, instantly upon reading this, that this is Amazon's

Starting point is 00:03:00 Microsoft OpenAI-style deal. It is Amazon going deep with one of the leading startups in the foundation model space, and it is clearly a multi-pronged partnership designed to touch on everything from Amazon's enterprise services all the way to their development of new AI chips. So you have Microsoft teaming up with OpenAI, Amazon teaming up with Anthropic, and Meta, Google, and presumably Apple all going it on their own. Speaking of Meta, we've had reports for a few weeks that the company was developing AI chatbots with different personalities in an attempt to increase engagement among younger users of their services. According to the Wall Street Journal, those chatbots could be coming as early as this week. WSJ explains the context saying, going after younger

Starting point is 00:03:39 users has been a priority for Meta with the emergence of TikTok, which overtook Instagram in popularity among teenagers in the past couple of years. The shift prompted Meta chief executive Mark Zuckerberg in October 2021 to say the company would retool its teams to make serving young adults their North Star rather than optimizing for the larger number of older people. Now, in terms of the personalities of these bots that are actually coming, the WSJ writes about one called Bob the Robot, which is a self-described sass master general with superior intellect, sharp wit, and biting sarcasm. Now, the reference point is the robot Bender from Futurama, but I have to say, the description of a sass master general makes me wonder just how in touch with the kids this

Starting point is 00:04:17 company really is. Now, obviously other social media companies like Snap have been experimenting with chatbots to increase engagement as well, with frankly inconclusive results so far. Moving back over to the world of Microsoft for just a moment, a job listing got some people chattering over the weekend. Data Center Dynamics sums it up: Microsoft Cloud hiring to implement global small modular reactor and microreactor strategy to power data centers. Basically, it appears that Microsoft is expanding their engagement with nuclear energy and is potentially exploring how SMRs, or small modular reactors, could be a part of their energy mix in the future. Radiant Energy Fund's Mark Nelson writes, word is out, Microsoft is plunging ahead on nuclear energy. They want a fleet of reactors,

Starting point is 00:04:56 powering new data centers. And now they're hiring people from the traditional nuclear industry to get it done. A world is coming where only the tech companies willing to become nuclear power developers may get to keep expanding their cloud businesses, and only countries open to new reactors get to host this expansion. A world where tech companies with 50% margins become the only survival hope for traditional industrial concerns with 5% margins who need someone else to bootstrap a proper electricity supply. The race is on. Now, speaking of advanced technology powering new data centers and other manufacturing concerns, the South China Morning Post is reporting that China is planning to get around U.S. chip sanctions by building a massive chip factory powered by a particle

Starting point is 00:05:35 accelerator. SCMP writes, China is exploring new avenues to bypass restrictions on lithography machines, which are used in the production of microchips. Using particle accelerators to create a novel laser source, researchers are laying the foundation for the future of semiconductor fabrication. Now, what's interesting to me about this story is just the way that it reflects how much is in flux right now and how much U.S. policy towards China around artificial intelligence-related technologies is having an impact on how that country plans its technological future. Speaking of the U.S. government, Semafor is reporting that the White House is looking into an executive order on artificial intelligence that would, among other things,

Starting point is 00:06:09 force cloud companies to disclose which AI companies were using their services. From the article, the provision would direct the Commerce Department to write rules forcing cloud companies like Microsoft, Google, and Amazon to disclose when a customer purchases computing resources beyond a certain threshold. The order hasn't been finalized and the specifics of it could still change. Now, Semafor draws the connection to KYC rules for banking, and basically this is another way for authorities to have a sense of who's making extreme transactions, although in this case in energy, in order to effectively get out ahead of challenges before they happen. As Semafor writes, the rules are intended to create a system that would allow the U.S. government to identify potential

Starting point is 00:06:44 AI threats ahead of time, particularly those coming from entities in foreign countries. If a company in the Middle East began building a powerful large language model using Amazon Web Services, for example, the reporting requirement would theoretically give American authorities an early warning about it. Now, the last piece is also really interesting. They write, the policy proposal represents a potential step towards treating computing power like a national resource. Hold aside the specifics of this potential executive order. There are a lot of people who sit at the intersection of global politics and technology who think that the idea that computing power is a national resource is a good one for the government to embrace. But for now, that is going to do

Starting point is 00:07:19 it for today's AI Breakdown Brief. We are kicking off the week with a bang. Stick around. We're going to talk a little bit more about OpenAI's latest multimodal announcements, plus some big speculation from Reddit, all of that and more coming up on the main episode. Hey guys, one more quick thing before we get into the main episode. If you subscribe to the newsletter, you've seen this, and you might have heard it on an earlier episode. But right now, I am getting information from you guys, the listeners, about what you are looking for in terms of AI educational resources. A bunch of you have filled out the survey already, and it's so helpful.

Starting point is 00:07:50 But if you would take about one minute and go to bit.ly slash AI breakdown survey, I would love to know what type of online courses you might need, what you're trying to learn more about, whether you'd be interested in a community of learners. I'm getting really close to making some decisions about what we're going to do next, and I really want all of your input. Again, it'll take about one minute, and you can find it at bit.ly slash AI breakdown survey.

Starting point is 00:08:12 Thanks so much, and now on with the show. Welcome back to the AI Breakdown. Today, we are talking about the latest developments in ChatGPT. They are very emblematic of the larger business battle that we find ourselves in the midst of. And in the second part of the show, we will dig into some very intriguing rumors coming from Twitter and Reddit around the company and just how much they've developed internally that we don't yet have visibility into. But let's kick it off with the announcement from this morning that, as they put it, ChatGPT can now see, hear, and speak. Developer Relations Logan over at OpenAI says this is one of the biggest

Starting point is 00:08:47 evolutions for ChatGPT to date. Y'all are going to love these new capabilities, truly incredible. So there are two big things going on here. The first is around using images as inputs for ChatGPT. The example they give is they take a photo of a bike and ask ChatGPT to help them lower the bike seat. ChatGPT responds, giving them a set of instructions and then saying, if you have tools, show me and I'll guide you further. The prompter takes a close-up photo of a specific part of the bike and draws a circle to let ChatGPT know to focus on that specific part. The prompter writes, is this the lever? To which ChatGPT responds, no, that's not a lever, it's a bolt. You'll need an Allen wrench to loosen it. Now, obviously, we don't need to go too

Starting point is 00:09:26 deep into the details here, but the point is that all of a sudden you can use pictures of the real world to interact with ChatGPT in a way that wasn't possible before. This, of course, opens up a huge number of different use cases, which is why people have been excited about multimodal and image-based inputs. Now, the second part of it is that in addition to just using voice as an input for the ChatGPT mobile app, ChatGPT can now talk back. OpenAI writes, use your voice to engage in a back and forth conversation with ChatGPT. Speak with it on the go, request a bedtime story, or settle a dinner table debate. They've been loving the cutesy examples recently, and the bedtime story is the one that they

Starting point is 00:09:59 chose to demo. Now, one small technical detail that was interesting. When it comes to speech recognition, they use Whisper, which has of course been lauded for being much farther ahead than many other speech recognition services, and that's what's used to transcribe when someone speaks into ChatGPT. But they write, the new voice capability is powered by a new text-to-speech model, capable of generating human-like audio from just text and a few seconds of sample speech. We collaborated with professional voice actors to create each of the voices. They have five different voices, Juniper, Sky, Cove, Ember, and Breeze, that they give a demo of. So is this all rolling out all at once?

Starting point is 00:10:33 The answer, of course, is no. OpenAI writes, we are deploying image and voice capabilities gradually. They basically say that this is their normal model anyways, but when it comes to things like images and voices, it's even more important. They write, the new voice technology, capable of creating realistic synthetic voices from just a few seconds of real speech, opens doors to many creative and accessibility-focused applications. However, these capabilities also present new risks, such as the potential for malicious actors to impersonate public figures or commit fraud. When it comes to the challenges of new image inputs, they say they range from hallucinations

Starting point is 00:11:01 to people overly relying on the model's interpretation of images in high-stakes domains. So, TLDR, this is the update that The Information was reporting on about a week ago. The context they gave, which is the one I agree with, is that the impending reality of Google's Gemini is creating pressure for OpenAI to race towards multimodality, perhaps faster than they might otherwise have. That was Dr. Jim Fan from Nvidia's take when DALL-E 3 was announced. He tweeted, it's not just a stance against Midjourney, it's actually a sneak peek of the upcoming epic battle of massively multimodal LLMs against DeepMind Gemini.

Starting point is 00:11:32 Now, the interesting thing about this news is that it's so easy to get caught up in this larger conversation of the battle between Google and OpenAI and this larger phenomenon of competitive accelerationism that we don't stop and remember how remarkable these new features are. When it comes to increasing the utility of ChatGPT in a day-to-day way, the ability to interact going back and forth via audio makes it unbelievably more useful for a mobile world, but the ability to use images as inputs, especially when on the go, makes ChatGPT so much closer to the actual superpowered AI assistant that so many people have imagined. The bike example may seem small, but that's the type of thing that

Starting point is 00:12:09 people interact with every single day, day in and day out. That's the type of thing that people use Google for. I wonder what percentage of my Google searches have something to do with finding instructions or how to do something. It's probably a fairly big percentage, relatively speaking. By having this type of image input, ChatGPT is effectively competing not just with Google searches, but with my FaceTiming my brother, who's much more technical than I am, to have him try to figure out something for me. We talked a bunch last week about how Google is trying to differentiate by just loading up on actual utility and making their AIs more useful through integrations with other tools like Google Workspace.

Starting point is 00:12:43 And then in the wake of that conversation, we saw Microsoft integrating AI everywhere through Windows 11 updates and now OpenAI expanding the sort of day-to-day type of capabilities that will make ChatGPT much more powerful. Now, this would be interesting if it was the only ChatGPT and OpenAI story, but it was not. However, from this part of all confirmed, announced things, we are now moving wildly into the realm of speculation, so a huge grain of salt warning for everything that comes next in this discussion. Over the weekend, there was a bunch of discussion on Reddit from two users who claimed that they had access to OpenAI's internal models and who were sharing some of the

Starting point is 00:13:18 information that they had seen. I'll link to the specific posts, or at least Twitter screenshots of those posts, but here are some of the highlights that these users claimed. One of those users, FeltSteam, writes, so OpenAI obviously isn't just slowly developing one model at a time, but are, of course, working on multiple. The one that I know most about has an internal name of Arrakis. It is kind of wild. So far as I know, it's an everything-to-everything model, meaning you can input any combination of text, image, audio, and video. So what are some of the other details that these posters give? Well, one, they say that Arrakis exceeds GPT-4 capabilities and can match human experts in many different fields. They claim that hallucination rates are much lower than GPT-4, and interestingly

Starting point is 00:13:57 that half of the training data was synthetic. Now, there has been an ongoing conversation about the extent to which synthetic data might be problematic for training AI models in the future, although there have been some results, including an unreleased Facebook Llama 2 model, that suggest that synthetic data actually can increase performance as well. In other words, that's a big open question, so it's fascinating that potentially this advanced model has 50% of its data coming from synthetic sources. Now, when it comes to when this stuff is coming out, the poster writes, in terms of release date, they originally didn't plan to release in 2024, but I think it's entirely possible to see it release sometime during 2024 as their timelines have been

Starting point is 00:14:30 accelerated, though it is their fault that everyone is accelerating in AI development as the release of ChatGPT and GPT-4 showed what was possible, and now people are slowly catching up, so it's complicated. Now, the other speculation around OpenAI comes around a Twitter account called Jimmy Apples. People are paying attention to this a little bit more than they might a random Twitter account, because after The Information reported on September 18th that OpenAI was going to be releasing these multimodal features and that they were working on a new multimodal LLM called Gobi, Jimmy Apples pointed out his own tweet from April 28, where he said, the big multimodal currently in the works at OpenAI is called Gobi. Should I leak more?

Starting point is 00:15:04 Given that they were right about that, people are paying attention. And on September 18th, Apples tweeted, AGI has been achieved internally. Now add to that a bunch of cryptic tweets from Sam Altman, one being, sure, 10x engineers are cool, but damn those 10,000x engineers and researchers, dot, dot, dot, and the other being, short timelines and slow takeoff will be a pretty good call, I think, but the way people define the start of the takeoff may make it seem otherwise. So this has just ratcheted AGI speculation up to about a thousand. Simeon CPS tweets, can we consider seriously the hypothesis that, one, the recently hyped tweets from OA's staff, two, AGI has been achieved internally, three, Sam Altman's

Starting point is 00:15:41 comments on the qualification of slow or fast takeoff hinging on the date you count from, four, Sam Altman's comments on 10,000x researchers, are actually mapping to something true. The implications are so crazy in terms of power shift, or levels of risk over the next few months. Now, Sully Omar captured some of my feeling when he wrote, this whole thing is giving weird vibes. There's two possibilities. One, OpenAI has achieved AGI internally. Two, they're messing with everyone slash hyping things up for fun, question mark. But one thing is for sure. AGI is coming way faster than everyone thinks it is. Eliezer Yudkowsky seems to agree. The Twitter account @PauseAI responded to Sam Altman's tweet about short timelines and slow takeoff and

Starting point is 00:16:18 said, how about we abort launch? To which Eliezer responded, you're talking to the wrong person. OpenAI has zero ability to stop the avalanche they started. That's now a matter for treaties between major powers. So friends, lots of intriguing things happening in the world of AI. We've certainly got another example of that competitive accelerationism we've been talking about. And holding aside all of the speculative stuff about AGI, we have a massively more performant and useful ChatGPT coming right around the corner. From the sheer standpoint of people who use ChatGPT and tools like Midjourney for productivity, October is gearing up to be a very, very good month. We will, of course, keep you up to date on all of the developments, including probably any relevant speculation here

Starting point is 00:16:57 at the AI Breakdown. But for now, that is going to do it for the show. Thanks as always for listening or watching, and until next time, peace.

