The AI Daily Brief: Artificial Intelligence News and Analysis - ChatGPT Can Now See and Hear
Episode Date: September 25, 2023
The race towards multimodal LLMs is heating up! With rumors of a big impending launch of Google Gemini, OpenAI is racing to push out their multimodal features. Today they launched the ability for ChatGPT to carry on audio conversations, as well as to use images as inputs. Before that on the Brief, Amazon to invest up to $4B in Anthropic. ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, we're looking at some massive new multimodal features from ChatGPT.
Before that on the Brief, Amazon makes a major investment in OpenAI competitor Anthropic.
The AI Breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to breakdown.network for more information about our Discord, our newsletter, and our YouTube channel.
Welcome back to the AI Breakdown Brief, all the AI headline news you need in around five minutes.
Very, very late last night slash early this morning, news broke
that Amazon was making a huge, up to $4 billion investment in Anthropic.
Anthropic is, of course, the company best known for their chatbot Claude, and has tried
to differentiate itself from OpenAI in a couple of different ways.
One is from a feature standpoint.
While ChatGPT remains at a much lower context window, Claude rolled out a 100K context window
earlier this year, and Anthropic has also tried to differentiate itself based on its approach
to AI safety.
Rather than focus on reinforcement learning through human feedback, Anthropic is invested in something
that they call constitutional AI. The idea is to train the AI on a set of underlying principles,
drawn from a variety of different sources, and effectively help it reason around what it should
or shouldn't do in any given situation based on those underlying principles. That idea of safety
does make it into the company's press release. In fact, the title of the announcement on Anthropic's
webpage is Expanding Access to Safer AI with Amazon. Now, in terms of financial details, they didn't
reveal much. Anthropic writes, As part of the investment, Amazon will take a minority
stake in Anthropic. Our corporate governance structure remains unchanged, with the Long-Term Benefit
Trust continuing to guide Anthropic in accordance with our Responsible Scaling Policy. As outlined in this
policy, we will conduct pre-deployment tests of new models to help us manage the risks of increasingly
capable AI systems. So no valuation given here, we just know presumably that if Amazon deployed
the entirety of that amount, they would still own less than 50% of the company. Now, a lot of emphasis
in the announcement is around how the companies will be working together beyond just capital.
They write, The agreement is part of a broader collaboration to develop the most reliable and high-performing foundation models in the industry.
Our frontier safety research and products, together with Amazon Web Services' expertise in running secure, reliable infrastructure,
will make Anthropic's safe and steerable AI widely accessible to AWS customers.
So specifically, AWS is becoming Anthropic's primary cloud provider,
and in addition, they are committing to train future models on Amazon's AWS Trainium and Inferentia chips,
with the idea being that in addition to just using these new chips,
they will help develop the future versions of them as well.
As part of the announcement, Anthropic also said that they're expanding support of Amazon's
Bedrock platform.
Bedrock is Amazon's approach to giving enterprises access to multiple models from a single space,
and Anthropic writes their increased support includes secure model customization and fine-tuning
on the service to enable enterprises to optimize Claude's performance with their expert
knowledge while limiting the potential for harmful outcomes.
Now, obviously everyone understands, instantly upon reading this, that this is Amazon's
Microsoft-OpenAI-style deal. It is Amazon going deep with one of the leading startups in the
foundation model space, and it is clearly a multi-pronged partnership designed to touch on everything
from Amazon's enterprise services all the way to their development of new AI chips. So you
have Microsoft teaming up with OpenAI, Amazon teaming up with Anthropic, and Meta, Google, and
presumably Apple all going it on their own. Speaking of Meta, we've had reports for a few weeks that
the company was developing AI chatbots with different personalities in an attempt to increase
engagement among younger users of their services. According to the Wall Street Journal, those
chatbots could be coming as early as this week. WSJ explains the context, saying, Going after younger
users has been a priority for Meta with the emergence of TikTok, which overtook Instagram in
popularity among teenagers in the past couple of years. The shift prompted Meta chief executive
Mark Zuckerberg in October 2021 to say the company would retool its teams to make serving young adults
their North Star rather than optimizing for the larger number of older people. Now, in terms of the
personalities of these bots that are actually coming, the WSJ writes about one called Bob the Robot,
which is a self-described sass master general with superior intellect, sharp wit, and biting
sarcasm. Now, the reference point is the robot Bender from Futurama, but I have to say,
the description of a sass master general makes me wonder just how in touch with the kids this
company really is. Now, obviously other social media companies like Snap have been experimenting with
chatbots to increase engagement as well, with frankly inconclusive results so far. Moving back
over to the world of Microsoft for just a moment, a job listing got some people chattering over the weekend.
Data Center Dynamics sums it up: Microsoft Cloud hiring to implement global small modular reactor
and microreactor strategy to power data centers. Basically, it appears that Microsoft is expanding their
engagement with nuclear energy and is potentially exploring how SMRs, or small modular reactors, could be
a part of their energy mix in the future. Radiant Energy Fund's Mark Nelson writes,
Word is out, Microsoft is plunging ahead on nuclear energy. They want a fleet of reactors
powering new data centers. And now they're hiring people from the traditional nuclear industry to get
it done. A world is coming where only the tech companies willing to become nuclear power developers
may get to keep expanding their cloud businesses, and only countries open to new reactors get to host
this expansion. A world where tech companies with 50% margins become the only survival hope for
traditional industrial concerns with 5% margins who need someone else to bootstrap a proper
electricity supply. The race is on. Now, speaking of advanced technology powering new data
centers and other manufacturing concerns, the South China Morning Post is reporting that China is
planning to get around U.S. chip sanctions by building a massive chip factory powered by a particle
accelerator. SCMP writes, China is exploring new avenues to bypass restrictions on lithography
machines which are used in the production of microchips. Using particle accelerators to create
a novel laser source, researchers are laying the foundation for the future of semiconductor fabrication.
Now, what's interesting to me about this story is just the way that it reflects how much is in flux
right now and how much U.S. policy towards China around artificial intelligence-related technologies
is having an impact on how that country plans its technological future.
Speaking of the U.S. government, Semafor is reporting that the White House is looking
into an executive order on artificial intelligence that would, among other things,
force cloud companies to disclose which AI companies were using their services.
From the article, the provision would direct the Commerce Department to write rules
forcing cloud companies like Microsoft, Google, and Amazon to disclose when a customer purchases
computing resources beyond a certain threshold. The order hasn't been finalized and the specifics
of it could still change. Now, Semafor draws the connection to KYC rules for banking, and basically
this is another way for authorities to have a sense of who's making extreme transactions, although
in this case in energy, in order to effectively get out ahead of challenges before they happen. As Semafor writes,
the rules are intended to create a system that would allow the U.S. government to identify potential
AI threats ahead of time, particularly those coming from entities in foreign countries. If a company in the
Middle East began building a powerful large language model using Amazon Web Services, for example,
the reporting requirement would theoretically give American authorities an early warning about it.
Now, the last piece is also really interesting. They write, the policy proposal represents a
potential step towards treating computing power like a national resource. Hold aside the specifics
of this potential executive order. There are a lot of people who sit at the intersection of global
politics and technology who think that that idea, that computing power is
a national resource, is a good one for the government to embrace. But for now, that is going to do
it for today's AI Breakdown Brief. We are kicking off the week with a bang. Stick around. We're
going to talk a little bit more about OpenAI's latest multimodal announcements, plus some big
speculation from Reddit, all of that and more coming up on the main episode.
Hey guys, one more quick thing before we get into the main episode. If you subscribe to the newsletter,
you've seen this, and you might have heard it on an earlier episode. But right now, I am getting
information from you guys, the listeners, about what you are looking for
in terms of AI educational resources.
A bunch of you have filled out the survey already, and it's so helpful.
But if you would take the about one minute and go to bit.ly slash AI breakdown survey,
I would love to know what type of online courses you might need,
what you're trying to learn more about,
whether you'd be interested in a community of learners.
I'm getting really close to making some decisions about what we're going to do next,
and I really want all of your input.
Again, it'll take about one minute,
and you can find it at bit.ly slash AI breakdown survey.
Thanks so much, and now on with the show.
Welcome back to the AI Breakdown.
Today, we are talking about the latest developments in ChatGPT.
They are very emblematic of the larger business battle that we find ourselves in the midst of.
And in the second part of the show, we will dig into some very intriguing rumors coming from Twitter and Reddit around the company
and just how much they've developed internally that we don't yet have visibility into.
But let's kick it off with the announcement from this morning that, as they put it, ChatGPT can now see,
hear, and speak. Logan, from developer relations over at OpenAI, says this is one of the biggest
evolutions for ChatGPT to date. Y'all are going to love these new capabilities, truly incredible.
So there are two big things going on here. The first is around using images as inputs for
ChatGPT. The example they give is they take a photo of a bike and ask ChatGPT to help them
lower the bike seat. ChatGPT responds, giving them a set of instructions and then saying,
if you have tools, show me and I'll guide you further. The prompter takes a close-up photo of a
specific part of the bike and draws a circle to let ChatGPT know to focus on that specific part.
The prompter Ryan says, is this the lever? To which ChatGPT responds, no, that's not a lever,
it's a bolt. You'll need an Allen wrench to loosen it. Now, obviously, we don't need to go too
deep into the details here, but the point is that all of a sudden you can use pictures of the
real world to interact with ChatGPT in a way that wasn't possible before. This, of course,
opens up a huge number of different use cases, which is why people have been excited about
multimodal and image-based inputs. Now, the second part of
it is that in addition to just using voice as an input for the ChatGPT mobile app,
ChatGPT can now talk back. OpenAI writes, Use your voice to engage in a back and forth conversation
with ChatGPT. Speak with it on the go, request a bedtime story, or settle a dinner table
debate. They've been loving the cutesy examples recently, and the bedtime story is the one that they
chose to demo. Now, one small technical detail that was interesting. When it comes to speech recognition,
they use Whisper, which has of course been lauded for being much farther ahead than many other
speech recognition services, and that's what's used to transcribe when someone speaks into
ChatGPT, but they write, The new voice capability is powered by a new text-to-speech model,
capable of generating human-like audio from just text and a few seconds of sample speech.
We collaborated with professional voice actors to create each of the voices.
They have five different voices, Juniper, Sky, Cove, Ember, and Breeze, that they give a demo of.
So is this all rolling out all at once?
The answer, of course, is no.
OpenAI writes, we are deploying image and voice capabilities gradually.
They basically say that this is their normal model anyways, but when it comes to things like image
and voices, it's even more important. They write, the new voice technology capable of creating
realistic synthetic voices from just a few seconds of real speech, opens doors to many creative
and accessibility-focused applications. However, these capabilities also present new risks,
such as the potential for malicious actors to impersonate public figures or commit fraud.
When it comes to the challenges of new image inputs, they say they range from hallucinations
to people overly relying on the model's interpretation of images in high-stakes domains.
So, TLDR, this is the update that The Information was reporting on about a week ago.
The context they gave, which is the one I agree with, is that the impending reality of Google's
Gemini is creating pressure for OpenAI to race towards multimodality, perhaps faster than they
might otherwise have.
That was Dr. Jim Fan from Nvidia's take when DALL-E 3 was announced.
He tweeted, it's not just a stance against Midjourney, it's actually a sneak peek of the upcoming
epic battle of massively multimodal LLMs against DeepMind Gemini.
Now, the interesting thing about this news is that it's so easy
to get caught up in this larger conversation of the battle between Google and OpenAI and this
larger phenomenon of competitive accelerationism, that we don't stop and remember how remarkable
these new features are. When it comes to increasing the utility of ChatGPT in a day-to-day way,
the ability to interact going back and forth via audio makes it unbelievably more useful for a mobile
world, but the ability to use images as inputs, especially when on the go, makes ChatGPT
so much closer to the actual superpowered AI assistant that so
many people have imagined. The bike example may seem small, but that's the type of thing that
people interact with every single day, day in and day out. That's the type of thing that people
use Google for. I wonder what percentage of my Google searches have something to do with
finding instructions or how to do something. It's probably a fairly big percentage,
relatively speaking. By having this type of image input, ChatGPT is effectively competing not just
with Google searches, but with my FaceTiming my brother who's much more technical than I am to
have him try to figure out something for me. We talked a bunch last week about how Google is trying
to differentiate by just loading up on actual utility and making their AIs more useful through
integrations with other tools like Google Workspace.
And then in the wake of that conversation, we saw Microsoft integrating AI everywhere
through Windows 11 updates and now OpenAI expanding the sort of day-to-day type of capabilities
that will make ChatGPT much more powerful.
Now, this would be interesting if it was the only ChatGPT and OpenAI story, but it was not.
However, from this part of all confirmed announced things, we are now moving wildly into
the realm of speculation, so a huge grain of salt warning for everything that comes next in this
discussion. Over the weekend, there was a bunch of discussion on Reddit from two users who
claimed that they had access to OpenAI's internal models and who are sharing some of the
information that they had seen. I'll link to the specific posts, or at least Twitter screenshots
of those posts, but here are some of the highlights that these users claimed. One of those users,
FeltSteam, writes, So OpenAI obviously isn't just slowly developing one model at a time, but are, of course,
working on multiple. The one that I know most about has an internal name of Arrakis. It is kind of
wild. So far as I know, it's an everything-to-everything model, meaning you can input any combination
of text, image, audio, and video. So what are some of the other details that these posters
give? Well, one, they say that Arrakis exceeds GPT-4 capabilities and can match human experts in many
different fields. They claim that hallucination rates are much lower than GPT-4, and interestingly
that half of the training data was synthetic. Now, this has been an ongoing conversation about the
extent to which synthetic data might be problematic for training AI models in the future,
although there have been some results, including an unreleased Facebook Llama 2 model, that
suggests that synthetic data actually can increase performance as well. In other words, that's a
big open question, so it's fascinating that potentially this advanced model has 50% of its data
coming from synthetic sources. Now, when it comes to when this stuff is coming out, the poster
writes, in terms of release date, they originally didn't plan to release in 2024, but I think
it's entirely possible to see it release sometime during 2024 as their timelines have been
accelerated, though it is their fault that everyone is accelerating in AI development as the release
of ChatGPT and GPT-4 showed what was possible, and now people are slowly catching up, so it's
complicated. Now, the other speculation around OpenAI comes around a Twitter account
called Jimmy Apples. People are paying attention to this a little bit more than they might
a random Twitter account, because after The Information reported on September 18th that OpenAI
was going to be releasing these multimodal features and that they were working on a new multimodal
LLM called Gobi, Jimmy Apples pointed out his own tweet from April 28, where he said,
the big multimodal currently in the works at OpenAI is called Gobi. Should I leak more?
Given that they were right about that, people are paying attention. And on September 18th, Apples
tweeted, AGI has been achieved internally. Now add to that a bunch of cryptic tweets from Sam Altman,
one being, sure, 10x engineers are cool, but damn those 10,000x engineers and researchers...,
and the other being, short timelines and slow takeoff will be a pretty good call, I think,
but the way people define the start of the takeoff may make it seem otherwise. So this has just
ratcheted AGI speculation up to about a thousand.
Simeon CPS tweets, can we consider seriously the hypothesis that, one, the recently hyped
tweets from OpenAI's staff, two, AGI has been achieved internally, three, Sam Altman's
comments on the qualification of slow or fast takeoff hinging on the date you count from,
four, Sam Altman's comments on 10,000x researchers, are actually mapping to something true.
The implications are so crazy in terms of power shift, or levels of risk over the next few months.
Now, Sully Omar captured some of my feeling when he wrote,
this whole thing is giving weird vibes. There's two possibilities. One, OpenAI has achieved AGI
internally. Two, they're messing with everyone slash hyping things up for fun, question mark. But one thing is for
sure. AGI is coming way faster than everyone thinks it is. Eliezer Yudkowsky seems to agree. The Twitter
account @PauseAI responded to Sam Altman's tweet about short timelines and slow takeoff and
said, how about we abort launch? To which Eliezer responded, you're talking to the wrong person.
OpenAI has zero ability to stop the avalanche they started. That's now a matter for treaties
between major powers. So friends, lots of intriguing things happening in the world of AI. We've certainly
got another example of that competitive accelerationism we've been talking about. And holding aside all of the
speculative stuff about AGI, we have a massively more performant and useful ChatGPT
coming right around the corner. From the sheer standpoint of people who use ChatGPT and tools like
Midjourney for productivity, October is gearing up to be a very, very good month. We will, of course,
keep you up to date on all of the developments, including probably any relevant speculation, here on
the AI Breakdown. But for now, that is going to do it for the show. Thanks as always for listening
or watching, and until next time, peace.
