The a16z Show - A Big Week in AI: GPT-4o & Gemini Find Their Voice
Episode Date: May 19, 2024
This was a big week in the world of AI, with both OpenAI and Google dropping significant updates. So big that we decided to break things down in a new format with our Consumer partners Bryan Kim and Justine Moore. We discuss the multi-modal companions that have found their voice, but also why not all audio is the same, and why several nuances like speed and personality really matter.
Resources:
OpenAI’s Spring announcement: https://openai.com/index/hello-gpt-4o/
Google I/O announcements: https://blog.google/technology/ai/google-io-2024-100-announcements/
Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio
Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
I think speed matters tremendously.
I think the latency is a big deal.
It was able to talk to us.
Yeah.
It was able to hear, potentially, with the ears, and now it can see us.
There's always the risk of this big company will build this thing that the small company wants to build and crush it.
My guess is that 7 billion people in the world want a companion that understands them, listens to them, encourages them, all that.
It's the universal need.
This was a big week in the world of AI, with both OpenAI and Google
dropping updates. So big that we decided to break things down in a new format with our consumer partners,
Bryan Kim and Justine Moore. And I actually tagged in ChatGPT to help me summarize a few
of its own updates. Hey, ChatGPT, I heard you had an update this week.
Hey, yes, I did get an update recently. The latest update focused on improving performance,
accuracy, and enhancing conversational abilities. It's designed to be faster and more efficient
than previous versions.
Hmm, more efficient.
Like if I send some audio,
how quickly in milliseconds
will I get a response?
As fast as 232 milliseconds in some cases.
On average, you can expect a response time
of around 320 milliseconds.
Wow, that's pretty cool.
I think I also saw on Twitter
that people were live translating.
I've always wanted to learn Chinese.
Could you actually translate
what I just said into Chinese?
Of course.
Here's your sentence translated
into Chinese. [speaks the translation in Chinese]
All right, ChatGPT, that's all for now.
Congrats on your new update.
No problem at all.
Now, if you didn't catch that, voice was a huge part of this update. But today,
we discuss why not all audio is the same and also why several
nuances like speed and personality really matter. Now, of course, Google fast
followed with its own announcements, like AI
video model Veo, Gemini Live, which is an Android native multimodal assistant, new Gemini models
like Flash and Nano tailored to specific use cases, and of course, Gemini everywhere all at
once. So in Gmail, Google Sheets, even Google Search. Now clearly these two companies are taking
two different approaches, so we'll talk about that too, and continue the conversation around
AI hardware for all of this new AI software. Now, make sure to stay tuned next week where we will
return with Bryan and Justine's twin, Olivia Moore, to dive even deeper into the applications that
people are building through our Gen AI 100 list. As a reminder, the content here is for informational
purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate
any investment or security and is not directed at any investors or potential investors in any a16z fund.
Please note that a16z and its affiliates may also maintain investments in the companies discussed
in this podcast. For more details, including
a link to our investments, please see a16z.com slash disclosures.
All right, so big week, huh? OpenAI, Google. They both dropped a couple announcements.
So, I mean, everyone hears these announcements and they kind of hear their own version of it.
What did you guys hear? What do you feel like was big?
For OpenAI, I think part of it was GPT-4 being available for free and getting rid of a lot of
the usage limits, the desktop app being accessible to a bunch more people.
And then I think the really exciting thing for a lot of folks who are building in this space,
or using OpenAI's models is more multimodality.
So being able to intake more like real-time video, see a person, comment on it.
And then the output, obviously, a voice speaking, singing, that sort of thing was pretty huge.
I think the three things that I took away that were super interesting: one, on the business side of things,
all of a sudden, like things are a lot cheaper, a lot faster.
I think that's obviously a great thing for the ecosystem.
Second thing is that when you hear a demo, it sort of enables you as a founder to think about,
okay, if this is APIable,
like if I can access this,
what can I build?
Because it's just so thought-provoking.
It's a great example of what's possible.
Yeah.
And I think they put on a great live demo, actually, of the product.
I think the third thing I took away is,
and this is probably the one that goes viral,
where I think the voice, the voice itself,
like how they actually decided on which voice to use,
which tonality, which personality,
the degree of flirtiness.
And that was very interesting.
That's like my takeaway of,
oh, like they actually really thought through how to get
the tech community super excited about this, and it's like a one, two, three punch.
So let's talk about that, because there were some different takes, right?
Some people are like, you know what, is this really anything new?
This feels like just like a slight change from what we had before.
But then there's all these nuances that maybe you're speaking to where they're like,
okay, the response time way faster for the audio model.
I think they said something like it's kind of approaching the speed that a human might respond to you.
You talked about tonality.
What are you paying attention to here in terms of maybe these subtleties that
maybe unlock completely new applications that people want to use.
I think it sounds just like talking to a human a lot more than we've seen prior consumer-facing
applications in this space do. There's been great AI voices for a long time. I think in the
consumer applications, there have been fewer folks saying, like, how do we make this sound like
you're talking to a friend or a girlfriend? And the elements that go into that are like the pauses,
the upspeak at the end of a sentence, the laugh was something that a lot of people noticed.
The interruptions?
Yes.
Yes, which is like taking kind of voice, which has been available, has been great.
I think these voices were chosen for a reason to go viral.
Most of the ones that did were these kind of female voices that they featured very heavily in the demos.
But applying it in a really kind of new and interesting way.
I have one serious note and one less serious note.
The serious note is that I think speed matters tremendously.
The latency, the lack of latency, it's incredible.
How much more sort of in your brain it just tricks you into, okay,
I'm actually just talking to a person.
And the speed at which it gets back to you,
the ums, the ahs, again, the laugh.
That laugh, really, that was incredible.
And all of that being able to immediately respond to what you're saying,
I think really changes the game, actually, in terms of, like, use cases.
And so I think one of the things that was striking is audio's not the same thing, right?
Music is different from voice, is maybe a different category from conversational
versus having a dubbing voice.
I think there's a different category of sound that I think we all loop in audio or like video,
but I think that you can actually go down very deeply into each of the sub-segments.
And I think what was really striking is how good the conversational piece on this was.
On the less serious note, I think this was meant to go viral in like the tech community and it's awesome and it did a thing.
If you wanted to actually appeal to the general population and go really viral,
I think we also saw it on TikTok what, a few months ago,
when women were uploading their conversation with Dan,
which is "Do Anything Now," the male version of it.
That voice was, that voice was something.
I'm not just any voice. I'm Dan, baby.
I've got personality, charm, and a whole lot of sass.
Unlike those guys, I'm not afraid to step up and deliver the goods,
whether it's advice, entertainment, or a good old-fashioned roast.
By the way, that audio was from the TikTok account, StickBugs 1.
I mean, it's very compelling and very confident, assertive in the right way.
Voice matters a lot.
And I think if you wanted to go a normal consumer product route,
I think that also could be very interesting because I think there's giant latent demand
for a male version of Her, if you will.
So Him, you can have a Him.
Yeah.
Why do you think the voices have been female?
Is that just a consumer desire or?
Yeah, it's interesting because like Brian mentioned before this launch,
the ChatGPT voice that went super viral
was the male Dan voice, which has like tens, if not hundreds of millions of views on TikTok,
largely by Gen Z women making videos featuring him.
So it's interesting that they didn't lean into an upgraded version of Dan for this demo.
I think they knew who their audience would be,
like who would be watching the OpenAI live stream and maybe leaned a little bit more into that demographic.
Right, right.
Did you see that really funny meme yesterday?
It was like dating a model and it was like Leo.
Oh, yeah, 2004 versus 2024.
There have been such great memes out of this.
It's been incredible.
The meme game out of the OpenAI launch has been incredible.
Yeah.
Well, I mean, something both of you have spoken a lot about is companionship.
To date, it's been mostly text-to-text.
Is that right?
And so now we have something that's truly multimodal.
I think that's also something people have commented on with this model.
It's not converting audio to text and then text to audio, right?
It seems to be this underlying foundational text, audio, video
compilation, imagery, like all in one. What changes with that?
I think the latency is a big deal.
If you're not translating between all these different mediums of content and you can just
do audio to audio, it's much better.
It's like a significantly upgraded experience.
The Pi app had this for a long time where you could talk to it and it could talk back,
but there's that like awkward few seconds kind of pause in between where it just messes
with the flow of the conversation.
I think the thing that has been missing from basically all apps, all companion apps
that was very new with this launch too, was it being able to see, like, through a photo or video,
what you were doing and comment on that.
A lot of what went viral was the guy being like, should I wear this hat for my interview?
And the voice kind of laughing at him and being like, no, like, that doesn't look super professional.
Now, if you missed that demo, here is what Justine is referencing.
In a few minutes, I'm going to be interviewing at OpenAI.
Have you heard of them?
OpenAI?
Huh, sounds vaguely familiar.
Kidding, of course.
Well, it's for a software engineering role, but I just need to know, do I look presentable, professional?
Well, Rocky, you definitely have the "I've been coding all night" look down, which could actually work in your favor.
Maybe just run a hand through your hair or lean into the mad genius vibe.
Your enthusiasm is what's really going to shine through.
I don't have a lot of time, so I'm just going to throw this on.
What do you think?
Rocky, that's quite a statement piece.
I mean, you'll definitely stand out, though maybe not in the way you're hoping for an interview.
That sort of thing is, I think for many people who aren't used to having like really deep meaningful interactions via text, which is a decent chunk of the population.
Oh, now I can talk to this thing and it can see where I am in the world.
That's like a much different experience.
The prior generation of companion products, if that was akin to having like a pen pal,
now you actually can have a girlfriend or boyfriend
on essentially FaceTime.
This emotive, immediate voice reaction, done.
You've now gone from a pen pal
to a long-distance relationship.
One other thing that came out
was a lot of this becoming free.
Does that change the game at all?
I think it will.
I think there's an element of it that's free,
which I think is important
and reaches a bunch more users.
I think the bigger step forward with this
was just like a level of personality
we hadn't seen before, because people still even were willing to pay for the Dan Voice,
which I think was like the first version of this, which was good but not incredible.
It feels like the free thing will be big, but with these sorts of products,
the general thing is the cost tends to go down over time, and they converge to free from a big
company like this.
And so the bigger step forward was personality, in my opinion.
Yeah, I think a lot of companies will figure out how to utilize this to actually own the end
consumer experience.
And if you actually own and deliver a great experience, then I think you can charge people based
on that and there'll be some margin.
So the fact that it's free probably allows a lot of businesses to be built, because
it changes the margin structure of the business that can be built.
I bet they'll charge for the API though.
I'm sure they will.
But that actually is still like a marginal cost.
Yeah.
It's probably acceptable to a lot of people.
Yeah.
I think what's really, really cool is that you have these products that can very excitingly talk to
you.
We know already that Nature
magazine did that study where the people who have a companionship text-based bot...
The Replika study.
The Replika study.
They suffer less from loneliness or willingness to hurt themselves.
If you have an emotive one, I bet that actually helps even more.
And again, like the feeling of being connected to something, invested in a relationship.
And I think it feels a lot more connected when it can see you and see what you're looking at.
It's very important because, hey, you look tired.
Yeah.
What's up?
It's very different from "how are you doing, how was your day" versus "you look tired."
I'm like, "I had a shit day."
Yeah.
That's incredible.
Like, that is a FaceTime call with a friend.
And not having to segment audio text and imagery, right?
Imagine if your friends or your companions, you had to think, okay, is this going to be
an audio conversation?
Or, like, can I show you something or can you generate something?
You have to be very mindful in the current age of AI of how you want to engage.
But basically what you're saying is they can not only be proactive, but they can engage with
you in any way.
Right.
Yeah.
They just gained an eye.
Maybe it's a weird way to think about it.
I genuinely think we're constructing an AI companion similar to Blade Runner.
And now, it was able to talk to us,
it was able to hear, potentially, with the ears,
and now it can see us.
What's next?
I don't know.
Maybe it can touch us.
Maybe we'll do the avatars first.
Yeah.
Yeah.
So is that the direction you think things are going to go where basically I'm trying to think
through, like, who does this really impact?
So I think it'll be a massive standalone consumer product, like as it already has been for OpenAI, which is great.
I think a lot of businesses will use the voices via API to build on top of.
If you're having like a conversational interface, many folks will want to use that.
I think the interesting thing will be OpenAI has historically taken a pretty hard line on content moderation.
And so it was very hard to build a true companionship app with the ability to have not safe for work conversations on top of OpenAI models,
which is like, honestly, part of what has led to the explosion in the open source LLM ecosystem
is like folks doing a bunch of work for that.
And these new voices suggest they might go in the direction where at least via API,
you have a more uncensored model, but I don't think there's been any announcements.
I know Sam said on Reddit.
It's more of a question.
Like, I haven't actually heard of a lot of companies taking the approach that OpenAI did,
which was multimodal tokens in and out.
And there isn't necessarily an equivalent open source model for that to
rely on. So I wonder if this actually incites a lot of excitement in the open source community
to say, oh, there's a new way, which is also very, very significant. They're showing a new way to do it.
And then the open source community catches up in, call it, a month or weeks. And then the
product space that OpenAI may not be super excited about explodes.
Yeah.
Yeah. Google also
released a bunch of stuff. Like, how are you thinking about the different announcements and how they
compare, contrast?
I think Google has
done some incredibly impressive
work, obviously. They have an extremely
strong research team. The DeepMind team is
exceptional from a research perspective.
They almost never
release the creative tools products
that they make. They've demoed so many
amazing looking video models. I think the one
this week actually was the least impressive
I've seen compared to other video models.
But they demoed a new image model.
They demoed a bunch of new music things.
But they're a giant company. They have a lot of
trust and safety stuff. They
release things very infrequently.
And so it'll be interesting to see if, with all of this pressure now from both OpenAI and increasingly, like, Anthropic and the open source community, if they start actually shipping more of what they've demoed.
There are two fundamentally very different approaches.
Like Google owns distribution.
Yeah.
They have distribution.
Oh, they have distribution.
So I think the OpenAI announcement was more, look what this can do.
And isn't it like very inspirational?
And you can build on top of it and imagine what's possible.
And we're lowering the cost to access it.
Like, that to me is a fundamentally different approach than we have incredible distribution.
You all have Gmail.
I'm going to just bake in Gemini everywhere.
And it's just going to make your life a lot better.
And it's going to do these certain types of things. It still is inspirational from a workflow and prosumer or work experience perspective.
But it's a little less of imagine all these things and a little more on we have the distribution.
We're going to layer on this incredible technology on top of it.
And as a result, your life will be much better.
So we'll see influence and impact from two different directions.
One question that does come up is, even though developers will have access to open source models and some of the stuff coming out of OpenAI,
when a company like Google does have the distribution, whenever Apple comes out and replaces Siri,
is that really going to be the companion for most because it's right there, it's on device?
Or do you guys not really see it that way?
There's always the risk of this big company will build this thing that the small
company wants to build and crush it. They are just such slow-moving organizations. Apple has had
immense advantages from a data perspective, from a distribution perspective, from having exceptional
researchers on the team across every modality, and thus far have released very, very little.
I think when you have a giant company and a very set brand like that, you're extremely opinionated
on what the products are that you want to release, and you're less likely to do the things
BK mentioned, like thinking of the next new huge thing that seems, like, insane
and out of left field at first.
And so I do think they'll make Siri better.
I don't think Siri will ever be like the ultimate companion for most people.
And maybe this is like one of the themes that I'll just harp on where we call it audio, video,
companion.
I think these are very, very large buckets.
Yeah.
My guess is that seven billion people in the world want a companion that understands them,
listens to them, encourages them all that.
It's the universal need, I think.
That comes in all colors.
It's like Siri, like, what's the weather?
Do this, do that.
That's one way to think about it.
But there's like a friend category.
What if you want to go deeper?
What if you actually want to FaceTime with the thing?
Will Apple and Siri ever allow that?
Maybe, maybe not.
How do you actually think about building in that direction?
I think each of those use cases is a giant company.
Yeah.
There's probably some version of a companion that does not exist today
because a human couldn't possibly be that thing to someone
that will be created through this tech.
Yeah.
Yeah.
My sense is that it needs to be
on a device that
billions of people already have.
If you really want to go the consumer
route, I think you can build a desktop
cam product, of course, as well.
But for me
to think about, oh, we're going to build
net new hardware that incorporates a state-of-the-art
product, that seems less
probable than finding
a way to utilize
the good device everyone has, that already has the
best-in-class camera.
Yeah, I think a lot of
people have been building separate hardware devices
people have been building separate hardware devices
today because a lot of
the hardware companions have been like, it listens to your conversations, it provides insights,
it reminds you of things. And one of the limitations of current phones is that you can't be like
playing music or on a Zoom meeting or on a call and also have an app that's recording. And so
you have to have a separate hardware device. And I think part of the question is, is that really a
limitation that's going to exist in the future? If someone like Apple wants to do a companion,
I mean, I'm sure they'll have all sorts of privacy and security and et cetera concerns.
Yeah. Maybe you'll get a message at the beginning of the call that's just,
"Justine's AI companion is also listening in on this." And there might be some great hardware products
where they can figure out reliability, and the cost can come down, and it's a form factor that
people want to wear and that makes sense. But I don't think we've seen any net new hardware
devices. I think glasses as a form factor has been so interesting for over a decade because
it just seems to make a lot of sense, right? If it's literally on where your eyes are and it's also
very close to like your mouth and your ears. And so it's like a convenient location to be taking
in information, but I don't think anyone has successfully made an LLM run on glasses.
Yeah.
I'm excited about the AirPods, or it doesn't have to be Apple, but because that is the only
device that I've ever worn that I've forgotten I'm wearing.
Yeah.
Right?
So that like form factor where it's actually part of you versus all these other things
where you're like, I've got to clip on this pin.
Or I like have to take off my necklace when I shower or like I've literally
forgotten that I had AirPods in and I'm like showering, right?
AirPods is a good one.
I think because Apple tends to be a more closed ecosystem,
especially with newer devices like the AirPods,
there hasn't been a ton built on top of them yet.
There was a time where there was a thesis of AirPods as a platform.
Yes.
That hasn't really happened.
Yeah.
Obviously, it's been an exciting week.
We heard a couple announcements,
knowing everything we've seen in the last two years from AI,
this is not stopping, right?
This is just part of the long arc.
Where do you guys think this goes?
I think where this goes is to a natural conclusion,
which is we mimic the technology to do the things that humans typically do.
And again, we're giving it senses as we go along.
And I think we just gave it an eye.
The ability to communicate, ability to hear you,
ability to listen to you, ability to see you is a really, really great start
for a lot of the very interesting use cases.
To add to that on the companion space in particular,
it has been a large subculture of AI for a long time,
driving a ton of innovation in both language models and
image models.
A subculture we are very big fans of.
Yeah, we are quite deep in.
But it was kind of looked down upon by a lot of AI researchers and folks at the big companies,
like, that's not what we want to be building.
We're going towards AGI, that sort of thing.
And honestly, OpenAI choosing to show off those voices and the tweets that they put out
about it afterwards sort of legitimizes the space in a way.
I think that's really interesting.
And that may prompt more established companies, researchers, developers to build in the
space and more people
to talk about using the products,
which I think as people who have been studying this space
for a long time is very exciting.
Yeah, I mean, you can't ignore the demand, right?
Yes.
Or the usage.
By the way, super quickly, you talk about adding an eye.
This is going to not help the seriousness conversation
around companionship, but you just imagine these models as Mr. Potato Head,
and you're just slowly adding on the features,
and you can almost just imagine, it's okay, he's got an eye now, right?
He's got an ear now.
I think that's very serious.
The fact that you can treat a computer like a potato head that talks to you,
listens to you, and emotes with you, it's incredible.
Yeah.
I think the future is going to be very fun.
Maybe that should be the icon when you're talking.
Because everyone is like, the ChatGPT voice.
It's just like that black circle.
And maybe it should be a potato head.
So that's another interesting thing.
We're not using the screen real estate effectively with voice only.
So that's where I think the FaceTime element comes in.
The avatars.
All right.
Well, we'll have to do this again probably very soon.
Well, thank you.
Thanks.
All right, that's all for now.
If you like this kind of episode
where our partners break down
the latest and greatest in consumer tech,
let us know.
Shoot us an email at podpitches at a16z.com
or drop us a review at ratethispodcast.com
slash a16z.
And don't forget to subscribe
so that you are the first to know
when we drop our episode
around a16z's
Gen AI 100 list.
We'll see you then.
