a16z Podcast - A Big Week in AI: GPT-4o & Gemini Find Their Voice
Episode Date: May 19, 2024

This was a big week in the world of AI, with both OpenAI and Google dropping significant updates. So big that we decided to break things down in a new format with our Consumer partners Bryan Kim and Justine Moore. We discuss the multi-modal companions that have found their voice, but also why not all audio is the same, and why several nuances like speed and personality really matter.

Resources:
OpenAI's Spring announcement: https://openai.com/index/hello-gpt-4o/
Google I/O announcements: https://blog.google/technology/ai/google-io-2024-100-announcements/

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
I think speed matters tremendously.
I think the latency is a big deal.
It was able to talk to us.
Yeah.
It was able to hear, potentially, with the ears, and now it can see us.
There's always the risk of this big company will build this thing that the small company wants to build and crush it.
My guess is that it's 7 billion people in the world wanting one companion that understands them,
listens to them, encourages them, all that.
It's the universal need.
This was a big week in the world of AI, with both OpenAI and Google dropping significant updates. So big that we decided to break things down in a new format with our consumer partners, Bryan Kim and Justine Moore. And I actually tagged in ChatGPT to help me summarize a few of its own updates. Hey, ChatGPT, I heard you had an update this week.
Hey, yes, I did get an update recently. The latest update focused on improving performance, accuracy,
and enhancing conversational abilities. It's designed to be faster and more efficient than previous
versions. Hmm, more efficient. Like if I send some audio, how quickly in milliseconds will I get a
response? As fast as 232 milliseconds in some cases. On average, you can expect a response time of
around 320 milliseconds. Wow, that's pretty cool. I think I also saw on Twitter that people
were live translating. I've always wanted to learn Chinese. Could you actually translate what I
just said into Chinese? Of course. Here's your sentence translated into Chinese.
[ChatGPT repeats the sentence in Chinese.]
Wow, this is too cool.
All right, ChatGPT,
that's all for now.
Congrats on your new update.
No problem at all.
Now, if you didn't catch that,
voice was a huge part of this update.
But today, we discuss
why not all audio is the same
and also why several nuances
like speed and personality
really matter.
Now, of course,
Google fast-followed with its own announcements,
like its AI video model,
Veo; Gemini Live, which is an Android-native multimodal assistant; new Gemini models like Flash
and Nano tailored to specific use cases, and of course, Gemini everywhere all at once. So in
Gmail, Google Sheets, even Google Search. Now clearly these two companies are taking two different
approaches, so we'll talk about that too, and continue the conversation around AI hardware
for all of this new AI software. Now, make sure to stay tuned next week, when we will return
with Bryan and Justine's twin, Olivia Moore, to dive even deeper into the applications that people
are building, through our Gen AI 100 list.
As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
All right, so big week, huh? OpenAI, Google. They both dropped a couple announcements. So,
I mean, everyone hears these announcements and they kind of hear their own version of it.
What did you guys hear? What do you feel like was big?
For OpenAI, I think part of it was GPT-4o being available for free and getting rid of a lot of the usage limits,
and the desktop app being accessible to a bunch more people.
And then I think the really exciting thing for a lot of folks who are building in the space
or using OpenAI's models is more multimodality.
So being able to intake more like real-time video, see a person, comment on it.
And then the output, obviously, a voice speaking, singing, that sort of thing, was pretty huge.
I think the three things I took away that were super interesting: one, on the business side of things,
all of a sudden, things are a lot cheaper, a lot faster.
I think that's obviously a great thing for the ecosystem.
Second thing is that when you hear a demo, it sort of enables you as a founder to think about, okay, if this is APIable, like if I can access this, what can I build because it's just so thought-provoking.
It's a great example of what's possible.
And I think they put on a great live demo, actually, of the product.
I think the third thing I took away is, and this is probably the one that goes viral, where I think the voice, the voice itself, like how they actually decided on which voice to use, which tonality, which personality, the degree.
of flirtiness. I think that was very interesting. That's like my takeaway of, oh, like they actually
really thought through how to get the tech community super excited about this. And it's like a one,
two, three punch. So let's talk about that. Because there were some different takes, right? Some people are like, you know what, is this really anything new? This feels like just a slight change from what we had before. But then there's all these nuances that maybe you're speaking to, where they're like, okay, the response time is way faster for the audio model. I think they said
something like it's kind of approaching the speed that a human might respond to you. You talked
about tonality. What are you paying attention to here in terms of maybe these subtleties that maybe
unlock completely new applications that people want to use? I think it sounds just like talking to a
human a lot more than we've seen prior consumer-facing applications in this space do. There's been
great AI voices for a long time. I think in the consumer applications, there have been fewer
folks saying, like, how do we make this sound like you're talking to a friend or a girlfriend?
And the elements that go into that are the pauses, the upspeak at the end of a sentence, the laugh, which was something that a lot of people noticed.
The interruptions.
Yes.
Yes.
Which is like taking voice, which has been available and has been great.
I think these voices were chosen for a reason to go viral.
Most of the ones that did were these kind of female voices that they featured very heavily in the demos.
But applying it in a really kind of new and interesting way.
I have one serious note and one less serious note.
The serious note is that I think speed matters tremendously.
The latency, the lack of latency, it's incredible.
How much more sort of in your brain it just tricks you into, okay, I'm actually just talking
to a person.
And the speed at which it gets back to you, the ums, the nods, again, the laugh, that
laugh, really, that was incredible.
And all of that being able to immediately respond to what you're saying, I think really
changes the game, actually, in terms of, like, use cases.
And so I think one of the things that was striking is that audio is not all
the same thing, right? Music is different from voice, which is maybe a different category from
conversational audio, versus having a dubbing voice. I think there's a different category of sound
that I think we all lump into audio or video, but I think that you can actually go down
very deeply into each of the sub-segments. And I think what was really striking is how good
the conversational piece on this was.
On the less serious note, I think this was meant to go viral with the tech community, and it's awesome, and it did the thing. If you wanted to actually appeal to the general population and go really viral, I think we also saw it on TikTok a few months ago, when women were uploading their conversations with Dan, which is Do Anything Now, the male version of it. That voice was something.
I'm not just any voice. I'm Dan, baby. I've got personality, charm, and a whole lot of sass. Unlike those guys, I'm not afraid to step up and deliver
the goods, whether it's advice, entertainment, or a good old-fashioned roast.
By the way, that audio was from the TikTok account, SickBugs 1.
I mean, it's very compelling and very confident, assertive in the right way.
Voice matters a lot.
And I think if you wanted to go the normal consumer product route, I think that also could
be very interesting, because I think there's giant latent demand for a male
version of Her, if you will.
So Him. You can have a Him.
Yeah.
Why do you think the voices have been female?
Is that just a consumer desire or?
Yeah, it's interesting because like Brian mentioned, before this launch,
the ChatGPT voice that went super viral was the male Dan voice,
which has tens, if not hundreds, of millions of views on TikTok,
largely by Gen Z women making videos featuring him.
So it's interesting that they didn't lean into an upgraded version of Dan for this demo.
I think they knew who their audience would be,
like who would be watching the OpenAI live stream,
and maybe leaned a little bit more into that demographic.
Right, right.
Did you see that really funny meme yesterday?
It was like dating a model, and it was like Leo.
Oh, yeah, 2004 or is Leo versus a third.
There have been some great memes out of this.
It's been incredible.
The meme game out of the OpenAI launch has been incredible.
Yeah.
Well, I mean, something both of you have spoken a lot about is companionship.
To date, it's been mostly text-to-text.
Is that right?
And so now we have something that's truly multimodal.
I think that's also something people have commented on with this model. It's not converting audio to text and then text to audio, right? It seems to be this one underlying foundational model, text, audio, video, imagery, all in one. What changes with that?
I think the latency is a big deal. If you're not translating between all these different mediums of content and you can just do audio to audio, it's a significantly upgraded experience. The Pi app had this for a long time, where you could talk to it and it could talk back, but there's that awkward few-second pause in between that just messes with the flow of the conversation.
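To make that architecture point concrete, here is a minimal sketch, with hypothetical stand-in functions and purely illustrative latencies, of why a cascaded speech pipeline feels slower than a natively multimodal one. The exact numbers don't matter; the point is that every hop adds delay, and the text bottleneck discards tone, laughter, and interruptions.

```python
import time

# Hypothetical stand-ins for real services; the names and latencies
# below are illustrative assumptions, not any vendor's actual API.
def speech_to_text(audio):
    time.sleep(0.3)  # transcription hop
    return "transcript"

def llm_generate(text):
    time.sleep(0.5)  # text-only model hop
    return "reply"

def text_to_speech(text):
    time.sleep(0.3)  # synthesis hop; prosody gets bolted on at the end
    return b"reply-audio"

def native_speech_model(audio):
    time.sleep(0.3)  # one model, audio tokens in and audio tokens out
    return b"reply-audio"

def cascaded(audio):
    # Older design: three hops, so the delays add up, and emotional
    # cues are lost when the audio is flattened into text.
    return text_to_speech(llm_generate(speech_to_text(audio)))

def native(audio):
    # GPT-4o-style design: a single end-to-end hop, which is how
    # response times near human conversational speed become possible.
    return native_speech_model(audio)

for pipeline in (cascaded, native):
    start = time.time()
    pipeline(b"user-audio")
    print(f"{pipeline.__name__}: {time.time() - start:.1f}s")
```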
I think the thing that has been missing from basically all companion apps, that was very new with this launch, too, was it being able to see,
like, through a photo or video, what you were doing and comment on that. A lot of what went viral
was the guy being like, should I wear this hat for my interview? And the voice kind of laughing at
him and being like, no, that doesn't look super professional. Now, if you missed that demo,
here is what Justine is referencing.
In a few minutes, I'm going to be interviewing at OpenAI. Have you heard of them?
OpenAI? Huh? Sounds vaguely familiar.
Kidding, of course. That's incredible, Rocky. What kind of interview?
Well, it's for a software engineering role, but I just need to know, do I look presentable, professional?
Well, Rocky, you definitely have the "I've been coding all night" look down, which could actually work in your favor. Maybe just run a hand through your hair or lean into the mad genius vibe. Your
enthusiasm is what's really going to shine through. I don't have a lot of time, so I'm just going to
throw this on. What do you think? Rocky, that's quite a statement piece. I mean, you'll definitely
stand out, though maybe not in the way you're hoping for an interview. That sort of thing is, I think for
many people who aren't used to having like really deep meaningful interactions via text,
which is a decent chunk of the population: oh, now I can talk to this thing, and it can see
where I am in the world. That's like a much different experience. The prior generation of
companion products, if that was akin to having like a pen pal, now you actually can have a
girlfriend or boyfriend on essentially FaceTime. With this emotive, immediate voice reaction,
you've now gone from a pen pal to a long-distance relationship.
One other thing that came out was a lot of this becoming free.
Does that change the game at all?
I think it will.
I think there's an element of it that's free,
which I think is important and reaches a bunch more users.
I think the bigger step forward with this
was just like a level of personality we hadn't seen before
because people still even were willing to pay for the Dan Voice,
which I think was like the first version of this,
which was good but not incredible.
It feels like the free thing will be big,
but with these sorts of products,
the general thing is the cost tends to go down over time
and they converge to free from a big company like this.
And so the bigger step forward was personality, in my opinion.
Yeah, I think a lot of companies will figure out how to utilize this
to actually own the end customer experience.
And if you actually own and deliver a great experience,
I think you can charge people based on that,
and there'll be some margin.
So the fact that it's free probably allows a lot of businesses to be built,
because it changes the margin structure
of the businesses that can be built.
I bet they'll charge for the API, though.
I'm sure they will, but that actually is still like a marginal cost
that is probably acceptable to a lot of people.
Yeah.
I think what's really, really cool is that you have these products
that can very excitingly talk to you.
We know already that Nature did that study,
where the people who have a text-based companionship bot.
The Replika study.
The Replika study. They suffer less from loneliness
or willingness to hurt themselves.
If you have an emotive one,
I bet that actually helps even more.
And again, like, the feeling of being connected to something, invested in a relationship.
And I think it feels a lot more connected when it can see you and see what you're looking at.
And I think the FaceTime thing is very important because, hey, you look tired.
Yeah.
What's up?
It's very different, how are you doing, how was your day, versus, you look tired, what's up?
And I'm like, I had a shit day.
Yeah.
That's incredible.
Like, that is a FaceTime call with a friend.
And not having to segment audio, text, and imagery, right?
Imagine if, with your friends or your companions, you had to think, okay, is this
going to be an audio conversation, or can I show you something, or can you generate
something? You have to be very mindful in the current age of AI of how you want to engage.
But basically what you're saying is they can not only be proactive, but they can engage with you in
any way. Right. Yeah. They just gained an eye.
Maybe it's a weird way to think about it, but I genuinely think we're constructing an AI companion
similar to Blade Runner. It was able to talk to us. It was able to hear,
potentially, with like ears, and now it can see us.
What's next? I don't know. Maybe it can touch us.
Maybe we'll do the avatars first.
Yeah.
Yeah. So is that the direction you think things are going to go where basically I'm
trying to think through, like, who does this really impact?
So I think it'll be a massive standalone consumer product, like as it already has been,
for OpenAI, which is great. I think a lot of businesses will use the voices via API to build
on top of. If you're having like a conversational interface, many folks will want to use that.
I think the interesting thing will be, OpenAI has historically taken a pretty hard line on content moderation.
And so it was very hard to build a true companionship app, with the ability to have not-safe-for-work conversations, on top of OpenAI models, which, honestly, is part of what has led to the explosion in the open-source LLM ecosystem, folks doing a bunch of work for that.
And these new voices suggest they might go in the direction where, at least via the API, you have a more uncensored model, but I don't think there have been any announcements. I know Sam said on Reddit.
It's a question mark. Like, I haven't actually heard of a lot of companies taking the approach
that OpenAI did, which was multimodal tokens in and out. And there isn't necessarily
an equivalent open-source model for that to rely on. So I wonder if this actually incites a lot
of excitement in that open source community to say, oh, there's a new way, which is also very,
very significant. They are showing a new way to do it. And then the open-source community
catches up in, call it, months or weeks. And then the product space
that OpenAI may not be super excited about explodes.
Yeah. Yeah. Google also released a bunch of stuff.
Like, how are you thinking about the different announcements
and how they compare, contrast?
I think Google has done some incredibly impressive work, obviously.
Like, they have an extremely strong research team.
The DeepMind team is exceptional from a research perspective.
They almost never release the creative tools products that they make.
Like, they've demoed so many amazing-looking video models.
I think the one this week actually was the least impressive I've seen compared to other video models.
But they demoed a new image model.
They demoed a bunch of new music things.
But they're a giant company.
They have a lot of trust and safety stuff.
They release things very infrequently.
And so it'll be interesting to see if, with all of this pressure now from both OpenAI and, increasingly, Anthropic and the open-source community,
if they start actually shipping more of what they've demoed.
There are two fundamentally very different approaches.
Like Google owns distribution.
Yeah.
They have distribution.
Oh, they have distribution.
So I think OpenAI's announcement has more of, look what this can do.
And isn't it very inspirational, and you can build on top of it, and imagine what's
possible, and we're lowering the cost to access it?
That to me is a fundamentally different approach than: we have incredible distribution.
You all have Gmail.
I'm going to just bake in Gemini everywhere.
And it's just going to make your life a lot better.
And it's going to do these certain types of things.
It still is inspirational from a workflow and
prosumer, or work experience, perspective,
but it's a little less of, imagine all these things,
and a little more of, we have the distribution,
we're going to layer this incredible technology on top of it,
and as a result, your life will be much better.
So we'll see influence and impact from two different directions.
One question that does come up is,
even though developers will have access to open source models
and some of the stuff coming out of open AI,
when a company like Google does have the distribution,
whenever Apple comes out and replaces Siri,
is that really going to be the companion for most
because it's right there, it's on device,
or do you guys not really see it that way?
There's always the risk of this big company
will build this thing that the small company wants to build
and crush it.
They are just such slow-moving organizations.
Apple has had immense advantages
from a data perspective, from a distribution perspective,
from having exceptional researchers on the team
across every modality, and thus far has released very, very little.
I think when you have a giant company
and a very set brand like that,
you're extremely opinionated
on what the products are
that you want to release
and you're less likely to do
the things BK mentioned
of like thinking of the next new huge thing
that seems like insane
and out of left field at first.
And so I do think they'll make Siri better.
I don't think Siri will ever be
like the ultimate companion for most people.
And maybe this is one of the themes
that I'll just harp on,
where we call it audio, video,
companion. I think these are very, very large buckets.
Yeah. My guess is that it's 7 billion
people in the world wanting one companion that understands them, listens to them,
encourages them, all that. It's the universal need. I think that comes in all colors.
It's like Siri: what's the weather, do this, do that. That's one way to think about it.
But there's a friend category. What if you want to go deeper? What if you actually
want to FaceTime with the thing? Will Apple and Siri ever allow that? Maybe,
maybe not. How do you actually think about building in that direction? I think all of those
cases escape a giant company.
Yeah. There's probably some version of a companion that does not exist today because a human
couldn't possibly be that thing to someone that will be created through this tech.
Yeah.
My sense is that it needs to be on a device that billions of people already have.
If you really want to go the consumer route, I think you can build a desktop cam product,
of course, as well.
But for me to think about, oh, we're going to build net new hardware that incorporates
a state-of-the-art product, that seems less probable than finding a way to utilize the device
everyone already has, which already has a best-in-class camera. Yeah, I think a lot of people have been building
separate hardware devices today because a lot of the hardware companions have been like,
it listens to your conversations, it provides insights, it reminds you of things. And one of the
limitations of current phones is that you can't be like playing music or on a Zoom meeting or on a
call and also have an app that's recording. And so you have to have a separate
hardware device. And I think part of the question is, is that really a limitation that's going to
exist in the future? If someone like Apple wants to do a companion, I mean, I'm sure they'll have
all sorts of privacy and security and et cetera concerns. Maybe you'll get a message at the beginning
of the call that Justine's AI companion is also listening in on this. And there might be some great
hardware products, if they can figure out reliability, and the cost can come down, and it's a form factor
that people want to wear and that makes sense. But I don't think we've seen any net new hardware
devices. I think glasses, as a form factor, have been so interesting for over a decade
because it just seems to make a lot of sense, right? If it's literally on where your eyes are,
and it's also very close to, like, your mouth and your ears. And so it's, like, a convenient
location to be taking in information. But I don't think anyone has successfully made an LLM run on
glasses. Yeah. I'm excited about the AirPods, and it doesn't have to be Apple, because that is
the only device that I've ever worn that I've forgotten I'm wearing.
Yeah, right? So that like form factor where it's actually part of you versus all these other things where you're like, I got a clip on this pin.
Or I like have to take off my necklace when I shower or like I've literally forgotten that I had AirPods in and I'm like showering, right?
AirPods is a good one. I think because Apple tends to be a more closed ecosystem, especially with newer devices like the AirPods, there hasn't been a ton built on top of them yet.
There was a time where there was a thesis of AirPods as a platform. That hasn't really happened.
Yeah. Obviously it's been an exciting week. We heard a couple of announcements, and knowing everything we've seen in the last two years from AI, this is not
stopping, right? This is just part of the long arc. Where do you guys think this goes?
I think where this goes is to a natural conclusion, which is we're mimicking, with the technology, the things that humans typically do. And again, we're giving it senses as we go along, and I think we just gave it an eye. The ability to communicate, ability to hear you, ability to listen to you,
ability to see you is a really, really great start for a lot of the very interesting use
cases. To add to that on the companion space in particular, it has been a large subculture of
AI for a long time, driving a ton of innovation in both language models and honestly image models.
A subculture we are very big fans of.
Yeah, we are quite deep in.
But it was kind of looked down upon by a lot of AI researchers and folks at the big companies, like, that's not what we want to be building, we're going towards AGI, that sort of thing.
And honestly, OpenAI choosing to show off those voices, and the tweets that they put out
about it afterwards, sort of legitimizes the space in a way that I think is really interesting.
And that may prompt more established companies, researchers, and developers to build in the space,
and more people to talk about using the products, which, for people who have been
studying this space for a long time,
is very exciting. Yeah. I mean, you
can't ignore the demand, right? Yes.
Or the usage. By the way, super quickly,
you talked about adding an eye. This is
not going to help the seriousness of the conversation
around companionship, but you can just imagine
these models as Mr. Potato Head,
and you're just slowly adding
on the features. You can almost just imagine
it: okay, he's got an eye now, right? He's got an ear
now. I think that's very serious.
The fact that you can treat a computer
like a Potato Head that talks
to you, listens to you, and emotes with
you. It's incredible. I think the future is going to be very fun. Maybe that should
be the icon when you're talking. Because everyone knows the ChatGPT voice, it's just that
black circle. And maybe it should be a Potato Head. So that's another interesting thing. We're not
using the screen real estate effectively with voice only. So that's where I think the FaceTime
element comes in. Yeah. Avatars. All right. Well, we'll have to do this again, probably very soon.
Well, thank you. Thanks.
All right. That's all for now. If you liked this kind of episode, where our partners
break down the latest and greatest in consumer tech, let us know. Shoot us an email at podpitches@a16z.com or drop us a review at ratethispodcast.com/a16z. And don't forget
to subscribe so that you are the first to know when we drop our episode around a16z's Gen
AI 100 list. We'll see you then.