a16z Podcast - A Big Week in AI: GPT-4o & Gemini Find Their Voice
Episode Date: May 19, 2024

This was a big week in the world of AI, with both OpenAI and Google dropping significant updates. So big that we decided to break things down in a new format with our Consumer partners Bryan Kim and Justine Moore. We discuss the multi-modal companions that have found their voice, but also why not all audio is the same, and why several nuances like speed and personality really matter.

Resources:
OpenAI's Spring announcement: https://openai.com/index/hello-gpt-4o/
Google I/O announcements: https://blog.google/technology/ai/google-io-2024-100-announcements/

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
I think speed matters tremendously.
I think the latency is a big deal.
It was able to talk to us.
Yeah.
It was able to hear, potentially, with the ears, and now it can see us.
There's always the risk of this big company will build this thing that the small company wants to build and crush it.
My guess is that it's 7 billion people in the world wanting one companion that understands them,
listens to them, encourages them, all that.
It's the universal need.
This was a big week in the world of AI, with both OpenAI and Google dropping significant updates. So big that we decided to break things down in a new format with our consumer partners, Bryan Kim and Justine Moore. And I actually tagged in ChatGPT to help me summarize a few of its own updates. Hey, ChatGPT, I heard you had an update this week.
Hey, yes, I did get an update recently. The latest update focused on improving performance, accuracy,
and enhancing conversational abilities. It's designed to be faster and more efficient than previous
versions. Hmm, more efficient. Like if I send some audio, how quickly in milliseconds will I get a
response? As fast as 232 milliseconds in some cases. On average, you can expect a response time of
around 320 milliseconds. Wow, that's pretty cool. I think I also saw on Twitter that people
were live translating. I've always wanted to learn Chinese. Could you actually translate what I
just said into Chinese? Of course. Here's your sentence translated into Chinese.
[ChatGPT repeats the sentence in Chinese.]
Wow, this is too cool.
All right, ChatGPT,
that's all for now.
Congrats on your new update.
No problem at all.
Now, if you didn't catch that,
voice was a huge part of this update.
But today, we discuss
why not all audio is the same
and also why several nuances
like speed and personality
really matter.
Now, of course,
Google fast-followed with its own announcements,
like its AI video model,
Veo; Gemini Live, which is an Android-native multimodal assistant; new Gemini models like Flash
and Nano tailored to specific use cases, and of course, Gemini everywhere all at once. So in
Gmail, Google Sheets, even Google Search. Now clearly these two companies are taking two different
approaches, so we'll talk about that too, and continue the conversation around AI hardware
for all of this new AI software. Now, make sure to stay tuned next week, when we will return
with Bryan and Justine's twin, Olivia Moore, to dive even deeper into the applications that people
are building, through our Gen AI 100 list.
As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
All right, so big week, huh? OpenAI, Google. They both dropped a couple announcements. So,
I mean, everyone hears these announcements and they kind of hear their own version of it.
What did you guys hear? What do you feel like was big?
For OpenAI, I think part of it was GPT-4o being available for free and getting rid of a lot of the usage limits,
and the desktop app being accessible to a bunch more people.
And then I think the really exciting thing for a lot of folks who are building in the space
or using OpenAI's models is more multimodality.
So being able to intake more like real-time video, see a person, comment on it.
And then the output, obviously, a voice speaking, singing, that sort of thing, was pretty huge.
I think the three things I took away that were super interesting: one, on the business side of things,
all of a sudden, things are a lot cheaper, a lot faster.
I think that's obviously a great thing for the ecosystem.
Second thing is that when you hear a demo, it sort of enables you as a founder to think about, okay, if this is APIable, like if I can access this, what can I build because it's just so thought-provoking.
It's a great example of what's possible.
And I think they put on a great live demo, actually, of the product.
I think the third thing I took away is, and this is probably the one that goes viral, where I think the voice, the voice itself, like how they actually decided on which voice to use, which tonality, which personality, the degree.
of flirtiness. I think that was very interesting. That's like my takeaway of, oh, like they actually
really thought through how to get the tech community super excited about this. And it's like a one,
two, three punch. So let's talk about that. Because there were some different takes, right? Some people are like, you know what, is this really anything new? This feels like just a slight change from what we had before. But then there's all these nuances that maybe you're speaking to, where they're like, okay, the response time is way faster for the audio model. I think they said
something like it's kind of approaching the speed that a human might respond to you. You talked
about tonality. What are you paying attention to here in terms of maybe these subtleties that maybe
unlock completely new applications that people want to use? I think it sounds just like talking to a
human a lot more than we've seen prior consumer-facing applications in this space do. There's been
great AI voices for a long time. I think in the consumer applications, there have been fewer
folks saying, like, how do we make this sound like you're talking to a friend or a girlfriend?
And the elements that go into that are the pauses, the upspeak at the end of a sentence, the laugh, which was something that a lot of people noticed.
The interruptions.
Yes.
Yes.
Which is like taking voice, which has been available and has been great.
I think these voices were chosen for a reason to go viral.
Most of the ones that did were these kind of female voices that they featured very heavily in the demos.
But applying it in a really kind of new and interesting way.
I have one serious note and one less serious note.
The serious note is that I think speed matters tremendously.
The latency, the lack of latency, it's incredible.
How much more sort of in your brain it just tricks you into, okay, I'm actually just talking
to a person.
And the speed at which it gets back to you, the ums, the nods, again, the laugh, that
laugh, really, that was incredible.
And all of that being able to immediately respond to what you're saying, I think really
changes the game, actually, in terms of, like, use cases.
And so I think one of the things that was striking is that audio is not all
the same thing, right? Music is different from voice, which is maybe a different category from
conversational audio, versus having a dubbing voice. I think there's a different category of sound
that I think we all lump into audio or video, but I think that you can actually go down
very deeply into each of the sub-segments. And I think what was really striking is how good
the conversational piece on this was.
On the less serious note, I think this was meant to go viral with the tech community, and it's awesome, and it did the thing. If you wanted to actually appeal to the general population and go really viral, I think we also saw it on TikTok a few months ago, when women were uploading their conversations with Dan, which is Do Anything Now, the male version of it. That voice was something.
I'm not just any voice. I'm Dan, baby. I've got personality, charm, and a whole lot of sass. Unlike those guys, I'm not afraid to step up and deliver
the goods, whether it's advice, entertainment, or a good old-fashioned roast.
By the way, that audio was from the TikTok account, SickBugs 1.
I mean, it's very compelling and very confident, assertive in the right way.
Voice matters a lot.
And I think if you wanted to go the normal consumer product route, I think that also could
be very interesting, because I think there's giant latent demand for a male
version of Her, if you will.
So Him. You can have a Him.
Yeah.
Why do you think the voices have been female?
Is that just a consumer desire or?
Yeah, it's interesting because like Brian mentioned, before this launch,
the ChatGPT voice that went super viral was the male Dan voice,
which has tens, if not hundreds, of millions of views on TikTok,
largely by Gen Z women making videos featuring him.
So it's interesting that they didn't lean into an upgraded version of Dan for this demo.
I think they knew who their audience would be,
like who would be watching the OpenAI live stream,
and maybe leaned a little bit more into that demographic.
Right, right.
Did you see that really funny meme yesterday?
It was like dating a model, and it was like Leo.
Oh, yeah, 2004 or is Leo versus a third.
There have been some great memes out of this.
It's been incredible.
The meme game out of the OpenAI launch has been incredible.
Yeah.
Well, I mean, something both of you have spoken a lot about is companionship.
To date, it's been mostly text-to-text.
Is that right?
And so now we have something that's truly multimodal.
I think that's also something people have commented on with this model. It's not converting audio to text and then text to audio, right? It seems to be this one underlying foundational model, text, audio, video, imagery, all in one. What changes with that?
I think the latency is a big deal. If you're not translating between all these different mediums of content and you can just do audio to audio, it's a significantly upgraded experience. The Pi app had this for a long time, where you could talk to it and it could talk back, but there's that awkward few-second pause in between that just messes with the flow of the conversation.
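To make that architecture point concrete, here is a minimal sketch, with hypothetical stand-in functions and purely illustrative latencies, of why a cascaded speech pipeline feels slower than a natively multimodal one. The exact numbers don't matter; the point is that every hop adds delay, and the text bottleneck discards tone, laughter, and interruptions.

```python
import time

# Hypothetical stand-ins for real services; the names and latencies
# below are illustrative assumptions, not any vendor's actual API.
def speech_to_text(audio):
    time.sleep(0.3)  # transcription hop
    return "transcript"

def llm_generate(text):
    time.sleep(0.5)  # text-only model hop
    return "reply"

def text_to_speech(text):
    time.sleep(0.3)  # synthesis hop; prosody gets bolted on at the end
    return b"reply-audio"

def native_speech_model(audio):
    time.sleep(0.3)  # one model, audio tokens in and audio tokens out
    return b"reply-audio"

def cascaded(audio):
    # Older design: three hops, so the delays add up, and emotional
    # cues are lost when the audio is flattened into text.
    return text_to_speech(llm_generate(speech_to_text(audio)))

def native(audio):
    # GPT-4o-style design: a single end-to-end hop, which is how
    # response times near human conversational speed become possible.
    return native_speech_model(audio)

for pipeline in (cascaded, native):
    start = time.time()
    pipeline(b"user-audio")
    print(f"{pipeline.__name__}: {time.time() - start:.1f}s")
```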
I think the thing that has been missing from basically all companion apps, that was very new with this launch, too, was it being able to see,
like, through a photo or video, what you were doing and comment on that. A lot of what went viral
was the guy being like, should I wear this hat for my interview? And the voice kind of laughing at
him and being like, no, that doesn't look super professional. Now, if you missed that demo,
here is what Justine is referencing.
In a few minutes, I'm going to be interviewing at OpenAI. Have you heard of them?
OpenAI? Huh? Sounds vaguely familiar.
Kidding, of course. That's incredible, Rocky. What kind of interview?
Well, it's for a software engineering role, but I just need to know, do I look presentable, professional?
Well, Rocky, you definitely have the "I've been coding all night" look down, which could actually work in your favor. Maybe just run a hand through your hair or lean into the mad genius vibe. Your
enthusiasm is what's really going to shine through. I don't have a lot of time, so I'm just going to
throw this on. What do you think? Rocky, that's quite a statement piece. I mean, you'll definitely
stand out, though maybe not in the way you're hoping for an interview. That sort of thing is, I think for
many people who aren't used to having like really deep meaningful interactions via text,
which is a decent chunk of the population: oh, now I can talk to this thing, and it can see
where I am in the world. That's like a much different experience. The prior generation of
companion products, if that was akin to having like a pen pal, now you actually can have a
girlfriend or boyfriend on essentially FaceTime. With this emotive, immediate voice reaction,
you've now gone from a pen pal to a long-distance relationship.
One other thing that came out was a lot of this becoming free.
Does that change the game at all?
I think it will.
I think there's an element of it that's free,
which I think is important and reaches a bunch more users.
I think the bigger step forward with this
was just like a level of personality we hadn't seen before
because people still even were willing to pay for the Dan Voice,
which I think was like the first version of this,
which was good but not incredible.
It feels like the free thing will be big,
but with these sorts of products,
the general thing is the cost tends to go down over time
and they converge to free from a big company like this.
And so the bigger step forward was personality, in my opinion.
Yeah, I think a lot of companies will figure out how to utilize this
to actually own the end customer experience.
And if you actually own and deliver a great experience,
I think you can charge people based on that,
and there'll be some margin.
So the fact that it's free probably allows a lot of businesses to be built,
because it changes the margin structure
of the businesses that can be built.
I bet they'll charge for the API, though.
I'm sure they will, but that actually is still like a marginal cost
that is probably acceptable to a lot of people.
Yeah.
I think what's really, really cool is that you have these products
that can very excitingly talk to you.
We know already that Nature did that study,
where the people who have a text-based companionship bot.
The Replika study.
The Replika study. They suffer less from loneliness
or willingness to hurt themselves.
If you have an emotive one,
I bet that actually helps even more.
And again, like, the feeling of being connected to something, invested in a relationship.
And I think it feels a lot more connected when it can see you and see what you're looking at.
And I think the FaceTime thing is very important because, hey, you look tired.
Yeah.
What's up?
It's very different, how are you doing, how was your day, versus, you look tired, what's up?
And I'm like, I had a shit day.
Yeah.
That's incredible.
Like, that is a FaceTime call with a friend.
And not having to segment audio, text, and imagery, right?
Imagine if, with your friends or your companions, you had to think, okay, is this
going to be an audio conversation, or can I show you something, or can you generate
something? You have to be very mindful in the current age of AI of how you want to engage.
But basically what you're saying is they can not only be proactive, but they can engage with you in
any way. Right. Yeah. They just gained an eye.
Maybe it's a weird way to think about it, but I genuinely think we're constructing an AI companion
similar to Blade Runner. It was able to talk to us. It was able to hear,
potentially, with like ears, and now it can see us.
What's next? I don't know. Maybe it can touch us.
Maybe we'll do the avatars first.
Yeah.
Yeah. So is that the direction you think things are going to go where basically I'm
trying to think through, like, who does this really impact?
So I think it'll be a massive standalone consumer product, like as it already has been,
for OpenAI, which is great. I think a lot of businesses will use the voices via API to build
on top of. If you're having like a conversational interface, many folks will want to use that.
I think the interesting thing will be, OpenAI has historically taken a pretty hard line on content moderation.
And so it was very hard to build a true companionship app, with the ability to have not-safe-for-work conversations, on top of OpenAI models, which, honestly, is part of what has led to the explosion in the open-source LLM ecosystem, folks doing a bunch of work for that.
And these new voices suggest they might go in the direction where, at least via the API, you have a more uncensored model, but I don't think there have been any announcements. I know Sam said on Reddit.
It's a question mark. Like, I haven't actually heard of a lot of companies taking the approach
that OpenAI did, which was multimodal tokens in and out. And there isn't necessarily
an equivalent open-source model for that to rely on. So I wonder if this actually incites a lot
of excitement in that open source community to say, oh, there's a new way, which is also very,
very significant. They are showing a new way to do it. And then the open-source community
catches up in, call it, months or weeks. And then the product space
that OpenAI may not be super excited about explodes.
Yeah. Yeah. Google also released a bunch of stuff.
Like, how are you thinking about the different announcements
and how they compare, contrast?
I think Google has done some incredibly impressive work, obviously.
Like, they have an extremely strong research team.
The DeepMind team is exceptional from a research perspective.
They almost never release the creative tools products that they make.
Like, they've demoed so many amazing-looking video models.
I think the one this week actually was the least impressive I've seen compared to other video models.
But they demoed a new image model.
They demoed a bunch of new music things.
But they're a giant company.
They have a lot of trust and safety stuff.
They release things very infrequently.
And so it'll be interesting to see if, with all of this pressure now from both OpenAI and, increasingly, Anthropic and the open-source community,
if they start actually shipping more of what they've demoed.
There are two fundamentally very different approaches.
Like Google owns distribution.
Yeah.
They have distribution.
Oh, they have distribution.
So I think OpenAI's announcement has more of, look what this can do.
And isn't it very inspirational, and you can build on top of it, and imagine what's
possible, and we're lowering the cost to access it?
That to me is a fundamentally different approach than: we have incredible distribution.
You all have Gmail.
I'm going to just bake in Gemini everywhere.
And it's just going to make your life a lot better.
And it's going to do these certain types of things.
It still is inspirational from a workflow and
prosumer, or work experience, perspective,
but it's a little less of, imagine all these things,
and a little more of, we have the distribution,
we're going to layer this incredible technology on top of it,
and as a result, your life will be much better.
So we'll see influence and impact from two different directions.
One question that does come up is,
even though developers will have access to open source models
and some of the stuff coming out of open AI,
when a company like Google does have the distribution,
whenever Apple comes out and replaces Siri,
is that really going to be the companion for most
because it's right there, it's on device,
or do you guys not really see it that way?
There's always the risk of this big company
will build this thing that the small company wants to build
and crush it.
They are just such slow-moving organizations.
Apple has had immense advantages
from a data perspective, from a distribution perspective,
from having exceptional researchers on the team
across every modality, and thus far has released very, very little.
I think when you have a giant company
and a very set brand like that,
you're extremely opinionated
on what the products are
that you want to release
and you're less likely to do
the things BK mentioned
of like thinking of the next new huge thing
that seems like insane
and out of left field at first.
And so I do think they'll make Siri better.
I don't think Siri will ever be
like the ultimate companion for most people.
And maybe this is one of the themes
that I'll just harp on,
where we call it audio, video,
companion. I think these are very, very large buckets.
Yeah. My guess is that it's 7 billion
people in the world wanting one companion that understands them, listens to them,
encourages them, all that. It's the universal need. I think that comes in all colors.
It's like Siri: what's the weather, do this, do that. That's one way to think about it.
But there's a friend category. What if you want to go deeper? What if you actually
want to FaceTime with the thing? Will Apple and Siri ever allow that? Maybe,
maybe not. How do you actually think about building in that direction? I think all of those
cases escape a giant company.
Yeah. There's probably some version of a companion that does not exist today because a human
couldn't possibly be that thing to someone that will be created through this tech.
Yeah.
My sense is that it needs to be on a device that billions of people already have.
If you really want to go the consumer route, I think you can build a desktop cam product,
of course, as well.
But for me to think about, oh, we're going to build net new hardware that incorporates
a state-of-the-art product, that seems less probable than finding a way to utilize the device
everyone already has, which already has a best-in-class camera. Yeah, I think a lot of people have been building
separate hardware devices today because a lot of the hardware companions have been like,
it listens to your conversations, it provides insights, it reminds you of things. And one of the
limitations of current phones is that you can't be like playing music or on a Zoom meeting or on a
call and also have an app that's recording. And so you have to have a separate
hardware device. And I think part of the question is, is that really a limitation that's going to
exist in the future? If someone like Apple wants to do a companion, I mean, I'm sure they'll have
all sorts of privacy and security and et cetera concerns. Maybe you'll get a message at the beginning
of the call that Justine's AI companion is also listening in on this. And there might be some great
hardware products, if they can figure out reliability, and the cost can come down, and it's a form factor
that people want to wear and that makes sense. But I don't think we've seen any net new hardware
devices. I think glasses, as a form factor, have been so interesting for over a decade
because it just seems to make a lot of sense, right? If it's literally on where your eyes are,
and it's also very close to, like, your mouth and your ears. And so it's, like, a convenient
location to be taking in information. But I don't think anyone has successfully made an LLM run on
glasses. Yeah. I'm excited about the AirPods, and it doesn't have to be Apple, because that is
the only device that I've ever worn that I've forgotten I'm wearing.
Yeah, right? So that like form factor where it's actually part of you versus all these other things where you're like, I got a clip on this pin.
Or I like have to take off my necklace when I shower or like I've literally forgotten that I had AirPods in and I'm like showering, right?
AirPods is a good one. I think because Apple tends to be a more closed ecosystem, especially with newer devices like the AirPods, there hasn't been a ton built on top of them yet.
There was a time where there was a thesis of AirPods as a platform. That hasn't really happened.
Yeah. Obviously it's been an exciting week. We heard a couple of announcements, and knowing everything we've seen in the last two years from AI, this is not
stopping, right? This is just part of the long arc. Where do you guys think this goes?
I think where this goes is to a natural conclusion, which is we're mimicking, with the technology, the things that humans typically do. And again, we're giving it senses as we go along, and I think we just gave it an eye. The ability to communicate, ability to hear you, ability to listen to you,
ability to see you is a really, really great start for a lot of the very interesting use
cases. To add to that on the companion space in particular, it has been a large subculture of
AI for a long time, driving a ton of innovation in both language models and honestly image models.
A subculture we are very big fans of.
Yeah, we are quite deep in.
But it was kind of looked down upon by a lot of AI researchers and folks at the big companies, like, that's not what we want to be building, we're going towards AGI, that sort of thing.
And honestly, OpenAI choosing to show off those voices, and the tweets that they put out
about it afterwards, sort of legitimizes the space in a way that I think is really interesting.
And that may prompt more established companies, researchers, and developers to build in the space,
and more people to talk about using the products, which, for people who have been
studying this space for a long time,
is very exciting. Yeah. I mean, you
can't ignore the demand, right? Yes.
Or the usage. By the way, super quickly,
you talked about adding an eye. This is
not going to help the seriousness of the conversation
around companionship, but you can just imagine
these models as Mr. Potato Head,
and you're just slowly adding
on the features. You can almost just imagine
it: okay, he's got an eye now, right? He's got an ear
now. I think that's very serious.
The fact that you can treat a computer
like a Potato Head that talks
to you, listens to you, and emotes with
you. It's incredible. I think the future is going to be very fun. Maybe that should
be the icon when you're talking. Because everyone knows the ChatGPT voice, it's just that
black circle. And maybe it should be a Potato Head. So that's another interesting thing. We're not
using the screen real estate effectively with voice only. So that's where I think the FaceTime
element comes in. Yeah. Avatars. All right. Well, we'll have to do this again, probably very soon.
Well, thank you. Thanks.
All right. That's all for now. If you liked this kind of episode, where our partners
break down the latest and greatest in consumer tech, let us know. Shoot us an email at podpitches@a16z.com or drop us a review at ratethispodcast.com/a16z. And don't forget
to subscribe so that you are the first to know when we drop our episode around a16z's Gen
AI 100 list. We'll see you then.