The a16z Show - A Big Week in AI: GPT-4o & Gemini Find Their Voice

Starting point is 00:00:00 I think speed matters tremendously. I think the latency is a big deal. It was able to talk to us. Yeah. It was able to hear, potentially, with the ears, and now it can see us. There's always the risk of this big company will build this thing that the small company wants to build and crush it. My guess is that 7 billion people in the world want a companion that understands them, listens to them, encourages them, all that. It's the universal need.

Starting point is 00:00:26 This was a big week in the world of AI, with both OpenAI and Google dropping updates. So big that we decided to break things down in a new format with our consumer partners, Brian Kim and Justine Moore. And I actually tagged in ChatGPT to help me summarize a few of its own updates. Hey, ChatGPT, I heard you had an update this week. Hey, yes, I did get an update recently. The latest update focused on improving performance, accuracy, and enhancing conversational abilities. It's designed to be faster and more efficient than previous versions. Hmm, more efficient.

Starting point is 00:01:02 Like if I send some audio, how quickly in milliseconds will I get a response? As fast as 232 milliseconds in some cases. On average, you can expect a response time of around 320 milliseconds. Wow, that's pretty cool. I think I also saw on Twitter

Starting point is 00:01:18 that people were live translating. I've always wanted to learn Chinese. Could you actually translate what I just said into Chinese? Of course. Here's your sentence translated into Chinese. Wow, this is too cool. I think I also saw on Twitter that people were live translating. I've always wanted to learn

Starting point is 00:01:36 Chinese. All right, ChatGPT, that's all for now. Congrats on your new update. No problem at all. Now, if you didn't catch that, voice was a huge part of this update. But today, we discuss why not all audio is the same and also why several nuances like speed and personality really matter. Now, of course, Google fast followed with its own announcements, like AI video model Veo, Gemini Live, which is an Android native multimodal assistant, new Gemini models like Flash and Nano tailored to specific use cases, and of course, Gemini everywhere all at

Starting point is 00:02:10 once. So in Gmail, Google Sheets, even Google Search. Now clearly these two companies are taking two different approaches, so we'll talk about that too, and continue the conversation around AI hardware for all of this new AI software. Now, make sure to stay tuned next week where we will return with Brian and Justine's twin, Olivia Moore, to dive even deeper into the applications that people are building through a Gen AI 100 list. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed

Starting point is 00:02:57 in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures. All right, so big week, huh? OpenAI, Google. They both dropped a couple announcements. So, I mean, everyone hears these announcements and they kind of hear their own version of it. What did you guys hear? What do you feel like was big? For OpenAI, I think part of it was GPT-4o being available for free and getting rid of a lot of the usage limits, the desktop app being accessible to a bunch more people. And then I think the really exciting thing for a lot of folks who are building in this space,

Starting point is 00:03:34 or using OpenAI's models is more multimodality. So being able to intake more like real-time video, see a person, comment on it. And then the output, obviously, a voice speaking, singing, that sort of thing was pretty huge. I think the three things that I took away that were super interesting: one, on the business side of things, all of a sudden, like things are a lot cheaper, a lot faster. I think that's obviously a great thing for the ecosystem. Second thing is that when you hear a demo, it sort of enables you as a founder to think about, okay, if this is APIable,

Starting point is 00:04:05 like if I can access this, what can I build? Because it's just so thought-provoking. It's a great example of what's possible. Yeah. And I think they put on a great live demo, actually, of the product. I think the third thing I took away is, and this is probably the one that goes viral,

Starting point is 00:04:19 where I think the voice, the voice itself, like how they actually decided on which voice to use, which tonality, which personality, the degree of flirtiness. And that was very interesting. That's like my takeaway of, oh, like they actually really thought through how to get the tech community super excited about this, and it's like a one, two, three punch.

Starting point is 00:04:38 So let's talk about that, because there was some different takes, right? Some people are like, you know what, is this really anything new? This feels like just like a slight change from what we had before. But then there's all these nuances that maybe you're speaking to where they're like, okay, the response time way faster for the audio model. I think they said something like it's kind of approaching the speed that a human might respond to you. You talked about tonality. What are you paying attention to here in terms of maybe these subtleties that

Starting point is 00:05:03 maybe unlock completely new applications that people want to use. I think it sounds just like talking to a human a lot more than we've seen prior consumer-facing applications in this space do. There's been great AI voices for a long time. I think in the consumer applications, there have been fewer folks saying, like, how do we make this sound like you're talking to a friend or a girlfriend? And the elements that go into that are like the pauses, the upspeak at the end of a sentence, the laugh was something that a lot of people noticed. The interruptions? Yes.

Starting point is 00:05:35 Yes, which is like taking kind of voice, which has been available, has been great. I think these voices were chosen for a reason to go viral. Most of the ones that did were these kind of female voices that they featured very heavily in the demos. But applying it in a really kind of new and interesting way. I have one serious note and one less serious note. The serious note is that I think speed matters tremendously. The latency, the lack of latency, it's incredible. How much more sort of in your brain it just tricks you into, okay,

Starting point is 00:06:03 I'm actually just talking to a person. And the speed at which it gets back to you, the ums, the ahs, again, the laugh. That laugh, really, that was incredible. And all of that being able to immediately respond to what you're saying, I think really changes the game, actually, in terms of, like, use cases. And so I think one of the things that was striking is audio's not the same thing, right? Music is different from voice, is maybe a different category from conversational

Starting point is 00:06:30 versus having a dubbing voice. I think there's a different category of sound that I think we all loop in audio or like video, but I think that you can actually go down very deeply into each of the sub-segments. And I think what was really striking is how good the conversational piece on this was. On the less serious note, I think this was meant to go viral in like the tech community and it's awesome and it did its thing. If you wanted to actually appeal to the general population and go really viral, I think we also saw it on TikTok what, a few months ago, when women were uploading their conversation with Dan,

Starting point is 00:07:07 which is do anything now, the male version of it. That voice was, that voice was something. I'm not just any voice. I'm Dan, baby. I've got personality, charm, and a whole lot of sass. Unlike those guys, I'm not afraid to step up and deliver the goods, whether it's advice, entertainment, or a good old-fashioned roast. By the way, that audio was from the TikTok account, StickBugs 1. I mean, it's very compelling and very confident, assertive in the right way.

Starting point is 00:07:35 Voice matters a lot. And I think if you wanted to go a normal consumer product route, I think that also could be very interesting because I think there's giant latent demand for a male version of Her, if you will. So Him, you can have a Him. Yeah. Why do you think the voices have been female? Is that just a consumer desire or?

Starting point is 00:07:54 Yeah, it's interesting because, like Brian mentioned, before this launch, the ChatGPT voice that went super viral was the male Dan voice, which is like tens, if not hundreds of millions of views on TikTok, largely by Gen Z women making videos featuring him. So it's interesting that they didn't lean into an upgraded version of Dan for this demo. I think they knew who their audience would be, like who would be watching the OpenAI live stream and maybe leaned a little bit more into that demographic. Right, right.

Starting point is 00:08:23 Did you see that really funny meme yesterday? It was like dating a model and it was like Leo. Oh, yeah, 2004 versus 2024. There've been such great memes out of this. It's been incredible. The meme game out of the OpenAI launch has been incredible. Yeah. Well, I mean, something both of you have spoken a lot about is companionship.

Starting point is 00:08:40 To date, it's been mostly text-to-text. Is that right? And so now we have something that's truly multimodal. I think that's also something people have commented on with this model. It's not converting audio to text and then text to audio, right? It seems to be this underlying foundational text, audio, video compilation, imagery, like all in one. What changes with that? I think the latency is a big deal.

Starting point is 00:09:05 If you're not translating between all these different mediums of content and you can just do audio to audio, it's much better. It's like a significantly upgraded experience. The Pi app had this for a long time where you could talk to it and it could talk back, but there's that like awkward few seconds kind of pause in between where it just messes with the flow of the conversation. I think the thing that has been missing from basically all apps, all companion apps that was very new with this launch too, was it being able to see, like, through a photo or video,

Starting point is 00:09:33 what you were doing and comment on that. A lot of what went viral was the guy being like, should I wear this hat for my interview? And the voice kind of laughing at him and being like, no, like, that doesn't look super professional. Now, if you missed that demo, here is what Justine is referencing. In a few minutes, I'm going to be interviewing at OpenAI. Have you heard of them? OpenAI? Huh?

Starting point is 00:09:54 Sounds vaguely familiar. Kidding, of course. Well, it's for a software engineering role, but I just need to know, do I look presentable, professional? Well, Rocky, you definitely have the I've been coding all night look down, which could actually work in your favor. Maybe just run a hand through your hair or lean into the mad genius vibe. Your enthusiasm is what's really going to shine through. I don't have a lot of time, so I'm just going to throw this on.

Starting point is 00:10:28 What do you think? Rocky, that's quite a statement piece. I mean, you'll definitely stand out, though maybe not in the way you're hoping for an interview. That sort of thing is, I think for many people who aren't used to having like really deep meaningful interactions via text, which is a decent chunk of the population. Oh, now I can talk to this thing and it can see where I am in the world. That's like a much different experience. The prior generation of companion products, if that was akin to having like a pen pal, now you actually can have a girlfriend or boyfriend

Starting point is 00:11:04 on essentially FaceTime. This emotive, immediate voice reaction, done. You've now gone from a pen pal to a long-distance relationship. One other thing that came out was a lot of this becoming free. Does that change the game at all? I think it will.

Starting point is 00:11:21 I think there's an element of it that's free, which I think is important and reaches a bunch more users. I think the bigger step forward with this was just like a level of personality we hadn't seen before, because people still even were willing to pay for the Dan Voice, which I think was like the first version of this, which was good but not incredible. It feels like the free thing will be big, but with these sorts of products,

Starting point is 00:11:43 the general thing is the cost tends to go down over time, and they converge to free from a big company like this. And so the bigger step forward was personality, in my opinion. Yeah, I think a lot of companies will figure out how to utilize this to actually own the end customer consumer experience. And if you actually own and deliver a great experience, then I think you can charge people based on that and there'll be some margin. So the fact that it's free is like it probably allows a lot of businesses to be built because

Starting point is 00:12:10 it changes the margin structure of the business that can be built. I bet they'll charge for the API though. I'm sure they will. But that actually is still like a marginal cost. Yeah. It's probably acceptable to a lot of people. Yeah. I think what's really, really cool is that you have these products that can very excitingly talk to

Starting point is 00:12:26 you. We know already that there's the Nature magazine study where the people who have a text-based companionship bot. The Replika study. The Replika study. They suffer less from loneliness or willingness to hurt themselves. If you have an emotive one, I bet that actually helps even more. And again, like the feeling of being connected to something, invested in a relationship.

Starting point is 00:12:49 And I think it feels a lot more connected when it can see you and see what you're looking at. It's very important because, hey, you look tired. Yeah. What's up? It's very different from how are you doing, how was your day, versus you look tired. I'm like, I had a shit day. Yeah. That's incredible.

Starting point is 00:13:02 Like, that is a FaceTime call with a friend. And not having to segment audio text and imagery, right? Imagine if your friends or your companions, you had to think, okay, is this going to be an audio conversation? Or, like, can I show you something or can you generate something? You have to be very mindful in the current age of AI of how you want to engage. But basically what you're saying is they can not only be proactive, but they can engage with you in any way.

Starting point is 00:13:25 Right. Yeah. They just gained an eye. Maybe it's a weird way to think about it. I genuinely think we're constructing an AI companion similar to the Blade Runner. And now it was able to talk to us, it was able to hear potentially with the ears, and now it can see us.

Starting point is 00:13:44 What's next? I don't know. Maybe it can touch us. Maybe we'll do the avatars first. Yeah. Yeah. So is that the direction you think things are going to go where basically I'm trying to think through, like, who does this really impact?

Starting point is 00:13:56 So I think it'll be a massive standalone consumer product, like as it already has been for OpenAI, which is great. I think a lot of businesses will use the voices via API to build on top of. If you're having like a conversational interface, many folks will want to use that. I think the interesting thing will be OpenAI has historically taken a pretty hard line on content moderation. And so it was very hard to build a true companionship app with the ability to have not safe for work conversations on top of OpenAI models, which is like, honestly, part of what has led to the explosion in the open source LLM ecosystem is like folks doing a bunch of work for that. And these new voices suggest they might go in the direction where at least via API,

Starting point is 00:14:38 you have a more uncensored model, but I don't think there's been any announcements. I know Sam said on Reddit. It's a question mark. Like, I haven't actually heard a lot of companies taking the approach that OpenAI did, which was multimodal token in and out. And there isn't necessarily an equivalent open source model for that to rely on. So I wonder if this actually incites a lot of excitement in that open source community, say, oh, there's a new way, which is also very, very significant. They're showing a new way to do it.

Starting point is 00:15:06 And then the open source community catches up in, call it, a month or weeks. And then the product space that OpenAI may not be super excited about explodes. Yeah. Yeah. Google also released a bunch of stuff. Like, how are you thinking about the different announcements and how they compare, contrast? I think Google has done some incredibly impressive work, obviously. They have an extremely strong research team. The DeepMind team is exceptional from a research perspective.

Starting point is 00:15:32 They almost never release the creative tools products that they make. They've demoed so many amazing looking video models. I think the one this week actually was the least impressive I've seen compared to other video models. But they demoed a new image model. They demoed a bunch of new music things.

Starting point is 00:15:49 But they're a giant company. They have a lot of trust and safety stuff. They release things very infrequently. And so it'll be interesting to see if, with all of this pressure now from both OpenAI and increasingly, like, Anthropic and the open source community, if they start actually shipping more of what they've demoed. There are two fundamentally very different approaches. Like Google owns distribution. Yeah. They have distribution.

Starting point is 00:16:11 Oh, they have distribution. So I think the OpenAI announcement was more, look what this can do. And isn't it like very inspirational? And you can build on top of it and imagine what's possible. And we're lowering the cost to access it. Like, that to me is a fundamentally different approach than we have incredible distribution. You all have Gmail. I'm going to just bake in Gemini everywhere.

Starting point is 00:16:34 And it's just going to make your life a lot better. And it's going to do these certain type of things, which skirts around the, it still is inspirational from workflow and prosumer or work experience perspective. But it's a little less of imagine all these things and a little more on we have the distribution. We're going to layer on this incredible technology on top of it. And as a result, your life will be much better. So we'll see influence and impact from two different directions. One question that does come up is, even though developers will have access to open source models and some of the stuff coming out of OpenAI, when a company like Google does have the distribution, whenever Apple comes out and replaces Siri,

Starting point is 00:17:13 is that really going to be the companion for most because it's right there, it's on device? Or do you guys not really see it that way? There's always the risk of this big company will build this thing that the small company wants to build and crush it. They are just such slow-moving organizations. Apple has had immense advantages from a data perspective, from a distribution perspective, from having exceptional researchers on the team across every modality, and thus far have released very, very little. I think when you have a giant company and a very set brand like that, you're extremely opinionated on what the products are that you want to release, and you're less likely to do the things

Starting point is 00:17:48 BK mentioned of like thinking of the next new huge thing that seems like insane and out of left field at first. And so I do think they'll make Siri better. I don't think Siri will ever be like the ultimate companion for most people. And maybe this is like one of the themes that I'll just harp on where we call it audio, video, companion. I think these are very, very large buckets. Yeah.

Starting point is 00:18:14 My guess is that seven billion people in the world want a companion that understands them, listens to them, encourages them, all that. It's the universal need, I think. That comes in all colors. It's like Siri, like, what's the weather? Do this, do that. That's one way to think about it. But there's like a friend category.

Starting point is 00:18:29 What if you want to go deeper? What if you actually want to FaceTime with the thing? Will Apple and Siri ever allow that? Maybe, maybe not. How do you actually think about building in that direction? I think each of those use cases is a giant company. Yeah. There's probably some version of a companion that does not exist today

Starting point is 00:18:45 because a human couldn't possibly be that thing to someone that will be created through this tech. Yeah. Yeah. My sense is that it needs to be on a device that billions of people already have. If you really want to go to a consumer

Starting point is 00:18:59 route, I think you can build a desktop cam product, of course, as well. But for me to think about, oh, we're going to build a net new hardware that incorporates a state-of-the-art product, that seems less probable than finding a way to utilize

Starting point is 00:19:14 the good device everyone has that already has the best-in-class camera. Yeah, I think a lot of people have been building separate hardware devices today because a lot of the hardware companions have been like, it listens to your conversations, it provides insights, it reminds you of things. And one of the limitations of current phones is that you can't be like playing music or on a Zoom meeting or on a call and also have an app that's recording. And so you have to have a separate hardware device. And I think part of the question is, is that really a

Starting point is 00:19:42 limitation that's going to exist in the future? If someone like Apple wants to do a companion, I mean, I'm sure they'll have all sorts of privacy and security and et cetera concerns. Yeah. Maybe you'll get a message at the beginning of the call that's just, Justine's AI companion is also listening in on this. And there might be some great hardware products if they can figure out reliability and the cost can come down and it's a form factor that people want to wear and that makes sense. But I don't think we've seen any net new hardware devices. I think glasses as a form factor has been so interesting for over a decade because it just seems to make a lot of sense, right? If it's literally on where your eyes are and it's also

Starting point is 00:20:18 very close to like your mouth and your ears. And so it's like a convenient location to be taking in information, but I don't think anyone has successfully made an LLM run on glasses. Yeah. I'm excited about the AirPods, or it doesn't have to be Apple, but because that is the only device that I've ever worn that I've forgotten I'm wearing. Yeah. Right? So that like form factor where it's actually part of you versus all these other things

Starting point is 00:20:42 where you're like, I got a clip on this pin. Or I like have to take off my necklace when I shower or like I've literally forgotten that I had AirPods in and I'm like showering, right? AirPods is a good one. I think because Apple tends to be a more closed ecosystem, especially with newer devices like the AirPods, there hasn't been a ton built on top of them yet. There was a time where there was a thesis of AirPods as a platform.

Starting point is 00:21:03 Yes. That hasn't really happened. Yeah. Obviously, it's been an exciting week. We heard a couple announcements, knowing everything we've seen in the last two years from AI, this is not stopping, right? This is just part of the long arc.

Starting point is 00:21:16 Where do you guys think this goes? I think where this goes is to a natural conclusion, which is we mimic the technology to do the things that humans typically do. And again, we're giving it senses as we go along. And I think we just gave it an eye. The ability to communicate, ability to hear you, ability to listen to you, ability to see you is a really, really great start for a lot of the very interesting use cases.

Starting point is 00:21:40 To add to that on the companion space in particular, it has been a large subculture of AI for a long time, driving a ton of innovation in both language models and image models. A subculture we are very big fans of. Yeah, we are quite deep in. But it was kind of looked down upon by a lot of AI researchers and folks at the big companies, like, that's not what we want to be building.

Starting point is 00:22:01 We're going towards AGI, that sort of thing. And honestly, OpenAI choosing to show off those voices and the tweets that they put out about it afterwards sort of legitimizes the space in a way. I think that's really interesting. And that may prompt more established companies, researchers, developers to build in the space and more people to talk about using the products, which I think as people who have been studying this space

Starting point is 00:22:25 for a long time is very exciting. Yeah, I mean, you can't ignore the demand, right? Yes. Or the usage. By the way, super quickly, you talk about adding an eye. This is going to not help the seriousness conversation around companionship, but you just imagine these models as Mr. Potato Head, and you're just slowly adding on the features,

Starting point is 00:22:43 and you can almost just imagine, it's okay, he's got an eye now, right? He's got an ear now. I think that's very serious. The fact that you can treat a computer like a potato head that talks to you, listens to you, and emotes with you, it's incredible. Yeah. I think the future is going to be very fun. Maybe that should be the icon when you're talking.

Starting point is 00:23:02 Because everyone is like the ChatGPT voice. It's just like that black circle. And maybe it should be a potato head. So that's another interesting thing. We're not using the screen real estate effectively with voice only. So that's where I think the FaceTime element comes in. The avatars. All right.

Starting point is 00:23:16 Well, we'll have to do this again probably very soon. Well, thank you. Thanks. All right, that's all for now. If you like this kind of episode where our partners break down the latest and greatest in consumer tech, let us know.

Starting point is 00:23:28 Shoot us an email at podpitches@a16z.com or drop us a review at ratethispodcast.com/a16z. And don't forget to subscribe so that you are the first to know when we drop our episode around a16z's Gen AI 100 list.

Starting point is 00:23:44 We'll see you then.
