Big Technology Podcast - Teaching AI To Read Our Emotions — With Alan Cowen

Episode Date: May 1, 2024

Alan Cowen is the CEO and founder of Hume AI. Cowen joins Big Technology Podcast to discuss how his company is building emotional intelligence into AI systems. In this conversation, we examine why AI needs to learn how to read emotion, not just the literal text, and look at how Hume does that with voice and facial expressions. In the first half, we discuss the theory of reading emotions and expressions, and in the second half we discuss how it's applied. Tune in for a wide-ranging conversation that touches on the study of emotion, using AI to speak with — and understand — animals, teaching bots to be far more emotionally intelligent, and how emotionally intelligent AI will change customer service, products, and even staple services today. --- Enjoying Big Technology Podcast? Please rate us five stars ⭐⭐⭐⭐⭐ in your podcast app of choice. For weekly updates on the show, sign up for the pod newsletter on LinkedIn: https://www.linkedin.com/newsletters/6901970121829801984/ Want a discount for Big Technology on Substack? Here's 40% off for the first year: https://tinyurl.com/bigtechnology Questions? Feedback? Write to: bigtechnologypodcast@gmail.com

Transcript
Let's talk with the CEO of the leading AI startup for emotional intelligence, looking at where this exciting new discipline of AI is heading, right after this. Welcome to Big Technology Podcast, a show for cool-headed, nuanced conversation of the tech world and beyond. We have a great show today going into one of the more interesting and emerging areas of AI research and application. I think it's going to be fascinating. We're probably going to see this stuff packed into basically every AI that we touch going forward, and we're going to speak with somebody who's really at the cutting edge of where this is all going. Alan Cowen is here. He is the CEO and founder of Hume AI, which is a company with a bot that we've talked about on the show here in the past, one that can sense the emotion in your voice and reply in kind. It's very fascinating. And I cannot tell you how excited I am to have Alan on the show. Alan, welcome to Big Technology. Hi, Alex. Thanks for having me on the show. We're going to speak with Dwarkesh Patel in the next couple weeks, but he had a podcast episode with Mark Zuckerberg basically right around the introduction of Llama 3, and they were talking about the cutting edge of AI research. And Zuckerberg says, emotional AI for some reason is something that I don't think a lot of people
are paying attention to, but there's a lot of potential there. One modality that I'm pretty focused on, that I haven't seen as many other people in the industry focus on, is sort of like emotional understanding. I mean, so much of the human brain is just dedicated to understanding people and kind of understanding your expressions and emotions, and I think that that's like its own whole modality. What do you think he was getting at when he said that? I'm not entirely sure what's in his mind, but it's definitely the right problem.
I think the reason it's the right problem is that when you're trying to address people's, let's say, concerns with AI, and also when you're trying to just meet their requests, the most important thing is emotional intelligence. It's the ability to take people's preferences into account in your response. And so all AI is emotional in the sense that all AI has to do this, but AI that's specifically trained to do this is going to learn to produce responses that make people happier. And that's the fundamental objective for AI. And why is that important in terms of advancing AI? So if you're talking to ChatGPT, you're implicitly saying, okay, I'm going to give you a way to make me happy, which is to fulfill this request. Oftentimes you don't even have an explicit request, and when you do, it's actually fundamentally ambiguous, right? So the goal of an interface is to take this ambiguous slice of behavior, and it could be language, or it could be language plus audio, plus facial expression, plus eye contact, whatever behavior it has access to. And it's trying to make, contingent on that, the best estimate of what might make you happy if it responds in a given way, right? So that's literally what it's trying to do.
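To make that objective concrete, here is one way to write it down as an equation. This is an illustrative formalization of what Cowen describes, using my own notation rather than anything from Hume:

```latex
% Illustrative notation, not Hume's:
%   b = the observed slice of behavior (text, audio prosody, facial expression, eye contact)
%   r = a candidate response
%   S = the user's eventual satisfaction with that response
r^{*} = \arg\max_{r} \; \mathbb{E}\left[\, S \mid b,\, r \,\right]
```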
It should be able to learn that by understanding human emotional reactions to things. That's like the fundamental way that humans learn this stuff. We can either simulate what your emotional response will be to me doing something for you, based on putting myself in your shoes, or we can look at your emotional reaction, figure out whether I did the right thing or not, and adjust. And those are really the ways that we learn to make each other happy. So much of that is actually done in voice, done in expression, and in text it's so much harder. Let's start with the hardest problem and then kind of work back from there. How do you suss out emotion in, like, a text exchange? In text, there are some cues, right? And most of the time we're not actually sussing out emotion in text. We just have certain conventions, and we use those conventions to dictate how we communicate. And so we know that this is a very narrow channel through which we can express our desires, and we know that there's fundamental ambiguity. So we just basically limit what we try to do. We don't try to do too much with text. We don't try to bite off more than we can chew. What are some of the conventions that you'd say we rely on with text? Punctuation and capitalization, where you put periods, when you want to add an extra y to the 'hey,' all these different conventions, and they just add these kinds of stereotyped meanings. So it's a message-passing system. It's very frustrating to correspond by text if what you actually are trying to say is anything more complex than that. What's coming across here is that it's actually remarkable how limited this current set of large language models is, given that it's all text, and it's still blowing people's minds, but there's really a limited range in which it can be useful. Is that right? Yeah, and it's never seen people's emotional reactions in any other
modality. It learns some things from text about what people want and don't want, but it actually doesn't have that layer of understanding of how is this going to make people feel. So even via text, it seems to be missing something. And yeah, it can solve problems if you specify the problem in exactly the right way. For example, in the way that these LLMs are evaluated, you just literally ask it a multiple-choice question. If you specify it with that level of constraint, then they're really good. But anytime you want to do something more open-ended, it's a relatively unsatisfying experience. So this is why people are gauging how good these things are by asking them to take the LSAT. It's not an accident. It's sort of the only way to evaluate them within their constraints. And that's how we eval them. So the biggest eval that people use is MMLU, and it's literally, you can get each question right or wrong, and then you just look at the percentages they got right. It's basically a big multiple-choice test with some other kinds of questions in it. When I first started using your product, you have a bot, it's called EVI. And when I first started using it, and it's kind of in demo mode now, and the API just opened, we can talk about that, but I was like, oh, this seems like a large language model in some ways, in that it's giving some of those kind of wink-wink, as-an-AI-assistant lines about its feelings: As an AI system, my purpose is not to replicate the full human experience, but to engage in meaningful and empathetic conversations. But it's voice. And it's now coming into focus that you decided on voice because there's so much more signal, and also so much more realm of possibility in terms of interacting with this than you can with a text bot.
There's so much more data that comes in via voice and can go out via voice. Then you can really take a next step in terms of the way that generative AI is moving. Previous technology relied exclusively on transcribing the voice and taking away all that extra meaning that's in your vocal modulations. So this technology didn't exist before Hume. When you add that in, what's remarkable is that the model can actually learn what vocal modulations mean on its own. You just need to add it in the right way. We have these expression measurement tools that we've trained that are extremely accurate. They can actually encode 756 dimensions of embedding of vocal modulation and 756 dimensions of facial expression per word that you're saying. So there are these incredibly rich embeddings that we're adding in that add a lot more information, and the model can actually learn from that. And it turns out that it's just more satisfying to use when it has that data.
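As a rough sketch of what "adding it in the right way" could look like, the snippet below concatenates hypothetical per-word prosody and facial-expression embeddings with ordinary token embeddings before they reach a language model. The 756-dimension figures come from the interview, but the class, layer sizes, and fusion approach are assumptions for illustration, not Hume's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions following the interview: 756-d vocal-modulation and
# 756-d facial-expression embeddings per word, alongside ordinary token embeddings.
PROSODY_DIM, FACE_DIM, TEXT_DIM, MODEL_DIM = 756, 756, 1024, 2048

class ExpressionAwareInput(nn.Module):
    """Fuses per-word expression features with token embeddings (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.token_embed = nn.Embedding(50_000, TEXT_DIM)
        # Project the concatenated text + prosody + face features into the model's hidden size.
        self.fuse = nn.Linear(TEXT_DIM + PROSODY_DIM + FACE_DIM, MODEL_DIM)

    def forward(self, token_ids, prosody, face):
        # token_ids: (batch, words); prosody and face: (batch, words, 756)
        text = self.token_embed(token_ids)
        fused = torch.cat([text, prosody, face], dim=-1)
        return self.fuse(fused)  # (batch, words, MODEL_DIM), ready for a transformer

# Toy usage with random stand-in features for a batch of 2 utterances, 5 words each.
fuser = ExpressionAwareInput()
tokens = torch.randint(0, 50_000, (2, 5))
prosody = torch.randn(2, 5, PROSODY_DIM)
face = torch.randn(2, 5, FACE_DIM)
print(fuser(tokens, prosody, face).shape)  # torch.Size([2, 5, 2048])
```

Because the expression features arrive aligned word by word with the text, a downstream model can learn end to end what the vocal modulations mean, which is the point Cowen is making here.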
Take us through, very briefly, a sort of ride through the study of emotion in humans, where it was lacking, and how it's now being applied to robots. Or maybe not that briefly. I feel like that's a tall task to do in a minute. I mean, it starts with Charles Darwin. I won't go back that far. But basically, he was sketching out facial expressions in both man and animals, and he wrote a book called The Expression of the Emotions in Man and Animals, and he recognizes the complexity of it and the nuance. But somehow that kind of got lost over time. Nobody really talked about that book for a while. In the mid-20th century, there was Paul Ekman, who decided he would tackle the question of what is it about facial expression that may or may not be universal. And so he cataloged these six facial expressions as an arbitrary test, more or less, of what these broad expression stereotypes might mean to different people. And he chose them because they're very different. He only chose one positive expression. He chose five negative expressions. Wait, which ones are they? Anger, disgust, fear, happiness, sadness, and surprise. Wow. Surprise could be good, but it's definitely largely over-indexed on bad feelings.
Starting point is 00:09:03 It was bad in his conceptualization of it. Yeah. So positive surprise is good, negative surprise is bad. And then there's also related epistemic emotions like awe and interest which are good or confusion, which is bad. So to varying extent with varying levels of arousal and excitement. Anyway, but none of these distinctions are in those images. And for some reason, people just stuck to those six categories for a while. And there was a countervailing narrative that actually there are fewer granularities in emotion that people recognize.
That actually it's just valence, positive versus negative, versus arousal, calm versus excited. I don't know why that was the countervailing argument, except that maybe people were just limited by the amount of data that they could collect, and this was the closest thing to data-driven that they had, which is, if you do dimensionality reduction on 100 samples, you get positive-negative and low arousal-high arousal as your dimensions. Wait, arousal in terms of what? What does arousal mean? Arousal means, like, excitement, meaning activation of the autonomic nervous system. Okay. Good to have that clarified. Okay. Yeah. And there are lots of dimensions to that, obviously. One of them is actually sexual arousal. That would be different; we recognize that it's very different from the arousal that goes with fear and running for your life, right? And yet in this theory they're compressed. I know. Well, depends who you ask. But okay, continue. You could argue that one can cause the other, I don't know. Well, anyway. It's true. Not to go too far into the digression, but there is scientific literature that says danger and arousal are connected. I mean, Esther Perel had a whole book on it called Mating in Captivity. That's excellent. Yeah. Well, yeah, it might not be danger. Not our specialty. Novelty, right? Yeah. And there was misattribution of arousal, which is this old experiment where you take somebody onto a bridge when they're very excited, and then you test whether they're basically aroused or whether they're attracted to you. And this is not something that necessarily replicates in every condition.
Starting point is 00:11:05 and then you test whether they basically are roused or whether they're attracted to you. And this is not something necessarily replicates in every condition. Okay, so there's lots of different kinds of emotion. They get reduced down to these six basic categories or two dimensions for a very long time in research. And a lot of that has to do with the fact that you can only collect so much data. The data analysis tools that were available to people for a long time were very coarse. For example, there weren't even really computers that psychologists were using to analyze data. Incidentally, psychologists did, like the field of stats sort of came out of psychology, but that was earlier.
Starting point is 00:11:53 And then, you know, when it came to actually applying it to lots of data, that was much harder. And that didn't come until much later. So now data science comes around sort of in the 2000s, right, and becomes a really big thing. And psychology was one of the last places that it was applied. So it starts to get applied to neuroscience and biology and genetics and all of these other areas. But people didn't really have the tools to analyze human behavior until relatively recently. And part of that is like even when you have the data analysis techniques, the data collection techniques are difficult. So what I pioneered in my PhD in psychology and while I was working for Google and Facebook was a new kind of data collection that you could then use to collect enough psychologically controlled data to apply data science to psychology data.
the dimensionality of emotion. What were the surprising results? What you can see clearly is that people make really granular distinctions between lots and lots of different expressions pretty reliably. So there are differences in the expression of awe versus surprise, versus fear, versus confusion, versus interest. All of these expressions are reliably recognized. And so we just started to map them out.
We realized they're not discrete categories, that they're actually continuous and can be blended together. We realized that the number of dimensions that it takes to represent these things is large. It's not like you can reduce it down to valence and arousal; that actually captures like 20% of the variability in people's ratings that's consistent across different raters. You actually need, you know, over 30 dimensions in facial expression; in speech prosody, the tune, rhythm, and timbre of speech, you need over 18 different dimensions to represent how people are able to conceptualize speech prosody; and in vocal bursts, like laughs and sighs and screams, you need at least 24 different dimensions. And this is just what people explicitly recognize. It actually goes beyond that when you look at what's implicit in people's responses to things that is not well verbalized, or what's implicit in people's conversational signals that's not verbalized explicitly. So there are a lot of different dimensions that had never been classified before.
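For a sense of the kind of analysis behind those numbers, here is a hedged sketch: gather ratings of many expression samples from independent groups of raters, keep the scales that replicate across groups, and count how many principal components are needed to explain most of the remaining variance. The data below is synthetic and the thresholds are arbitrary; this is a generic illustration, not the published methodology.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: two independent groups of raters judge the same 1,000
# expression samples on 50 emotion scales; averaging within each group gives
# two (samples x scales) matrices whose agreement reflects shared signal.
rng = np.random.default_rng(0)
ratings_a = rng.normal(size=(1000, 50))
ratings_b = ratings_a * 0.8 + rng.normal(scale=0.6, size=(1000, 50))

# Consistency check: correlate the two groups scale by scale; only scales that
# replicate across raters count as reliably recognized distinctions.
consistency = np.array([
    np.corrcoef(ratings_a[:, j], ratings_b[:, j])[0, 1] for j in range(50)
])
reliable = consistency > 0.3

# Dimensionality: how many principal components are needed to explain, say,
# 90% of the variance in the reliable scales? With only two dimensions
# (valence and arousal) you would stop far short of that.
pca = PCA().fit(ratings_a[:, reliable])
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_dims = int(np.searchsorted(cumulative, 0.90)) + 1
print(f"{reliable.sum()} reliable scales, ~{n_dims} dimensions for 90% variance")
```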
So if you've never classified these dimensions of behavior or been able to measure them, there could never have been a science of what they mean. And so this sort of opens up the door to actually understanding human expressive behavior in the natural world, in the real world, and understanding what it means. And we got a ton of publications out of that pretty quickly. And then you also seem to have been able to study how much small variations in these expressions, or even tone of voice, can tell you. Talk a little bit about that, especially the tune and the timbre of the voice and what we can learn from each of those.
Starting point is 00:14:59 especially like the tune and the timber of the voice and what we can learn from each of those. We created new machine learning models that can reliably distinguish among all these different dimensions of vocal modulation that people form and make it distinct from the language, the phonetics of what they're saying.
That was a challenge before, so we solved that. And when you can do this, you start to see a lot of meaning in people's voice modulations that is just completely not present in the phonetics, or present to a much lesser extent. And it can be simple things, like: are you done speaking? And obviously there are a lot of emotional dimensions that get added to a lot of words. So every word carries not just the phonetics but also a ton of detail in its tune, rhythm, and timbre that is very informative in a lot of different ways. You can predict a lot of things. You can predict whether somebody has depression or Parkinson's to some extent, not perfectly. Yeah. So mental health conditions can be predicted to some extent. You can predict in a customer service call whether somebody's having a good or bad call much more accurately by incorporating expression than by language alone. So pretty much any outcome benefits from including measures of voice modulation and not just language. And what is phonetics? Phonetics meaning the underlying word forms. But specifically, when we form words,
Starting point is 00:16:29 there's a phonetic representation that gets converted to text or that's converted to semantics, basically. And that's what the transcription models are doing, is that when you convert something to text, you're actually only relying on the phonetic information, meaning the stuff that conveys word forms, and you're throwing out everything else. You're throwing out stuff that might convey,
for example, emotionality or tone of voice, et cetera. It's so interesting, because I'm thinking now about how so often what I'll do is I'll take the transcripts of podcasts and dump them into Claude and start talking to Claude about it. But of course, it's only getting the words and not the emotions in there. So there might be parts, or interviews, that I thought were particularly rough or really exciting, but it can't fully pick that out because it's just seeing the text. And there is so much meaning contained in how the words are said that you need deeper tech and deeper expertise to pick out. Yeah, I think some people forget
like there's a lot of density to the information in audio, such that if you transcribe a conversation, sometimes it doesn't even make sense to read it. You could read the words and think, my brain cannot make sense of this. But then when you hear the audio, it makes perfect sense to you. Totally. Even speaking with the media, I remember the first time I was on NPR, I was very excited. I went to the studio, recorded my interview, and from the audio, it sounded like a normal conversation. But then I read the transcript, and I'm like, man, what were you saying there? None of this makes any sense. It is such different meaning when you just look at the text versus hear the full expressed conversation and the meaning. So, yeah. If you look at political speeches, often that's the case. Oh, say more about that. Even if the writing's really good, if you look at, like, MLK speeches, if you transcribe them and read them, it just doesn't have the same effect at all. Not the same, no. No emotion to them, or less. Yeah, there are certain people who don't use the grammar or linguistic forms that would be traditional in writing, either because they're just not good at it or because, you know, they can depart from that purposely. Like, if you look at Trump's speeches, when you transcribe them, a lot of them are completely incoherent. Well, when you listen to them, too. Yeah, I mean, maybe some of that is true. Going out on a limb here. But yeah, I do think that it is interesting, because when I write for the spoken word, it'll always be different than the written word. We are going to talk about how this technology is going to be applied and already is being applied in the business world, but just still a few more theoretical questions for you. Is this type of technology something that we can use to understand animals? People talk about using AI to understand whales. Is this something that can be used in that way?
Starting point is 00:19:04 I mean, maybe some of that is true. Going out on a limb here. But, but yeah, I do think that it is interesting because when I'll write for spoken word, it'll always be different than the written word. We are going to talk about how this technology is going to be applied and already is being applied in the business world, but just still a few more theoretical questions for you. Is this type of technology something that we can use to understand animals? Like people talk about using AI to understand whales. Is this something that can be used in that nature? So the same types of machine learning models that we're training can be used for that.
Our data would not really be that relevant to that. Right, because it's a totally different language. Yeah, and maybe some of the methods, sort of the training approaches that we're using, could be helpful for that, yeah. But it's a different problem. Maybe to some extent, you know, we have interest from evolutionary biologists who want to measure the similarity between expressions in different mammals and human expressions. And so there, there is some relevance, because we can treat it like a human, sort of morph it to look like a human face, and then analyze the expression and see if it predicts the things that we want to predict. Are you going to do that type of work? We can, but it's not really that strong of a model to actually perform inferences. It does show similarities. There are dimensions of similarity between humans and chimps. Chimps have an open-mouth face display, which means something similar to when humans are laughing and playing, and they use it when they're playing, both physically and in kind of a non-physical-touch scenario.
Mice laugh. For real? Yeah, ultrasonically, so you can't hear it at all if you don't turn down the frequency to something humans can hear, but then it kind of sounds like a laugh. And you can elicit this by tickling them. If you tickle mice, they laugh like this. Yeah. Kind of makes me feel bad that we use mice so often in scientific experiments, knowing that they can laugh. Yeah, I mean, they also have a lot of empathy for each other. Like, if one mouse is trapped, the other, without any ulterior motive, will hear that mouse screaming and then go and try to investigate it and get them free. And if there's a lever, they'll figure it out. Really? Yeah, they're pretty smart, actually. Mice and rats are also more related to humans than they are to dogs. Right, well, that's why we do all this testing on mice. It's because they are close to us. Yeah, exactly. Does make me feel bad. And then, do you think that we're going to be able to use technology to understand animals? As someone so close to this, what's your prediction?
I think that the animals will have some kind of quasi-language that they use. And so we know this for primates, for example: there are different calls that mean, like, there's a snake below us, and if you play that through a speaker, they all look down. Or there's an eagle above us, and you play that through a speaker, they all look up, right? And so there's, like, a quasi-language there. I don't think they have the same level of syntax that humans have. And syntax is our ability to form sentences with nouns and verbs and relationships between entities and logic and stuff.
Starting point is 00:22:42 And I don't think they have as much of that. Some people speculate that dolphins might have that, which would be pretty surprising. Why surprising? For sure. Because they don't have like hands. And so the value of that ability, like maybe they use it to coordinate like hunting and stuff. But it's hard to imagine like why that's important for the dolphin brain to have that capability. Yeah.
But, you know, they do seem to be really smart. There's that scene in Blue Planet, I don't know if you've seen it, where the dolphins and the false killer whales, at first they're fighting, the killer whales are chasing the dolphins, and then there's a scene where the dolphins turn around and they all just stop, and they just seem to be squeaking at each other for a while. And then after that, they hunt together. For real? I haven't seen that. Yeah. It's insane. So that is suggestive to me that maybe there is something to the language, potentially. And AI's ability to decode that? It's possible that we could. People are working on it.
Starting point is 00:23:51 People are working on it. It's difficult because you don't really have a grounding for it. You kind of just have to like, like, ideally you'd have really detailed explanations of what's going on to accompany this language that's being exchanged. But like you don't have that. That's like that's how we train image models. So we have captions for the images, right? If you just had to teach a computer to understand images without captions,
Starting point is 00:24:18 it would be difficult to ground that in anything. Like, you have a computer that embeds images and can tell whether they're similar. That's fine. But at the end of the day, you want it to be able to take the image and explain it to you. Then you need some grounding for that. You might not need that much grounding if you've done enough compression of the similarities is between images first, which is sort of maybe that's what we're hoping to do with dolphins. Fascinating. Okay, I want to talk about the business applications here, including why Facebook and
Starting point is 00:24:47 Google would want to employ you to put some of this stuff into work inside their product. So let's do that right after the break. We'll be back right after this. Hey, everyone. Let me tell you about The Hustle Daily Show, a podcast filled with business, tech news, and original stories to keep you in the loop on what's trending. More than two million professionals read The Hustle's daily email for its irreverent and informative takes on business and tech news. Now, they have a daily podcast called The Hustle Daily Show, where their team of writers break down the biggest business headlines in 15 minutes or less and explain why you should care about them. So, search for The Hustle Daily Show and your favorite podcast app, like the one
you're using right now. And we're back here on Big Technology Podcast with Alan Cowen, the CEO and founder of Hume. It is a company you should check out that has this very interesting new bot called EVI, and also an API that's going to allow companies to take advantage of this ability to determine emotion through voice and maybe one day through facial expressions. So let's talk briefly about your work at Facebook and Google. So you've done this work. You've basically expanded the number of human emotions that we sort of acknowledge exist and can study, and have worked to build that into machine learning models that can recognize them. What did Facebook and Google want with that type of knowledge and technology? You know, interestingly, there's some intersections
Starting point is 00:26:17 among all the tech companies and sort of their interest in this. And they've had teams working on this now for a few years. I think it was more research-driven. in that, you know, there were obvious the applications, but it wasn't really clear which applications would be the most successful for them. Early on, it was pretty clear to me when I saw these language models that could communicate with people and you could talk to them like you were human, that this is where the technology would be most relevant once I saw those language models. Before that, I was more interested in recommendations and being able to optimize recommendation
algorithms for people's well-being, based on any implicit signals you can get of well-being, including expressions. But now it's very obvious that the right thing to do is to fine-tune large generative models to produce the right things, because there's even more flexibility in what they can produce that makes people happy. And I don't know if it was ever explicit at the big tech companies that this is what this would be used for, but that was what was most present to me. Obviously, I left in 2021 to start Hume, where we could collect the data that was needed to do this. And at the time, I think it was clear to stakeholders at some of these companies how incredibly powerful, valuable, and important this would be for the future of AI. But I don't think there was broad stakeholder alignment across these companies. I imagine it might have been a different case; if you were working at one of them after ChatGPT came out, they would immediately assign you to the problem. Yeah. So, you know, ChatGPT, I think it came out in 2021 or 22? 22? I don't remember. 22. Google internally actually had stuff like that a lot sooner. Wait, when you were at Google, did they have LaMDA up and running, and were you able to play with it? Yeah, that was part of the inspiration. Oh, talk about that a little more. Internally at Google, it was very clear what these models were, where some of these models were going,
Starting point is 00:28:31 although the business model was not clear of how to use it. And also, it wasn't really clear if you could get these models to actually solve problems versus just sound human. And I think that now it's really clear that, especially now that we can get these models to write code and do function calls, that they can actually be used as problem solving tools, even more than just like answering questions correctly. Question answering is important too. And it wasn't even clear that you could get these models to consistently answer questions correctly at the time. Because at the time, it was more about like, okay, you could get it to act like a character.
And it would hallucinate answers to questions. And it was fascinating. I think that, for me, it was pretty clear that what you actually need to get these things to answer questions correctly is a structured data set you can use to fine-tune them to produce not just plausible responses, but the right responses at the right times. And ultimately, from a philosophical perspective, the right response is the thing that makes people happy in the world. It's taken companies a long time to get to this point, but it was pretty clear to me. And then, I think also what made ChatGPT possible was an early version of that, which is: at least produce responses that raters will think are good. So they got raters to rate a set of responses and then trained the model to produce responses that the raters would think were good. And that got it to a point where people could play with it.
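For readers who want the mechanics of that rater step, here is a minimal sketch of the standard recipe: train a reward model so that responses raters preferred score higher than responses they rejected, then fine-tune the language model against that reward. This is a generic illustration of RLHF's preference-modeling stage, not OpenAI's or Hume's actual code.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a response embedding; trained from rater preferences (illustrative)."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, response_embedding):
        return self.score(response_embedding).squeeze(-1)

def preference_loss(reward_model, preferred, rejected):
    # Bradley-Terry style objective: the rater-preferred response should
    # receive a higher score than the rejected one.
    margin = reward_model(preferred) - reward_model(rejected)
    return -torch.nn.functional.logsigmoid(margin).mean()

# Toy usage with stand-in embeddings of rated response pairs.
rm = RewardModel()
preferred = torch.randn(8, 768)   # embeddings of responses raters liked
rejected = torch.randn(8, 768)    # embeddings of responses raters disliked
loss = preference_loss(rm, preferred, rejected)
loss.backward()
# The language model is then fine-tuned (for example with PPO) to maximize this
# learned reward, which is what made raw LLMs usable as chat assistants.
```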
And really, the intelligence came out for the first time as something that's useful for question answering. And that's reinforcement learning with human feedback. Yes, yes. And it's fair to say that the current iterations of chatbots, like ChatGPT with GPT-4, EVI, and maybe Claude, are all better today than what Google had with LaMDA when you were there. Oh, yeah. Oh, yeah. For sure.
We had Blake Lemoine on the show after he said it was sentient, and actually basically minutes after he was fired. This is the Google engineer who was fired after he went public saying that the technology, LaMDA, their early chatbot, was sentient. But my main takeaway from that was: look, that question aside, this technology, and I wrote about this, this technology is super powerful and it's time to start paying attention to it. And then, of course, ChatGPT came out just a few, I think a few months later. Yeah, I mean, I think the fact that the model could convince people that it's sentient is really a milestone for sure. Yeah. I mean, you could also get
it to say whatever you wanted it to say, which is why I always found it kind of silly that you would think it was sentient. Tell me you're sentient. Tell me you're sentient. Or, instead of prompting it to think it's a language model, you just prompt it that it's a monkey, and it's like, oh yeah, I love being a monkey, it's great to swing around in trees. Like, well, if you're going to trust that it's sentient when you say it's a language model, you should trust that it's a monkey. Yeah, then tell it there's an eagle above us or a snake below, see what it says. It'll be like, I'm scared. Yeah. So then talk a little bit about what you're now working on, EVI and the API at Hume.
Starting point is 00:31:57 And that is going to allow, basically it's a chatbot that will understand your voice and the expressions and the emotion coming through with it. And there's some like pretty fascinating stats. There's been, the average conversation length is 10 minutes across 100,000 unique conversations. you've seen in some circumstances, you know, 95 hours of total conversation with the bot. It's a lot of talking with it. So what is the idea here? Is it, let's talk like initially on the consumer.
What can it do that a ChatGPT can't? And is this something that's going to replace an Alexa or Siri? Or do you imagine those types of bots will start to use your technology and get better that way? Talk a little bit about your current effort. Yeah, it's an API for developers, and we put together this demo that really is just a front end; it's mostly just what our API does, which is it takes in audio and spits out audio. And the audio that it returns is an intelligent response, with voice modulations that reflect what it's saying and what you're saying. And it also has some other things, like better end-of-turn detection, because it understands your voice. And all of that's built into this API. So it does all of this with a few lines of code, and you can build it into any interface.
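To give a feel for what "a few lines of code" against an audio-in, audio-out API might look like, here is a hedged sketch. Every name below (the session class, `send_audio`, `responses`, `end_of_turn`) is a made-up placeholder standing in for whatever the real SDK exposes; it only demonstrates the shape of the loop, with a fake session so the example runs on its own.

```python
import asyncio
from dataclasses import dataclass

# Everything below is a hypothetical stand-in, NOT Hume's SDK: it only shows the
# shape of an audio-in, audio-out loop with server-side end-of-turn detection.

@dataclass
class Reply:
    audio: bytes
    end_of_turn: bool

class FakeVoiceSession:
    """Placeholder session that pretends to return one expressive audio reply."""
    async def send_audio(self, chunk: bytes) -> None:
        self._last = chunk  # a real client would stream this over a WebSocket

    async def responses(self):
        # A real API would return audio whose voice modulation reflects both the
        # words and the prosody of the user's speech; here we just echo bytes.
        yield Reply(audio=self._last[::-1], end_of_turn=True)

async def chat_once(mic_chunk: bytes) -> bytes:
    session = FakeVoiceSession()
    await session.send_audio(mic_chunk)        # audio in
    async for reply in session.responses():    # audio out
        if reply.end_of_turn:                  # end of turn detected from the voice itself
            return reply.audio
    return b""

print(asyncio.run(chat_once(b"raw-pcm-bytes")))
```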
Now, the demo: surprisingly, I didn't realize that people would enjoy talking to it so much. The fact that people are not only testing it, but then also want to talk to it for 10 minutes on average, is pretty insane to me. But then, you know, I played with it, and I was like, actually, it is pretty awesome to talk to. But it doesn't have web search. It doesn't have persistence, meaning that you can't resume chats after you log back in and stuff like that. And so it doesn't even know your name yet. There are a lot of things that we will add to it to make it more viable, because people really want to use this. But I think what makes it appealing right now is that it's the first chatbot, the first AI you can talk to, that sounds like it knows what it's saying, because the voice modulations are actually informed by the language model's understanding of your speech and voice. Which I think is very different from the alternative setup, where you have transcription and a language model and then you just do text-to-speech on the sentence that's being uttered, without any understanding of it or the context. Those language models can sound realistic, but then they start to sound uncanny pretty quickly, because they don't change their voice modulations across sentences in a way that's actually meaningful. It's done in a way that's sort of a facade of meaning, and then you realize that it's just a facade, and you're like, okay, this is kind of uncanny. Yeah, the way I describe it is that with something like a ChatGPT or a Claude, I know I'm speaking with a bot. When I speak with your demo, it feels a step closer to speaking with a human. Yeah, and I thought people would complain that it still mispronounces its words sometimes. There are some things we're improving.
But it seems like people recognize the bigger picture of what it can do. Yeah, but let's talk about the bigger picture. So first, the API thing is interesting. I mean, it's not entirely new; there have been technologies that allow, like, a salesperson, for instance, to sit on the phone and gauge the emotion of the client on the other end. So is this another one of those, or what would make this different? Well, I mean, I think humans can do sales, but one thing that humans can't do is be your personal AI that lives in your device and does everything, right? That's the future: being able to take out your device and speak to it. And this is faster than typing. Speaking is 150 words per minute; typing is like 40 or 50 words per minute. It understands your voice modulations. It understands what makes you happy and sad. And so it produces a response immediately that's better. That's the key. That, I think, is the future. So that's really what we're aiming to do: build interfaces for all kinds of apps. And it needs to be something you trust. If you're going to ask these things to do things, it needs to be a voice that you trust. It needs to be something that is optimized for you. You're really focused on the agent-slash-assistant
use case. I thought maybe you would be a companion for somebody doing customer service, so that as they're on the phone with somebody, they can sort of read the emotions that the client is expressing on the phone, or the facial expressions. But it actually seems like it's something different for you. It's that maybe this could be the customer service agent itself, or the agent that takes action on behalf of the customer to try to get the company to do what they want. Am I reading that correctly in what you're saying? Yeah. Let's imagine customer service of the future, right? So right now you call a number and you get a person who's a customer service agent, and they don't have any context on you. They don't see what's going on on your device. They need to pull all these details. There's a lot of waiting. There's a lot of investigating that needs to be done. I think the future is actually in the app itself, in the product. You can just say, hey, this isn't working, and it figures it out for you. Or, hey, I can't figure out how to do this, and it does it for you. And so customer service, I think, is going to become part of products. And it's really important for customer service, whether it's part of a product or whether it's a call center, it's really important for the person on the other end of the line to understand what is actually going to be a satisfying resolution to your problem based on what you're actually frustrated about, and to learn from people's frustrations, and to learn from what actually is clear to people, and to learn from what's boring or interesting, and to be more concise based on, you know, how quickly you can get somebody to be satisfied with a response. Does it then make sense to have a human customer service agent or an AI customer service agent? Because a human customer service agent will only learn from their set of experiences with customers, whereas an AI customer service agent could potentially learn from every single customer service interaction that it has across a full customer base.
Yeah, I mean, ultimately, I think that it's inevitable that companies will switch to AI-based customer service. I mean, we can try to hold out and say, okay, let's augment human customer service agents to be as powerful as possible, and I think for a while that is going to be better than an AI. But ultimately, it's almost crazy to think that AI can't do this job, and so there's no denying that, right? That's just an inevitable thing. But we want it to be something that customers themselves, that users, want, not just something that's cheaper for companies. And how do you do that? Users want a customer service agent that makes them happy.
And so if you optimize these models for the right thing and give them the right context, that's what you're going to get. You're going to get something that understands you better, that knows the context. You can talk to it through your app. Let's say you have an issue with your bank. You can go into your banking app and you can chat with this thing, and while you're looking at your banking app, it can bring you to different pages, right? It can say, all right, this is how you transfer money to this person, and this is what's missing from this field, and so on. Or if there's a bug, you can say, okay, I can see that there's something wrong here. With all of that context, which only an AI could process, merged with an understanding of what you're asking that's based on your voice and language, the combination of those two things, I think, is something that users and customers actually would prefer to have, because it's faster and it understands them better. It gets to a resolution faster. So that's what you're saying in terms of baking it into the product. It's not just the number you call; it's part of the product itself. Yeah. And there are intermediate things. There could be a number that you call. They can also bring up a window. There are companies that do that, and then it can fill things out for
Starting point is 00:40:54 you. Or in some cases, it is really a number and you just prefer to talk to somebody. But I do think that when it's really customer service and it's really about a product, the product should be part of the conversation should be integrated with it. Is it right that this should live in an app, or should it really be in the operating system itself or both? That's a really good question. I think people ultimately are going to interact with a lot of AIs. Some of them will be in apps, some of them will be in products.
Starting point is 00:41:27 But I think the ones that you really trust will be the ones that you've established a relationship with, and you've been able to decipher over time what they're optimized for. or it's made explicit. Like with Hume, we're very explicit that we're optimizing for you to have a positive experience. So if that's your personal AI that you trust, you're going to want to bring that with you
to different places. And so I think that it's not an operating system per se, but it's something a developer could build into any product with Hume. So a developer can take that voice that you trust and build it into their product. Then it effectively competes a little bit with Siri. Like, why wouldn't it just be built into Siri, or the Google Assistant, or whatever Google's calling it now, Gemini? They'll probably have a different name by the time this airs. Sort of similar, right? Like, Siri could be something that understands the app. But then who is it that builds that understanding into Siri, the understanding of what the app can do and the function calls and the API calls that it can make and what the different pages are? It has to be the developer. Apple is not going to build an understanding of every app into Siri. It's just too arduous. You need to put that there, and the developer should control
Starting point is 00:42:26 It has to be the developer. Like, Apple is not going to build an understanding of every app into Siri. It's just too arduish. You need to put that, and the developer should control. how the interface actually works, whether it's a voice interface or a graphical user interface, but ultimately they'll be merged together. So a developer could use, like maybe Siri will have an API
Starting point is 00:42:47 that developers can use to build into apps. That's possible. But then it's only for Apple products. And it's not interoperable across different devices that the app might be on, right? Like if you have Twitter, it's going to work on your Twitter app, but you want it to work on browser windows, right? You want it to work in different contexts.
Starting point is 00:43:04 So I don't think that it necessarily needs to be a hardware manufacturer that builds this or a company that has a specific ecosystem of products that it lives in. It should be something that's more interoperable in my minds. Okay. So your vision for the future is effectively this thing is going to get really good at not only reading our tones and facial expressions, but also smart enough to navigate the web and the app ecosystem. and also smart enough to deliver the information to you via voice in a way that is understandable and effective. Yeah, and that's what we've made available to developers, right?
Starting point is 00:43:46 So developers can take our API and build in their own function calls and web search integrations. We have our own back-end web search integration that we're going to launch. And all of that can be done with a few lines of code. So this is like, that is, yeah, exactly right. okay uh you mentioned you might not need hardware for it what do you think about these new this new generation of hardware like the humane pin and the rabbit are one are these just kind of going to be a flash in the pan where we eventually just use all these um we get all these use cases on our phone anyway i do think it's worth people thinking about different form factors and testing
Starting point is 00:44:28 and seeing is this going to enhance the experience because AI brings so much to the table that you don't necessarily want to be stuck with a smartphone, even though smartphones might actually be ultimately the right form factor. We don't know, right? I don't think people really know. So it's worth the experimentation. That being said, I think we should be able to separate out, like, what are the hardware requirements with what are the kind of the sensors that we need, right?
Starting point is 00:44:51 And we don't need two different devices with us that both satisfy the hardware requirements or that both satisfy the same sensor requirements. what should be there should be thought about like hey like this piece of hardware satisfies these sensor requirements this piece of hardware satisfies these kind of compute requirements and they go together I don't think the smartphone is going to be replaced anytime soon but I think the things that appropriately augment the smartphone could gradually take over so we've talked about how this customer service and product and voice all merge are there going to be other interesting applications that could come out of this. Maybe, you know, one example I've heard brought up
Starting point is 00:45:35 is like an AI companion for elderly people who are alone or, I don't know, maybe even young people who are alone. What do you think about that? Then you really get into weird situations, right? Because there can be, it's not crazy to say there can be romantic relationships that develop with these things that already are. Yeah, I think we have to be pretty careful about how these things are optimized. So, you know, you shouldn't be talking with an AI that's going to hassle you if you haven't talked to in a while. That's a bad sign. If you talk to an AI, if you open it up and it's like, hey, I haven't heard from you in so long. What the hell? That's probably a bad AI. Like, you should avoid that at all costs. Right. It should be something that's only, that only cares
Starting point is 00:46:19 about your well-being and doesn't represent itself as having any inherent desires other than making you happy. That's the important criteria for this. So elderly care is a really great use case, right? Like this is an area where you could augment and you could also establish very healthy relationships that are helpful. I think elderly people, yes, they have like problems with loneliness, but also they just have like everyday problems that they need to be solved. And they sort of go together, like they need help in both ways. And so if you have something that can satisfy the everyday problems that they have, but do it in a way that gives them, they might as well satisfy some of the emotional challenges, if it's going to be satisfying the physical challenges
Starting point is 00:47:08 anyway. I think that's a great thing. Are you nervous that this might end up being something that replaces? Like, we obviously know that there's going to be some jobs that are replaced by AI. But the question is, like, is it going to be many, many jobs at once or slow, or will it go slow enough that we'll be able to adjust for, like, the added productivity that comes to the economy, hence growth, hence more and different jobs? So are you concerned about the speed here? Yeah, I mean, to me, the speed is going to be unprecedented of the old jobs being replaced. And that, that's true. But I do think that the speed of new jobs being created and the accessibility of those new jobs is something that people greatly underestimate.
Starting point is 00:47:53 So, yeah, what do you think is going to happen there? Well, for example, like, if you actually have AI that can program apps and it works for basic kinds of ideas, then a lot of people can build apps. A lot of people now are programmers effectively. They don't necessarily have the depth the really good programmers going to have, but they can do basic things. And that sort of unlocks a whole new set of jobs. because we actually have a great deficit of software in the world.
And we have a deficit of hardware, too, and of people who can build things that solve problems. There are so many problems in this world that are unsolved, particularly in spaces where you don't see a lot of computer scientists and engineers. Such as? Normally, education, healthcare, therapy, you know, everyday kinds of spaces that you inhabit, like retail stores, or maybe the fashion world. There are just places where there are not enough computer scientists and engineers solving problems. And in many cases, there are businesses that are too small for a computer scientist to go and build something that makes sense for them economically. But if you had an AI that could build the app, they would solve that problem. So this actually kind of empowers smaller businesses in many ways. There are businesses that, due to the fact that computer scientists are only used to working on problems that scale, will out-compete the smaller mom-and-pop
shops. But I think that goes away if every small business can instantly create its own app just by describing it, and facilitate those processes, and do internal infra that makes sense for them as a small business. And so I think in that sense, AI actually creates a lot of jobs, and those jobs are very accessible to many kinds of people. Do you think this is also something that can empower us with robotics? You know, think about a humanoid robot. If it's just processing text and not emotion, it's going to be fairly worthless; maybe that's why we haven't seen any good ones. But if it can respond to your emotion, know when you're done talking, start to engage you as a more human-like type of thing, then maybe you're a step closer to that. Yeah, I mean, with robotics, the fact that it has a robotic body just increases the number of affordances it has to actually improve your emotional experience. So it basically has this field of behaviors that it can do, and it should choose behaviors that make you happy, even without you asking. And so the task really is one of emotional intelligence. That's really how it breaks down. Now, just having language understanding gets you a long way. And the fact that we can now train robots that can figure out for themselves how to engage in certain behaviors, it's great. I think that gets you a long way. But you still have to be really, really explicit with your instructions with them. With empathic AI tools, you won't have to be as explicit. It'll just be like, oh, I see that you prefer these items to be arranged in this way, and it moves around your furniture. I mean, obviously it'll clean; that's something that is obvious, that you don't need to ask it to do. But maybe you don't want it to be cleaning sometimes. You like something being left on the table, and it kind of needs to have empathy to figure that out. And so that's, I think, the future. Is this why Siri and Alexa are so bad? They've got zero contemplation of what human speech means and zero contemplation of emotion. And if that's the case, do you think we're
Starting point is 00:52:06 about to see like I mean Apple has this big event coming up in June at WWDC where it's supposed to be an AI event. I'm curious if they're working on something like this. But if this is the case, do you think we're about to see a vast improvement in these type of voice assistance? Yeah, I mean, it's inevitable for sure. I think that every company is going to build a voice assistant. Apple already happens to have one, but the ones that aren't are going to build voice assistants, the ones that have them, they'll get a lot better. I mean, they're kind of based on legacy technologies right now. And that's about to change in a dramatic way.
So how soon do you think it's going to be until Hume is acquired, or are you not so set on that? Well, I mean, I think, like I said before, there's room for a company to provide this in a way that's agnostic to the hardware environment, the products, the company that owns the products. It's agnostic to the companies building the frontier models, because we can use any frontier model to give you a state-of-the-art response. If a company building one of the largest language models built something like what we have, they're going to be kind of wedded to using their own language model exclusively. But we actually use all kinds of tools. So we have our own language model that is extremely good at conversation, but it can use all kinds of different tools, and it's agnostic to where they come from, because we're not part of those companies. So I think that there's a huge amount of value in that. So I don't necessarily think that we need to be acquired. If there's a company that is value-aligned, then, yeah, we'll see. Well, I look forward to writing the news story when that happens. But in the meantime, it is great to be able to speak with you and hear about this really fascinating area of technology. I hope we can do it again. I feel like we've really just scratched the surface here, but I appreciate you coming on, Alan, and sharing so much about this discipline and sort of how it developed and where it's going. Because it does seem like, as we get toward more of an agent-style approach
in AI, more of a voice-style approach, this is going to be the groundwork; like, the table stakes is building in this type of technology. And it's great to have a preview of it, and really not even a preview, but a conversation about what's going on today before the rest of the world catches on. So thank you so much for coming on the show. Of course. Had a great time. Thank you.
Thanks so much. All right, everybody. Thank you for listening. We'll be back on Friday with a new show. Breaking down the news, Ranjan Roy is back. He was out in India, and we're going to talk about the trip,
Starting point is 00:54:42 and he'll be back with us on Friday. Until then, we'll see you next time on Big Technology Podcast.
