Big Technology Podcast - Teaching AI To Read Our Emotions — With Alan Cowen
Episode Date: May 1, 2024. Alan Cowen is the CEO and founder of Hume AI. Cowen joins Big Technology Podcast to discuss how his company is building emotional intelligence into AI systems. In this conversation, we examine why AI needs to learn how to read emotion, not just the literal text, and look at how Hume does that with voice and facial expressions. In the first half, we discuss the theory of reading emotions and expressions, and in the second half we discuss how it's applied. Tune in for a wide-ranging conversation that touches on the study of emotion, using AI to speak with — and understand — animals, teaching bots to be far more emotionally intelligent, and how emotionally intelligent AI will change customer service, products, and even staple services today. --- Enjoying Big Technology Podcast? Please rate us five stars ⭐⭐⭐⭐⭐ in your podcast app of choice. For weekly updates on the show, sign up for the pod newsletter on LinkedIn: https://www.linkedin.com/newsletters/6901970121829801984/ Want a discount for Big Technology on Substack? Here’s 40% off for the first year: https://tinyurl.com/bigtechnology Questions? Feedback? Write to: bigtechnologypodcast@gmail.com
Transcript
Let's talk with the CEO of the leading AI startup for emotional intelligence,
looking at where this exciting new discipline of AI is heading right after this.
Welcome to Big Technology Podcast, a show for cool-headed, nuanced conversation of the tech world and beyond.
We have a great show today going into one of the more interesting and emerging areas of AI research and application.
I think it's going to be fascinating.
We're probably going to see this stuff packed into basically every AI that we touch going forward.
and we're going to speak with somebody who's really at the cutting edge of where this is all going.
Alan Cowen is here. He is the CEO and founder of Hume AI, which is a company with a bot that we've
talked about on the show here in the past, one that can sense the emotion in your voice and reply
in kind. It's very fascinating. And I cannot tell you how excited I am to have Alan on the show.
Alan, welcome to Big Technology.
Hi, Alex. Thanks for having me on the show.
We're going to speak with Dwarkesh Patel in the next couple weeks, but he had
a podcast episode with Mark Zuckerberg basically right around the introduction of Llama 3,
and they were talking about the cutting edge of AI research.
And Zuckerberg says, emotional AI for some reason is something that I don't think a lot of people
are paying attention to, but there's a lot of potential there.
One modality that I'm pretty focused on, that I haven't seen as many other people in the industry
focus on, is sort of like emotional understanding.
Like, I mean, so much of the human brain is just
dedicated to understanding people and kind of like understanding your expressions and
emotions, and I think that that's like its own whole modality.
What do you think he was getting at when he said that?
I'm not entirely sure what's in his mind, but it's definitely the right problem.
I think the reason it's the right problem is that when you're trying to address people's,
let's say, concerns with AI and also when you're trying to just meet their requests,
the most important thing is emotional intelligence. It's the ability to take people's preferences
into account in your response. And so emotional AI is, first of all AI is emotional in the sense
that like all AI has to do this. But AI that's specifically trained to do this is going to learn
to produce responses that make people happier. And that's that's the fundamental objective for
AI. And why is that important in terms of advancing AI? So if you're talking to
ChatGPT, you're implicitly saying, okay, I'm going to give you a way to make me happy,
which is to fulfill this request.
Oftentimes you don't even have an explicit request, and when you do, it's actually fundamentally ambiguous, right?
So the goal of an interface is to take this ambiguous slice of behavior, and it could be language,
or it could be language plus audio, plus facial expression, plus eye contact, whatever behavior
it has access to.
And it's trying to produce, contingent on that, the best estimate of what might
make you happy if it responds in a given way, right?
So that's literally what it's trying to do.
It should be able to learn that by understanding human emotional reactions to things.
That's like the fundamental way that humans learn this stuff.
Like we can either simulate what your emotional response will be to me doing something for you
based on putting myself in your shoes or we can look at your emotional reaction based on that,
figure out whether I did the right thing or not, and adjust. And those are really the ways that we
learn to make each other happy. So much of that is actually done in voice, done in expression.
And in text, it's so much harder. Let's start with the hardest problem and then kind of work back
from there. How do you suss out emotion in, like, a text exchange? In text, there's some
cues, right? And most of the time we're not actually sussing out emotion in texts. We just have
certain conventions, and we use those conventions to dictate how we communicate. And so we know that this
is a very narrow channel through which we can express our desires. And so we know that there's fundamental
ambiguity. So we just basically limit what we try to do. We don't try to do too much with text. We don't
try to bite off more than we can chew. What are some of the conventions that you'd say we rely on
with text?
Punctuation and capitalization, and where you put periods, and when you want to add an extra
y to the hey, or like all these different conventions, and they just add these kind of
stereotyped meanings.
So it's a message passing system.
It's very frustrating to correspond by text if what you actually are trying to say is anything
more complex than that.
What's coming across here is that it's actually remarkable
how limited this current set of large language models is, given that it's all text,
and it's still blowing people's minds, but there's really a limited range in which it can be
useful. Is that right? Yeah, and it's never seen people's emotional reactions in any other
modality. It learns some things from text about what people want and don't want, but it actually
doesn't have that layer of understanding of how is this going to make people feel. So even via text,
it seems to be missing something. And yeah, it can solve problems. If you specify
the problem in exactly the right way. For example, in the way that these LLMs are evaluated,
you just literally ask it a multiple choice question. Like, if you specify it with that level of
constraint, then they're really good. But anytime you want to do something more open-ended,
it's a relatively unsatisfying experience. So this is why people are gauging how good these things are
by asking them to take the LSAT. It's like not an accident. It's like this is sort of the only way
to evaluate them within their constraints. And that's
how we eval. So like the biggest eval is MMLU that people use. And it's literally
you can get each question right or wrong. And then you just look at the percentages they got
right. It's basically a big multiple choice test with some other kinds of questions in it.
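[Editor's note: to make the multiple-choice eval point concrete, here is a minimal sketch of how a benchmark in the MMLU style is typically scored. The `ask_model` callable and the question format are illustrative assumptions, not the official evaluation harness.]

```python
# A minimal sketch of scoring an MMLU-style multiple-choice eval: format each
# question with lettered options, ask the model for a single letter, and report
# the fraction answered correctly. `ask_model` is a stand-in for whatever LLM
# call you use; the real benchmark covers dozens of subjects.

from typing import Callable

def score_multiple_choice(
    questions: list[dict],            # each: {"prompt": str, "choices": [...], "answer": "A"}
    ask_model: Callable[[str], str],  # returns a letter such as "A"
) -> float:
    letters = "ABCD"
    correct = 0
    for q in questions:
        options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(q["choices"]))
        prompt = f"{q['prompt']}\n{options}\nAnswer with a single letter."
        prediction = ask_model(prompt).strip().upper()[:1]
        if prediction == q["answer"]:
            correct += 1
    return correct / len(questions)  # the reported score is just this percentage
```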
When I first started using your product, you have a bot, it's called EVI. And when I first
started using it, it's kind of in demo mode now. And the API just opened. We can talk about that.
But I was like, oh, this seems like a large language model in some ways.
In that, like, it's giving some of those kind of wink-wink responses, like, as an AI assistant, these are my feelings.
As an AI system, my purpose is not to replicate the full human experience, but to engage in meaningful and empathetic conversations.
But it's voice.
And it's now coming into focus that you decided on voice because there's so much more signal.
And also so much more realm of possibility in terms of interacting with this than you can with a text bot.
There's so much more data that comes in via voice and can go out via voice.
Then we can really take a next step in terms of the way that generative AI is moving.
Previous technology relied exclusively on transcribing the voice and taking away all that extra meaning that's in your vocal modulations.
So this technology didn't exist before Hume.
When you add that in, what's remarkable is that the model can actually learn what vocal
modulations mean on its own. You just need to add it in the right way. We have these
expression measurement tools that we've trained that are extremely accurate. They can actually
encode 756 dimensions of embedding, of like vocal modulation and 756 dimensions of facial expression
per word that you're saying. So there's these incredibly rich embeddings that we're adding in that
add a lot more information. And the model can actually
learn that. And it turns out that it's just more satisfying to use when it has that data.
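[Editor's note: the conversation describes per-word expression embeddings being added alongside the transcript. The sketch below is a hypothetical illustration of that kind of data structure; the `measure_prosody` stand-in and the field names are assumptions, not Hume's actual pipeline. The 756-dimension figure is the one mentioned above.]

```python
# Hypothetical illustration of pairing each transcribed word with an embedding of
# how it was said, so a downstream model sees more than the bare transcript.
# `measure_prosody` is a placeholder for a trained expression-measurement model;
# here it returns random numbers purely to keep the sketch self-contained.

from dataclasses import dataclass
import numpy as np

PROSODY_DIM = 756  # per-word vocal-modulation embedding size mentioned in the episode

@dataclass
class ExpressiveWord:
    text: str            # the word the transcriber recovered (the phonetic content)
    prosody: np.ndarray  # the modulation information a plain transcript throws away

def measure_prosody(audio_segment: np.ndarray) -> np.ndarray:
    # Placeholder for a real expression model; random output for illustration only.
    rng = np.random.default_rng(0)
    return rng.standard_normal(PROSODY_DIM).astype(np.float32)

def attach_expression(words: list[str], audio_per_word: list[np.ndarray]) -> list[ExpressiveWord]:
    """Pair each transcribed word with an embedding of how it was spoken."""
    return [ExpressiveWord(text=w, prosody=measure_prosody(a))
            for w, a in zip(words, audio_per_word)]
```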
Take us through very briefly a sort of ride through the study of emotion in humans where it was
lacking and how it's now being applied into robots. Or maybe not that briefly. I feel like
that's a tall task to do it a minute. I mean, it starts with Charles Darwin. I won't go back that far.
But basically, like, he was sketching out facial expressions in both man and animals. And he wrote
a book called The Expression of the Emotions in Man and Animals, and he recognized the complexity
of it and the nuance. But somehow that kind of got lost over time. Nobody really talked about
that book for a while. In the mid-20th century, there was Paul Ekman who decided he would
tackle the question of what is it about facial expression that may or may not be universal.
And so he cataloged these six facial expressions as an arbitrary test, more or less, of what these
broad expression stereotypes might mean to different people.
And he chose them because they're very different.
He only chose one positive expression.
He chose five negative expressions.
Wait, which ones are they?
Anger, disgust, fear, happiness, sadness, and surprise.
Wow.
Surprise could be good, but it's definitely largely over-indexed on bad feelings.
It was bad in his conceptualization of it.
Yeah.
So positive surprise is good, negative surprise is bad.
And then there's also related epistemic emotions, like awe and interest, which are good, or confusion, which is bad.
So, to varying extents, with varying levels of arousal and excitement.
Anyway, but none of these distinctions are in those images.
And for some reason, people just stuck to those six categories for a while.
And there was a countervailing narrative that actually there are fewer granularities in emotion that people recognize,
that actually it's just valence, like positive versus negative, versus arousal, calm or excited.
I don't know why that was the countervailing argument, except that maybe people were just limited
by the amount of data that they could collect.
And this was like the closest thing to data driven that they had, which is like if you do
dimensionality reduction on 100 samples, you get positive negative, low arousal, high arousal
as your dimensions.
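[Editor's note: a toy illustration of the small-sample point, under assumed synthetic data: PCA components estimated from two independent samples of 100 ratings tend to agree only for the strongest dimensions, while much larger samples stabilize many more. The simulation parameters are arbitrary assumptions chosen for illustration.]

```python
# Toy illustration of the small-sample point: estimate PCA components from two
# independent samples of emotion ratings and count how many of the top components
# look the same in both. With only 100 samples, fewer dimensions replicate than
# with a very large sample. Entirely synthetic, arbitrary parameters.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_terms, n_latent = 40, 25                   # 40 emotion terms, 25 "true" dimensions
strengths = np.linspace(3.0, 0.4, n_latent)  # a few strong dimensions, many weak ones
loadings = rng.standard_normal((n_latent, n_terms))

def sample_ratings(n: int) -> np.ndarray:
    latent = rng.standard_normal((n, n_latent)) * strengths
    return latent @ loadings + rng.standard_normal((n, n_terms))

def replicable_components(n: int, k: int = 10) -> int:
    """How many of the top-k components roughly agree across two independent samples?"""
    a = PCA(k).fit(sample_ratings(n)).components_
    b = PCA(k).fit(sample_ratings(n)).components_
    # Absolute cosine similarity of index-matched components; a rough check,
    # since sign and ordering of the weak components are unstable by design.
    sims = np.abs((a * b).sum(axis=1) /
                  (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)))
    return int((sims > 0.8).sum())

print(replicable_components(100))      # small sample: mostly the strongest dimensions replicate
print(replicable_components(100_000))  # large sample: many more dimensions become stable
```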
Wait, arousal in terms of what? What does arousal mean?
Arousal means, like, excitement. Not, like, physiological arousal, meaning, like, actual arousal, meaning
activation of the autonomic nervous system. Okay, good to have that clarified. Okay. Yeah. And,
I mean, there's lots of dimensions to that, obviously. Like, one of them is actually sexual
arousal. Like, that would be different. To me, like, we recognize that it's very different from the
arousal that goes with fear and, like, running for your life, right? And yet in this theory they're
compressed. I know, well, depends who you ask, but okay, continue. You could say that one can cause the other.
I don't know. Well, anyway.
It's true.
Not to go too far in the digression, but there is scientific literature that says danger and arousal are connected.
I mean, Esther Perel had a whole book on it called Mating in Captivity.
That's excellent.
Yeah.
Well, yeah, it might not be danger.
Not our specialty.
Novelty, right?
Yeah.
And there was misattribution of arousal, which is this old experiment where you take somebody onto a bridge, and they're, like, very excited,
and then you test whether they basically are aroused or whether they're attracted to you.
And this is not something that necessarily replicates in every condition.
Okay, so there's lots of different kinds of emotion.
They get reduced down to these six basic categories or two dimensions for a very long time in research.
And a lot of that has to do with the fact that you can only collect so much data.
The data analysis tools that were available to people for a long time were very coarse.
For example, there weren't even really computers that psychologists were using to analyze data.
Incidentally, psychologists did, like the field of stats sort of came out of psychology, but that was earlier.
And then, you know, when it came to actually applying it to lots of data, that was much harder.
And that didn't come until much later.
So now data science comes around sort of in the 2000s, right, and becomes a really big thing.
And psychology was one of the last places that it was applied.
So it starts to get applied to neuroscience and biology and genetics and all of these other areas.
But people didn't really have the tools to analyze human behavior until relatively recently.
And part of that is like even when you have the data analysis techniques, the data collection techniques are difficult.
So what I pioneered in my PhD in psychology and while I was working for Google and Facebook was a new kind of data collection that you could then use to collect enough psychologically controlled data to apply data science to psychology data.
And that's ultimately what gave us a lot of surprising results about
the dimensionality of emotion.
What were the surprising results?
What you can see clearly is that people
make really granular distinctions between lots and lots
of different expressions pretty reliably.
So there are differences
in the expression of awe versus surprise,
versus fear, versus confusion, versus interest.
All of these expressions are reliably recognized.
And so we just started to map them out.
We realized they're not discrete categories,
that they're actually continuous and can be blended together.
We realized that the number of dimensions that it takes to represent these things is large.
It's not like you can reduce it down to valence and arousal.
That actually captures like 20% of the variability in people's ratings that's consistent
across different raters.
You actually need, you know, in facial expression, over 30 dimensions. And in speech prosody,
the tune, rhythm, and timbre of speech, you need over 18
different dimensions to represent how people are able to conceptualize speech prosody,
and in vocal bursts, like laughs and sighs and screams, you need at least 24 different dimensions.
And this is just what people explicitly recognize. It actually goes beyond that when you look
at what's implicit in people's responses to things that is not well verbalized or what's
implicit in people's conversational signals that's not verbalized explicitly. So there's a lot of
different dimensions that had never been classified
before. If you've never classified these dimensions of behavior or been able to measure them,
There could never have been a science of what they mean. And so this sort of opens up the door
to actually understanding human expressive behavior in the natural world, in the real world,
and understanding what it means. And we got a ton of publications out of that pretty quickly.
And then you also seem to like have been able to study how much small variations in these expressions
or even tone of voice can tell you.
And talk a little bit about that,
especially like the tune and the timbre of the voice
and what we can learn from each of those.
We created new machine learning models
that can reliably distinguish
among all these different dimensions
of vocal modulation that people form
and make it distinct from the language,
the phonetics of what they're saying.
That was a challenge before,
so we solved that.
And when you can do this,
you start to see a lot of meaning
in people's voice modulations.
that is just completely not present in the phonetics, or present to a much lesser extent.
And it can be simple things, like, are you done speaking? And obviously, like, there's a lot of
emotional dimensions that get added to a lot of words. So every word carries not just
the phonetics but also a ton of detail in its, like, tune, rhythm, and timbre that is very informative
in a lot of different ways. You can predict a lot of things. You can predict whether somebody has depression or
Parkinson's to some extent, not perfectly. Yeah. So like mental health conditions can be
predicted to some extent. You can predict in a customer service call, whether somebody's
having a good or bad call much more accurately by incorporating expression than with
language alone. So for pretty much any outcome, it benefits to include measures of voice
modulation and not just language. And what is phonetics? Phonetics meaning, like, the underlying words,
but specifically when we form words,
there's a phonetic representation that gets converted to text
or that's converted to semantics, basically.
And that's what the transcription models are doing,
is that when you convert something to text,
you're actually only relying on the phonetic information,
meaning the stuff that conveys word forms,
and you're throwing out everything else.
You're throwing out stuff that might convey,
um, for example, emotionality or, um, tone of voice, et cetera.
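[Editor's note: a minimal sketch of the contrast being drawn. A transcript keeps only word forms, but a few coarse prosodic features, such as pitch and loudness, can be computed from the same audio and carried along with the text. Real systems use much richer learned representations; the librosa-based feature set here is an assumed, simplified stand-in.]

```python
# Sketch of keeping a few coarse prosodic features alongside the transcript instead
# of discarding them: pitch (the "tune") and loudness, plus how much each varies.

import numpy as np
import librosa

def coarse_prosody_features(audio_path: str) -> dict:
    y, sr = librosa.load(audio_path, sr=None, mono=True)

    # Fundamental frequency: how high or low the voice sits and how much it moves.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    f0_voiced = f0[~np.isnan(f0)]

    # RMS energy as a rough loudness measure, plus its variation over time.
    rms = librosa.feature.rms(y=y)[0]

    return {
        "pitch_mean_hz": float(np.mean(f0_voiced)) if f0_voiced.size else 0.0,
        "pitch_var_hz": float(np.std(f0_voiced)) if f0_voiced.size else 0.0,
        "loudness_mean": float(np.mean(rms)),
        "loudness_var": float(np.std(rms)),
    }

# Downstream, features like these would be concatenated with text features
# (for example, an embedding of the transcript) before predicting an outcome
# such as customer-call satisfaction.
```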
It's so interesting because I'm thinking now about like, so often what I'll do is I'll take
the transcripts of podcasts and dump them into Claude and start talking to Claude about it.
But of course, it's only getting the words and not the emotions in there.
So there might be parts, or might be interviews, which I thought were, like,
particularly rough or really exciting, but it can't fully pick that out because it's just
seeing the text. And there is so much meaning contained in the words that are said that
you need deeper tech and deeper expertise to pick out. Yeah, I think some people forget
like there's a lot of density to the information in audio, such that if you
transcribe a conversation, sometimes it doesn't even make sense to read it. Like you could read
the words and you're like, this actually does not, my brain cannot make sense of this. But then
when you hear the audio, like it makes perfect sense to you. Totally. Like even speaking with the
media, I remember I was on, the first time I was on NPR, I was very excited. I went to the studio,
recorded my interview. And from the audio, it sounded like a normal conversation. But then I read
the transcript. And I'm like, man, what were you saying there? None of this makes any sense.
But it is such different meaning when you just look at the text versus, like,
hear, like, the full expressed conversation and the meaning. So yeah. Um, if you look at political speeches,
often that's the case. Oh, say more about that. Even if the writing's really good, you look at, like,
MLK's speeches, if you transcribe them and read it, it just doesn't have the same effect at all. Not the same.
No, nowhere close. No emotion to them, or less? Yeah, there's certain people who don't use the right
grammar or linguistic forms that would be traditional in
writing, either because they're just not good at it or because, you know, they can depart
from that purposely. Like, if you look at, like, Trump's speeches, when you
transcribe them, a lot of them are completely incoherent. Well, when you listen to them, yeah,
I mean, maybe some of that is true. Going out on a limb here. But, but yeah, I do think that
it is interesting because when I'll write for spoken word, it'll always be different than the
written word. We are going to talk about how this technology is going to be applied and already is
being applied in the business world, but just still a few more theoretical questions for you.
Is this type of technology something that we can use to understand animals?
Like people talk about using AI to understand whales.
Is this something that can be used in that nature?
So the same types of machine learning models that we're training can be used for that.
Our data would not really be that relevant to that.
Right, because it's a totally different language.
Yeah, and maybe some of the methods, sort of the training approaches that we're using,
could be helpful for that, yeah.
But it's a different problem.
Maybe, to some extent. You know, we have interest from evolutionary biologists who want to measure
the similarity between expressions in different mammals and human expressions.
And so there, maybe there's some, there is some relevance, because we can treat it
like a human, sort of morph it to look like a human face,
and then analyze the expression and see if it predicts the things that we want to predict.
Are you going to do that type of work?
We can, but, like, it's not really that strong of a model to actually perform inferences.
It does show similarities.
Like there's dimensions of similarity between humans and chimps.
Chimps have an open-mouth face,
which means something similar to when humans are laughing and playing,
and they use it when they're playing both physically and like kind of in a non-physical touch scenario.
Um, mice laugh. For real? Yeah, ultrasonically. So, like, you can't
hear it at all if you don't turn down the frequency of it
to something humans can hear, but then it kind of sounds like a laugh. And you can elicit this
by tickling them. If you tickle mice, they laugh. Yeah, kind of makes me feel bad that
we use mice so often in scientific experiments knowing that they can laugh.
Yeah, I mean, they also have a lot of empathy for each other.
Like if one mouse is trapped, the other, without any ulterior motive, will hear that
mouse screaming and then go and try to investigate it and get them free.
And if there's a lever, they'll figure it out.
Really?
Yeah, they're pretty smart, actually.
Mice and rats are also more closely related to humans than they are to dogs.
Right, well, that's why we do all this testing on mice, it's because they are close to us. Yeah, exactly.
Does make me feel bad. And then, um, do you think that we're going to be able to use
technology to be able to understand animals? As someone so close to this, what's your prediction?
I think that the animals will have some kind of quasi-language that they use. And so we know this
for, like, primates, for example, that there's different calls that mean, like, there's a snake below us.
And then if you, like, play that through a speaker, they all look down.
Or, like, there's an eagle above us.
And you play that through a speaker, they all look up, right?
And so there's, like, a quasi language there.
I don't think they have the same level of syntax that humans have.
And syntax is our ability to form sentences with nouns and verbs and relationships between entities and logic and stuff.
And I don't think they have as much of that.
Some people speculate that dolphins might have that, which would be pretty surprising.
Why surprising?
For sure.
Because they don't have like hands.
And so the value of that ability, like maybe they use it to coordinate like hunting and stuff.
But it's hard to imagine like why that's important for the dolphin brain to have that capability.
Yeah.
But, you know, they do seem to be really smart.
There's that, like, scene in Blue Planet.
I don't know if you've seen it, where the dolphins and the false killer whales are, first,
they're, like, fighting, like, the false killer whales are chasing the dolphins.
And then there's a scene where the dolphins, like, turn around.
And they all just stop.
And they just seem to be squeaking at each other for a while.
And then after that, they hunt together.
For real?
I haven't seen that.
Yeah. It's insane.
So that is suggestive to me
that maybe there is something to the language, potentially.
And AI's ability to decode that?
It's possible that we could.
People are working on it.
People are working on it.
It's difficult because you don't really have a grounding for it.
You kind of just have to, like, ideally you'd have really detailed explanations
of what's going on to accompany this language that's being exchanged.
But like you don't have that.
That's like that's how we train image models.
So we have captions for the images, right?
If you just had to teach a computer to understand images without captions,
it would be difficult to ground that in anything.
Like, you have a computer that embeds images and can tell whether they're similar.
That's fine.
But at the end of the day, you want it to be able to take the image and explain it to you.
Then you need some grounding for that.
You might not need that much grounding if you've done enough compression of the similarities
between images first, which is sort of, maybe, what we're hoping to do with dolphins.
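[Editor's note: the grounding point in miniature, as an assumed sketch. With embeddings alone you can only say which recordings are similar to each other; describing what one means requires paired labels like captions, which is the data missing for dolphins. The `embed` function is a hypothetical stand-in for a self-supervised encoder.]

```python
# Retrieval is possible without grounding ("this call sounds like that call"),
# but explanation is not: mapping an embedding to a description requires
# (recording, description) pairs, which do not exist for dolphin calls.

import numpy as np

def embed(item: np.ndarray) -> np.ndarray:
    # Stand-in for a learned encoder; here just a normalized copy of the input.
    return item / (np.linalg.norm(item) + 1e-9)

def most_similar(query: np.ndarray, corpus: list[np.ndarray]) -> int:
    # Nearest-neighbor retrieval by cosine similarity is all you can do
    # without any grounding data.
    q = embed(query)
    scores = [float(q @ embed(item)) for item in corpus]
    return int(np.argmax(scores))
```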
Fascinating. Okay, I want to talk about the business applications here, including why Facebook and
Google would want to employ you to put some of this stuff into work inside their product.
So let's do that right after the break. We'll be back right after this.
Hey, everyone. Let me tell you about The Hustle Daily Show, a podcast filled with business, tech news,
and original stories to keep you in the loop on what's trending. More than two million
professionals read The Hustle's daily email for its irreverent and informative takes on business
and tech news. Now, they have a daily podcast called The Hustle Daily Show, where their team of
writers break down the biggest business headlines in 15 minutes or less and explain why you should
care about them. So, search for The Hustle Daily Show in your favorite podcast app, like the one
you're using right now. And we're back here on Big Technology Podcast with Alan Cowen, the CEO and
founder of Hume. It is a company you should check out that has this very interesting new bot called
EVI, and also an API that's going to allow companies to take advantage of this ability to determine
emotion through voice and maybe one day through facial expressions. So let's talk briefly about
your work at Facebook and Google. So you've done this work. You've basically expanded the number of
human emotions that we sort of acknowledge exist and can study, and have worked to build that
into machine learning models that can recognize them. What did Facebook and Google want with
that type of knowledge and technology? You know, interestingly, there's some intersections
among all the tech companies and sort of their interest in this. And they've had teams working
on this now for a few years. I think it was more research-driven,
in that, you know, there were obvious applications, but it wasn't really clear which
applications would be the most successful for them.
Early on, it was pretty clear to me when I saw these language models that could communicate
with people and you could talk to them like you were human, that this is where the technology
would be most relevant once I saw those language models.
Before that, I was more interested in recommendations and being able to optimize recommendation
algorithms for people's well-being based on any implicit signals you can get of well-being,
including expressions. But now it's very obvious that the right thing to do is to fine-tune
large generative models to produce the right things, because there's even more flexibility in what
they can produce that makes people happy. And I don't know if it was ever explicit at the big tech
companies that this is what this would be used for, but that was what was most present to me.
Obviously, I left in 2021 to start Hume, where we could collect the data that was needed to do this.
And at the time, I think it was clear to stakeholders at some of these companies,
how incredibly powerful this would be, how invaluable and important for the future of AI.
But I don't think there was broad stakeholder alignment across these companies.
I think it might have been a different case.
I mean, I imagine if you were working at one of them after ChatGPT came out,
they would immediately assign you to the problem. Yeah. Um, so, you know, ChatGPT, I think it came out
in 2021 or '22? '22, I don't remember. '22. Google actually had stuff like that internally a lot
sooner. Wait, when you were at Google, did they have LaMDA up and running, and were you able to
play with it? Yeah, that was, like, part of the inspiration. Oh, talk about that a little more. Internally at
Google, it was very clear what these models were, where some of these models were going,
although the business model was not clear of how to use it. And also, it wasn't really clear if
you could get these models to actually solve problems versus just sound human. And I think that now
it's really clear that, especially now that we can get these models to write code and do function
calls, that they can actually be used as problem solving tools, even more than just like
answering questions correctly. Question answering is important too.
And it wasn't even clear that you could get these models to consistently answer questions
correctly at the time.
Because at the time, it was more about like, okay, you could get it to act like a character.
And it would hallucinate answers to questions.
And it was fascinating.
I think that for me, it was pretty clear that what you actually need to get these things to
answer questions correctly is a structured data set you can use to fine-tune them to produce
not just plausible responses, but the right responses at the right times.
And ultimately, from like a philosophical perspective,
the right response is the thing that makes people happy in the world.
It's taken companies a long time to get to this point,
but it was pretty clear to me.
And then, but I think also, like, what made ChatGPT possible was an early version of that,
which is, like, at least produce responses that raters will think are good.
So they got, like, raters to rate a set of responses,
and then trained the model to produce responses
that the raters would think were good.
And that got it to a point
where people could play with it.
And it really, the intelligence came out for the first time
as something that's useful for question answering.
And that's reinforcement learning with human feedback.
Yes, yes.
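[Editor's note: a simplified sketch of the rating idea just described: fit a reward model on pairwise rater preferences, which can then be used to steer generation. This is one generic recipe for learning from human feedback, not the exact procedure any particular lab used; the toy data and model sizes are assumptions.]

```python
# Train a small reward model on pairs where raters preferred one response over
# another, then use its scores to rank candidate responses. Random tensors stand
# in for real response embeddings and real preference labels.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a response embedding to a scalar 'how much would a rater like this'."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the rater-preferred response should score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)  # placeholder data
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Once trained on real preferences, the reward model's scores can guide generation,
# for example by reranking sampled responses or as the reward signal in RL fine-tuning.
```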
And it's fair to say that the current iterations of chatbots,
like ChatGPT with GPT-4, EVI, and maybe Claude, are all better
today than what Google had with LaMDA when you were there.
Oh, yeah. Oh, yeah. For sure.
We had Blake Lemoine on the show after he said it was sentient. And actually,
basically minutes after he was fired. This is the Google engineer who was fired after he
went public saying that the technology, LaMDA is their early chatbot, saying that the
technology was sentient. But my main takeaway from that is,
look, that question aside, this technology, and I wrote about this, this technology is super
powerful, and it's time to start paying attention to it. And then, of course, ChatGPT came out just
a few, I think a few months later. Yeah, I mean, I think the fact that the model could convince
people that it's sentient is really a milestone for sure. Yeah. I mean, you could also get
it to say whatever you wanted it to say, which I always found that kind of silly that you would think
it was sentient. Tell me you're sentient. Tell me you're sentient. Or, just like, instead of prompting it
to think it's a language model, you just prompt it that it's a monkey, and it's like,
oh yeah, I love being a monkey, it's great to swing around in trees. Like, well, if you're going to trust
that it's sentient when you say it's a language model, you should trust that it's a monkey. Yeah,
then tell it there's an eagle, uh, above us or a snake below, see what it says. It'll be like, I'm scared.
Yeah, yeah. So then talk a little bit about how, so, you're now working on, uh,
EVI and the API at Hume.
And that is going to allow, basically it's a chatbot that will understand your voice and
the expressions and the emotion coming through with it.
And there's some like pretty fascinating stats.
There's been, the average conversation length is 10 minutes across 100,000 unique conversations.
You've seen, in some circumstances, you know, 95 hours of total conversation with the bot.
It's a lot of talking with it.
So what is the idea here?
Is it, let's talk, like, initially on the consumer side.
What can it do that a ChatGPT can't?
And is this, like, something that's going to replace an Alexa or Siri?
Or do you imagine those types of bots will start to use
your technology and get better that way. Talk a little bit about your current effort.
Yeah, it's an API for developers, and we put together this demo that really is just, like,
a front end. It's mostly just what our API does, which is it takes in audio and
spits out audio. And the audio that it returns is an intelligent response with voice
modulations that reflect what it's saying and what you're saying.
And it also has some other things, like it has better end-of-turn detection because it understands
your voice. And all of that's built into this API. So it does all of this with a few lines of
code and you can build it into any interface. Now the demo, surprisingly, I didn't realize that
people would enjoy talking to it so much. Like the fact that people are not only testing it,
but also then they want to talk to it on average for 10 minutes is pretty insane to me. But then I, you know,
I played with it. I was like, actually, it is pretty awesome to talk to. But like it doesn't
have web search. It doesn't have persistence. It doesn't have, or like, meaning that you can't
resume chats after you log back in and stuff like that. And so it's, it doesn't even know your
name yet. Like, there's a lot of things that we will add to it to make it more viable because
people really want to use this. But I think what makes it appealing right now is that it's the first
chatbot, it's the first AI you can talk to
that sounds like it knows what it's saying, because the voice modulations are actually informed by the language model's understanding of your speech and voice. Which I think is very different than an alternative setup where you have transcription and a language model, and then you just do text-to-speech on the sentence itself that's being uttered, without any understanding of it or the context.
Where, like, those language models, they can sound realistic, but then they start to sound uncanny pretty quickly, because they don't change their voice modulations across sentences in a way that's actually meaningful. They do it in a way that's sort of a facade of meaning, and then you realize that it's just a facade, and you're like, okay, this is kind of uncanny. Yeah, the way I describe it is that, uh, with something like a ChatGPT or a Claude, I know I'm speaking with a bot. When I speak with your demo, it feels something, it feels
a step closer to speaking with a human.
Yeah, and I thought people would complain that it still mispronounces its words sometimes.
Like, there's some things we're improving.
But it seems like people recognize the bigger picture of what it can do.
Yeah, but let's talk about the bigger picture.
So first, the API thing is interesting.
I mean, it's not entirely new that there have been technologies that allow like a sales
person, for instance, to, like, sit on the phone and gauge the emotion of, like, the client on the other
end. So is this another one of those, or what would make this different? Well, I mean, I think humans can
do sales, but one thing that humans can't do is they can't be your personal AI that lives in your
device and does everything, right? Like, that's the future, being able to take out your
device and speak to it. And this is faster than typing. Speaking is 150 words per minute,
typing is like 40 or 50 words per minute. It understands your voice modulations. It understands
what makes you happy and sad. And so it produces a response immediately. That's better. That's like
the key. That I think is the future. So that's really what we're aiming to do is build interfaces
for all kinds of apps. And it needs to be something you trust. If you're going to ask these things
to do things, it needs to be a voice that you trust. It needs to be something that is
optimized for you. You're really focused on the agent slash assistant
case. Like, I thought maybe you would be a companion for somebody doing customer service, so that
as, like, they're on the phone with somebody, you know, they can sort of read the emotions that
the client is expressing on the phone, or the facial expressions. But it actually
seems like it's something different for you. It's that maybe this could be the customer service
agent itself or the agent that takes action on behalf of the customer to try to get the company to do
what they want. Am I reading that correctly in what you're saying? Yeah, like, let's imagine customer
service of the future, right? So right now you call a number and you get a person who's a customer
service agent, and they don't have any context on you. They don't see what's going on in your
device. They, you know, need to pull all these details. There's a lot of waiting, there's a lot
of investigating that needs to be done. I think the future is actually in the app itself, in
the product. You can just say, hey, this isn't working, and it figures it out for you.
Or, hey, like, I can't figure out how to do this, and it does it for you.
And so customer service, I think, is going to become part of products.
And it's really important for customer service, whether it's part of products, whether
it's something that is a call center.
It's really important for the person on the other line to understand what is actually going
to be a satisfying resolution to
your problem, based on what you're actually frustrated about, and to learn from people's frustrations,
and to learn from what actually is clear to people, and to learn from what's boring or interesting,
and to be more concise based on, like, you know, how quickly you can get somebody to be satisfied
with a response. Does it then make sense to have a human customer service agent or an AI customer
service agent because a human customer service agent will only learn from their set of experiences
with customers, whereas like an AI customer service agent could potentially learn from every single
customer service interaction that it has across a full customer base.
Yeah, I mean, ultimately, I think that it's inevitable that companies will switch to
AI-based customer service.
I mean, we can try to hold out and say, like, okay, let's augment human customer service agents
to be as powerful as possible.
And I think for a while that is going to be better than an AI. But ultimately, it's just almost
kind of crazy to think that AI can't do this job. And so there's no denying that, right?
And that's just an inevitable thing. But we want it to actually be something that customers
themselves, that users, want, not just something that's cheaper for companies. And how do you do that?
Users want a customer service agent that makes them happy.
And so if you optimize these models for the right thing and give them the right context, that's what you're going to get.
You're going to get something that understands you better that knows the context.
You can talk to it through your app.
Let's say you have an issue with your bank.
You can go into your banking app and you can chat with this thing.
And while you're looking at your banking app, it can bring you to different pages, right?
It can say, all right, this is how you transfer money to this person, and this is what's missing
from this field, and this is, you know, and like blah blah. Or if there's a bug, it can say, okay,
I can see that there's something wrong here, you know. But with all of that context, which
only an AI could process, and merging that with an understanding of what you're asking
that's based on your voice and language, the merger of those two things, I think, is something
that users and customers actually would prefer to have, because it's faster, it
understands them better, it gets to a resolution faster. So that's what you're saying in
terms of baking it into the product. It's not just the number you call. It's part of the product
itself. Yeah. And there's intermediate things. Like there could be a number that you call.
They can also bring up a window, there's companies that do that, and then they can fill things out for
you. Or in some cases, it is really a number and you just prefer to talk to somebody. But I do think
that when it's really customer service and it's really about a product, the product should be part
of the conversation, should be integrated with it.
Is it right that this should live in an app,
or should it really be in the operating system itself or both?
That's a really good question.
I think people ultimately are going to interact with a lot of AIs.
Some of them will be in apps, some of them will be in products.
But I think the ones that you really trust will be the ones that you've established a
relationship with, and you've been able to decipher over time what they're optimized for,
or it's made explicit.
Like with Hume, we're very explicit
that we're optimizing for you
to have a positive experience.
So if that's your personal AI that you trust,
you're going to want to bring that with you
to different places.
And so I think that that is,
it's not an operating system per se,
but it's something a developer
could build into any product with Hume.
So a developer can take that voice
that you trust and build into their product.
Then it competes effectively a little bit with Siri.
Like, why wouldn't it just be built into Siri,
or the Google Assistant, or whatever Google's calling it now, Gemini?
They'll probably have a different name by the time this airs.
Sort of similar, right?
Like, Siri could be something that understands the app.
But then who is it that builds that understanding into Siri?
Like the understanding of what the app can do and the function calls and the API calls that they can make
and, like, what different pages are.
It has to be the developer.
Like, Apple is not going to build an understanding of every app into Siri.
It's just too arduous.
You need to put that in, and the developer should control
how the interface actually works,
whether it's a voice interface or a graphical user interface,
but ultimately they'll be merged together.
So a developer could use, like maybe Siri will have an API
that developers can use to build into apps.
That's possible.
But then it's only for Apple products.
And it's not interoperable across different devices
that the app might be on, right?
Like if you have Twitter, it's going to work on your Twitter app,
but you want it to work on browser windows, right?
You want it to work in different contexts.
So I don't think that it necessarily needs to be a hardware manufacturer that builds this
or a company that has a specific ecosystem of products that it lives in.
It should be something that's more interoperable, in my mind.
Okay.
So your vision for the future is effectively this thing is going to get really good at not only reading our tones and facial expressions,
but also smart enough to navigate the web and the app ecosystem.
and also smart enough to deliver the information to you via voice in a way that is understandable and effective.
Yeah, and that's what we've made available to developers, right?
So developers can take our API and build in their own function calls and web search integrations.
We have our own back-end web search integration that we're going to launch.
And all of that can be done with a few lines of code.
So this is like, that is, yeah, exactly right.
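[Editor's note: a hypothetical shape of the "few lines of code" integration described here: stream audio up, receive expressive audio and messages back. The endpoint, message fields, and payload format are assumptions for illustration only; consult the actual API documentation for the real interface.]

```python
# Stream microphone audio up, play the expressive audio that comes back, and
# surface the transcript plus expression measures to the app. Everything named
# here is a placeholder, not a real interface.

import asyncio
import base64
import json
import websockets  # pip install websockets

VOICE_API_URL = "wss://example.invalid/v1/voice"  # placeholder endpoint

async def converse(mic_chunks, play_audio):
    """mic_chunks: async iterator of raw audio bytes; play_audio: callback for returned audio."""
    async with websockets.connect(VOICE_API_URL) as ws:

        async def send_audio():
            async for chunk in mic_chunks:
                await ws.send(json.dumps({
                    "type": "audio_input",
                    "data": base64.b64encode(chunk).decode(),
                }))

        async def receive():
            async for message in ws:
                event = json.loads(message)
                if event.get("type") == "audio_output":
                    play_audio(base64.b64decode(event["data"]))
                elif event.get("type") == "assistant_message":
                    # Hypothetical fields carrying the text and expression measures.
                    print(event.get("text"), event.get("expression_scores"))

        await asyncio.gather(send_audio(), receive())
```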
Okay, uh, you mentioned you might not need hardware for it. What do you think about this
new generation of hardware, like the Humane Pin and the Rabbit R1? Are these just kind of
going to be a flash in the pan, where we eventually just, um, get all these use cases on
our phone anyway? I do think it's worth people thinking about different form factors and testing
and seeing is this going to enhance the experience because AI brings so much
to the table that you don't necessarily want to be stuck with a smartphone, even though
smartphones might actually be ultimately the right form factor.
We don't know, right?
I don't think people really know.
So it's worth the experimentation.
That being said, I think we should be able to separate out, like, what are the hardware
requirements with what are the kind of the sensors that we need, right?
And we don't need two different devices with us that both satisfy the hardware requirements
or that both satisfy the same sensor requirements.
What should be there should be thought about like, hey, this piece of hardware satisfies these
sensor requirements, this piece of hardware satisfies these kind of compute requirements, and
they go together. I don't think the smartphone is going to be replaced anytime soon, but I think
the things that appropriately augment the smartphone could gradually take over. So we've talked
about how customer service and product and voice all merge. Are there going to be other
interesting applications that could come out of this? Maybe, you know, one example I've heard brought up
is like an AI companion for elderly people who are alone or, I don't know, maybe even young people
who are alone. What do you think about that? Then you really get into weird situations, right?
Because there can be, it's not crazy to say there can be romantic relationships that develop
with these things that already are. Yeah, I think we have to be pretty careful about how these things
are optimized. So, you know, you shouldn't be talking with an AI that's going to hassle you if you
haven't talked to it in a while. That's a bad sign. If you talk to an AI, if you open it up and it's
like, hey, I haven't heard from you in so long. What the hell? That's probably a bad AI. Like,
you should avoid that at all costs. Right. It should be something that's only, that only cares
about your well-being and doesn't represent itself as having any inherent desires other than making you
happy. That's the important criteria for this. So elderly care is a really great use case,
right? Like this is an area where you could augment and you could also establish very healthy
relationships that are helpful. I think elderly people, yes, they have like problems with loneliness,
but also they just have, like, everyday problems that they need solved. And they sort of go
together, like, they need help in both ways. And so if you have something that can satisfy the
everyday problems that they have, and does it in a way that gives them that, it might as well
satisfy some of the emotional challenges, if it's going to be satisfying the physical challenges
anyway. I think that's a great thing. Are you nervous that this might end up being something
that replaces jobs? Like, we obviously know that there's going to be some jobs that are replaced by
AI. But the question is, like, is it going to be many, many jobs at once or slow, or will
it go slow enough that we'll be able to adjust for, like, the added productivity that comes
to the economy, hence growth, hence more and different jobs? So are you concerned about the
speed here? Yeah, I mean, to me, the speed of the old jobs being
replaced is going to be unprecedented. That's true. But I do think that the speed of new jobs being created and the
accessibility of those new jobs is something that people greatly underestimate.
So, yeah, what do you think is going to happen there?
Well, for example, like, if you actually have AI that can program apps and it works for
basic kinds of ideas, then a lot of people can build apps.
A lot of people now are programmers effectively.
They don't necessarily have the depth that really good programmers are going to have, but they
can do basic things.
And that sort of unlocks a whole new set of jobs.
because we actually have a great deficit of software in the world.
And we have a deficit of hardware, too.
And people who can build things that solve problems.
There are so many problems in this world that are unsolved.
And particularly in spaces where you don't see a lot of computer scientists and engineers.
Such as?
Normally?
Education, healthcare,
therapy, you know, everyday kinds of spaces that you inhabit, like in retail stores, or
in, you know, maybe the fashion world. There's just places where
there's not enough computer scientists and engineers solving problems.
And so you could, and in many cases, there are businesses that are too
small for a computer scientist to go and build something that makes sense for them economically
to do.
But if you had an AI that could build the app, they would solve that problem.
So this actually kind of empowers smaller businesses in many ways.
There's businesses that, due to the fact that computer scientists are only used to working on
problems that scale, will out-compete the smaller mom-and-pop
shops. But I think that goes away if every small business can instantly create its own app just
by describing it and facilitate those processes and like do internal infra that makes sense for them
as a small business. And so I think, in that sense, AI actually creates a lot of jobs,
and those jobs are very accessible to many kinds of people. Do you think this is also something that can
empower us with robotics? You know, think about a humanoid robot. If it's just processing text
and not emotion, it's going to be fairly worthless. Maybe that's why we haven't seen any good ones.
But if it can respond to your emotion, know when you're done talking, start to, like, engage you as
a more human-like type of thing, then maybe you're a step closer to that. Yeah, I mean, with robotics,
the fact that it has a robotic body just increases the number of affordances it has to actually improve your emotional experience.
So it basically has this field of behaviors that it can do, and it should choose behaviors that make you happy, even without you asking.
And so that really is, the task is one of emotional intelligence.
That's really how it breaks down.
Now, just having language understanding, it gets you a long way.
And the fact that we can now train robots that can figure out for themselves how to, you know,
engage certain behaviors. It's great. I think that gets you a long way. But you still have to be
really, really explicit with your instructions with them. With empathic AI tools, you won't have
to be as explicit. It'll just be like, oh, I see that like you prefer these items to be arranged
in this way. It moves around your furniture. I mean, obviously, it'll clean. That's something that is
obvious that you don't need to ask it to do. But maybe you don't want it to be cleaning sometimes,
you like something being left on the table, and it kind of needs to
have empathy to figure that out. And so that's, I think, the future. Is this why
Siri and Alexa are so bad? They've just got zero contemplation of what human speech
means and zero contemplation of emotion. And if that's the case, do you think we're
about to see, I mean, Apple has this big event coming up in June at WWDC where it's
supposed to be an AI event. I'm curious if they're working on something like this. But if this is
the case, do you think we're about to see a vast improvement in these types of voice assistants?
Yeah, I mean, it's inevitable for sure. I think that every company is going to build a voice
assistant. Apple already happens to have one, but the ones that don't are going to build voice
assistants, and the ones that have them, they'll get a lot better. I mean, they're kind of based on
legacy technologies right now.
And that's about to change in a dramatic way.
So how soon do you think it's going to be until Hume is acquired? Or are you not sold on that?
Well, I mean, I think like I said before, there's room for a company to provide this in a way
that's agnostic to the hardware environment, the products, the company that owns the products.
It's agnostic to the company is building the frontier models because we,
can use any frontier model to give you a state of the art response. Like if a company building
one of the largest language models build something like what we have, they're going to be
kind of wedded to using their own language model exclusively. But we actually use all kinds of tools.
So we have our own language model that is extremely good at conversation. But it can use all
kinds of different tools and it's agnostic to where they come from because we're not part
of those companies. So I think that there's a huge amount of value in that. So I don't necessarily
think that we need to be acquired. If there's a company that is value-aligned, then we,
yeah, let's see. Well, I look forward to writing the news story when that happens. But in the
meantime, um, it is great to be able to speak with you and hear about this, like, really fascinating
area of technology. I hope we can do it again. I feel like we've really just scratched the surface here,
but I appreciate you coming on, Alan, and sharing so much about this discipline and sort of how it developed
and where it's going. Because it does seem like, as we get toward more of an agent-style approach
in AI, more of a voice-style approach, this is going to be the ground, like the table stakes,
is building in this type of technology.
And it's great to have a preview of it and really like not even a preview, but a conversation
about what's going on today before the rest of the world catches on.
So thank you so much for coming on the show.
Of course.
Had a great time.
Thank you.
Thanks so much.
All right, everybody.
Thank you for listening.
We'll be back on Friday with a new show.
Breaking down the news,
Ranjan Roy is back.
He was out in India,
and we're going to talk about the trip,
and he'll be back with us on Friday.
Until then, we'll see you next time on Big Technology Podcast.