Lex Fridman Podcast - Rohit Prasad: Amazon Alexa and Conversational AI
Episode Date: December 14, 2019

Rohit Prasad is the vice president and head scientist of Amazon Alexa and one of its original creators. This conversation is part of the Artificial Intelligence podcast. If you would like to get more information about this podcast, go to https://lexfridman.com/ai or connect with @lexfridman on Twitter, LinkedIn, Facebook, Medium, or YouTube, where you can watch the video versions of these conversations. If you enjoy the podcast, please rate it 5 stars on Apple Podcasts or support it on Patreon. This episode is presented by Cash App. Download it (App Store, Google Play), use code "LexPodcast". The episode is also supported by ZipRecruiter. Try it: http://ziprecruiter.com/lexpod

Here's the outline of the episode. On some podcast players you should be able to click the timestamp to jump to that time.

00:00 - Introduction
04:34 - Her
06:31 - Human-like aspects of smart assistants
08:39 - Test of intelligence
13:04 - Alexa Prize
21:35 - What does it take to win the Alexa Prize?
27:24 - Embodiment and the essence of Alexa
34:35 - Personality
36:23 - Personalization
38:49 - Alexa's backstory from her perspective
40:35 - Trust in human-AI relations
44:00 - Privacy
47:45 - Is Alexa listening?
53:51 - How Alexa started
54:51 - Solving far-field speech recognition and intent understanding
1:11:51 - Alexa main categories of skills
1:13:19 - Conversation intent modeling
1:17:47 - Alexa memory and long-term learning
1:22:50 - Making Alexa sound more natural
1:27:16 - Open problems for Alexa and conversational AI
1:29:26 - Emotion recognition from audio and video
1:30:53 - Deep learning and reasoning
1:36:26 - Future of Alexa
1:41:47 - The big picture of conversational AI
Transcript
The following is a conversation with Rohit Prasad.
He's the vice president and head scientist
of Amazon Alexa and one of its original creators.
The Alexa team embodies some of the most challenging,
incredible, impactful, and inspiring work
that is done in AI today.
The team has to both solve problems
at the cutting edge of natural language processing
and provide a trustworthy, secure, and enjoyable experience
to millions of people.
This is where state-of-the-art methods in computer science meet the challenges of real-world
engineering.
In many ways, Alexa and the other voice assistants are the voices of artificial intelligence
to millions of people and an introduction to AI for people who have only encountered it
in science fiction.
This is an important and exciting opportunity. So the work that Rohit and the Alexa team are doing
is an inspiration to me and to many researchers and engineers in the AI community.
This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube,
give it 5 stars
on Apple Podcast, support it on Patreon, or simply connect with me on Twitter.
at Lex Fridman, spelled F-R-I-D-M-A-N. If you leave a review on Apple Podcasts especially,
but also Castbox, or comment on YouTube, consider mentioning topics, people, ideas, questions,
quotes, and science, tech or philosophy that you find interesting.
And I'll read them in this podcast.
I won't call out names, but I love comments with kindness and thoughtfulness in them, so I thought I'd share them.
Someone on YouTube highlighted a quote from the conversation with Ray Dalio,
where he said that you have to appreciate all the different ways that people can be A players. This connected with me too.
On teams of engineers, it's easy to think that raw productivity is the measure of excellence,
but there are others.
I worked with people who brought a smile to my face every time I got to work in the morning.
Their contribution to the team is immeasurable.
I recently started doing podcast ads at the end of the introduction.
I'll do one or two minutes after introducing the episode and never any ads in the middle
that break the flow of the conversation.
I hope that works for you.
It doesn't hurt the listening experience.
This show is presented by CashApp, the number one finance app in the App Store.
I personally use CashApp to send money to friends, but you can also use it to buy, sell,
and deposit Bitcoin in just seconds.
Cash App also has a new investing feature.
You can buy fractions of a stock, say, one dollar's worth, no matter what the stock price is.
Brokerage services are provided by Cash App Investing, a subsidiary of Square, and member SIPC.
I'm excited to be working with CashApp to support one of my favorite
organizations called FIRST, best known for their FIRST Robotics and LEGO competitions.
They educate and inspire hundreds of thousands of students in over 110 countries and have
a perfect rating on Charity Navigator, which means the donated money is used to maximum
effectiveness. When you get Cash App from the App Store or Google Play
and use code LexPodcast, you'll get $10, and Cash App will also donate $10 to FIRST,
which again is an organization that I've personally seen inspire girls and boys to dream
of engineering a better world. This podcast is also supported by ZipRecruiter.
Hiring great people is hard, and to me is one of the most important elements of a successful
mission-driven team.
I've been fortunate to be a part of and lead several great engineering teams.
The hiring I've done in the past was mostly through tools we built ourselves, but reinventing
the wheel was painful.
Zip Recruiter is a tool that's already available for you.
It seeks to make hiring simple, fast, and smart.
For example, Codable co-founder Gretchen Hebner used ZipRecruiter to find a new game artist
to join her education tech company.
By using ZipRecruiter's screening questions to filter candidates, Gretchen found it easier
to focus on the best candidates, finally hiring the perfect person for the role in less than two
weeks from start to finish. ZipRecruiter, the smartest way to hire. See why ZipRecruiter
is effective for businesses of all sizes by signing up, as I did, for free at
ziprecruiter.com slash lexpod. That's ziprecruiter.com slash lexpod.
And now here's my conversation with Rohit Prasad. In the movie Her, I'm not sure if you've ever seen it, a human falls in love with the voice of an AI
system. Let's start at the highest philosophical level before we get to deep learning and some of
the fun things. Do you think what the movie Her shows is within our reach?
I think not specifically about Her, but I think what we are seeing is a massive increase
in adoption of AI assistants, or AI, in all parts of our social fabric. And what I do believe is that the utility
these AIs provide, some of the functionalities that are shown,
are absolutely within reach.
So some of the functionalities in terms of the interactive
elements, but in terms of the deep connection
that's purely voice-based,
do you think such a close connection is possible with voice alone?
It's been a while since I saw Her, but I would say, in terms of
interactions which are human-like in these AI assistants,
you have to value what is also superhuman.
We as humans can be in only one place. AI
assistants can be multiple places at the same time. One with you on your mobile
device, one at your home, one at work. So you have to respect these superhuman
capabilities too. Plus as humans we have certain attributes we're very
good at, very good at reasoning. AI assistants are not yet there, but in the realm of AI assistants, what they're great at is
computation and memory; it's infinite and pure.
These are the attributes you have to start respecting.
So I think the comparison with human like versus the other aspect, which is also superhuman,
has to be taken into consideration.
So I think we need to elevate the discussion to not just human-like.
So there are certainly elements we just mentioned: Alexa is everywhere, the computation, so to speak.
So this is a much bigger infrastructure than just the thing that sits there in the room
with you. But it certainly feels to us mere humans that there's just another little
creature there when you're interacting with it. You're not interacting with the entirety of the infrastructure, you're interacting with the device. The feeling is,
okay, sure, we anthropomorphize things, but that feeling is still there. So
in the purity of the interaction with a smart assistant,
What do you think we look for in that interaction?
I think certain interactions will be very much where
it does feel like a human, because it has a persona of its own.
And in certain ones, it wouldn't be.
So, I think a simple example to think of it is if you're walking through the house
and you just want to turn your lights on and off and you're issuing a command, that's not very much a human-like interaction,
and that's where the AI shouldn't come back and have a conversation with you.
Just, it should simply complete that command.
So I think the blend of, we have to think about, this is not human-human alone, it is a
human-machine interaction, and certain aspects of humans are needed, and certain aspects, in certain situations, are demanded to be like a machine.
So I told you it's going to be philosophical in parts.
What is the difference between human and machine in that interaction, when we interact, two humans,
especially those who are friends and loved ones,
versus
you and a machine that you're also close with?
I think you have to think about the roles the AI plays, right?
So it differs from customer to customer, from situation to situation,
especially, I can speak from an Alexa perspective. It is a companion, a friend at times,
an assistant, an advisor down the line.
I think most AIs will have this kind of attributes, and it will be very situational in nature.
Where does the boundary lie? I think the boundary depends on the exact context in which you're interacting with the AI.
The depth and the richness of natural language conversation
has, by Alan Turing, been used to try to define what it means to be intelligent.
You know, there's a lot of criticism of that kind of test, but what do you think is a good
test of intelligence in your view in the context of the Turing test? And Alexa, with the Alexa
prize, this whole realm, do you think about this human intelligence, what it means to define it, what it means to reach that level?
I do think the ability to converse is a sign of ultimate intelligence. There is no question about it.
So if you think about all aspects of humans, there are sensors we have,
and those are basically a data collection mechanism.
And based on that, we make some decisions with our sensory brains.
From that perspective, I think there are elements we have to talk about how we sense the world,
and then how we act based on what we sense.
Those elements clearly machines have.
But then there's the other aspects of computation that is way better.
I also mentioned about memory again in terms of being near infinite depending on the storage capacity you have.
And the retrieval can be extremely fast and pure in terms of like there's no ambiguity of who did I see when.
Right. I mean, machines can remember that quite well.
So it again on a philosophical level, I do subscribe to the fact
that to be able to converse and as part of that to be able to reason based on the world knowledge
you've acquired and the sensory knowledge that is there is definitely very much the essence of
intelligence. But intelligence can go beyond human level intelligence based on what machines are getting
capable of.
So, what do you think maybe stepping outside of Alexa broadly as an AI field?
What do you think is a good test of intelligence?
Put it another way outside of Alexa because so much of Alexa is a product, is an experience
for the customer.
On the research side, what would impress the heck out of you if you saw? What is the test where you said, wow, this thing is now starting
to encroach into the realm of what we loosely think of as human intelligence.
So, we think of it as AGI, human intelligence, all together in some sense. And I think we are quite far from that.
I think an unbiased view I have is that Alexa's
intelligence capabilities are a great test.
I think of it as, there are many other proof points,
like self-driving cars, game playing, like Go or chess.
Let's take those two as an example. They clearly require a lot of data-driven learning and intelligence,
but it's not as hard a problem as an AI conversing with humans to accomplish certain tasks, or open-domain chat, as you mentioned,
like the Alexa Prize. In
those settings, the key difference is that the end goal is not defined,
unlike game playing, you also do not know exactly what state you are in in a particular
goal completion scenario. In certain sense, sometimes you can if it is a simple goal, but
if you're even certain examples like planning a weekend or you can imagine how many things change along the way.
You look at the weather, you may change your mind,
and you change the destination,
or you want to catch a particular event,
and then you decide, no, I want this other event,
I want to go to.
So these dimensions of how many different steps
are possible when you're conversing
as a human with a machine makes it an extremely
daunting problem, and I think it is the ultimate test for intelligence. And don't you think
natural language is enough to prove that, the conversation? From a scientific standpoint, natural language
is a great test, but I would go beyond. I don't want to limit it to
natural language as simply understanding an intent or parsing for entities and
so forth. We are really talking about dialogue. So I would say human-machine
dialogue is definitely one of the best tests of intelligence. So can you briefly
speak to the Alexa Prize for people who are not familiar with it and also just maybe
where things stand, and what have you learned, and what's surprising? What have you seen that's surprising from this incredible competition?
Absolutely. It's a very exciting competition
The Alexa Prize is essentially a grand challenge in
conversational artificial intelligence, where we threw the challenge out to
universities who do active research in the field to say, can you build what we call a social bot
that can converse with you coherently and engagingly for 20 minutes? That is an extremely hard
challenge. Talking to someone who you're meeting for the first time, or even if you've met them quite often,
and speaking for 20 minutes on any topic, with an evolving nature of topics, is super hard. We have completed two
successful years of the competition. The first was won by the University of Washington, the second by the
University of California. We are in our third instance. We have an extremely strong cohort of 10 teams,
and the third instance of the Alexa Prize is underway now.
And we are seeing a constant evolution.
The first year was definitely a learning experience.
It was a lot of things to be put together.
We had to build a lot of infrastructure
to enable these universities to be able
to build magical experiences and do
high quality research. Just a few quick questions, sorry, for the introduction. What does failure
look like in the 20-minute session? So what does it mean to fail, not to reach the 20-minute
mark? Awesome question. So, first of all, I forgot to mention one more detail.
It's not just 20 minutes, but the quality of the conversation, too, that matters.
And the beauty of this competition before I answer that question on what failure means,
is first that you actually converse with millions and millions of customers as the social bots.
So during the judging phases, there are multiple phases.
Before we get to the finals, which is a very controlled judging in a situation where we have,
we bring in judges and we have interactors who interact with these social bots.
That is a much more controlled setting, but till the point we get to the finals,
all the judging is essentially by the customers of Alexa.
And there you basically rate on a simple question how good your experience was.
So that's where we are not testing for a 20 minute boundary
being crossed, because you do want it to be very much like a clear-cut
winner being chosen, and it's an absolute bar.
So, did you really break that 20-minute barrier
is why we have to test it in a more controlled setting
with interactors, essentially, and see how the conversation goes. So this is why it's a subtle difference between how
it's being tested in the field with real customers versus in the lab to award the prize. So on the latter
one, what it means is that essentially, there are three judges, and two of them have to say this conversation has
stalled, essentially. Got it. And the judges are human experts. Judges are human experts.
Okay, great. So we're in the third year. So what's been the evolution? How far is it? In the
DARPA Challenge, in the first year, you know, the autonomous vehicles, nobody finished. In the second year, a few
more finished in the desert. So how far along in this, I would say, much harder challenge
are we?
This challenge has come a long way, to the extent that we're definitely not close to the
20-minute barrier being met with coherent and engaging conversation. I think we are still
five to 10 years away in that horizon to complete that.
But the progress is immense.
What you're finding is the accuracy
and what kind of responses these social bots generate
is getting better and better.
What's even amazing to see that now there's humor coming in.
The bots are quite awesome.
You're talking about the ultimate sign of
intelligence. I think humor is a very high bar in terms of what it takes to create humor.
And I don't mean just being goofy. I really mean good sense of humor is also a sign of intelligence
in my mind and something very hard to do. So these social bots are now exploring not only what we think of natural language abilities,
but also personality attributes and aspects of when to inject an appropriate joke, when
you don't know the domain, how you come back with something more intelligible so that you
can continue the conversation.
If you and I are talking about AI and we are domain experts, we can speak to it, but if
you suddenly switch to a topic that I don't know of, how do I change
that conversation? So you're starting to notice these elements as well. And that's coming
partly from the nature of the 20-minute challenge, that people are getting quite
clever on how to really converse and essentially mask some of the understanding defects that exist.
So some of this, this is not Alexa the product.
This is somewhat for fun, for research,
for innovation and so on.
I have a question sort of in this modern era,
there's a lot of, if you look at Twitter and Facebook
and so on, there's discourse, public discourse going on.
And some things that are a little bit too edgy,
people get blocked and so on.
I'm just asking out of curiosity,
are people in this context pushing the limits?
Is anyone using the F word?
Is anyone sort of pushing back sort of arguing,
I guess I should say, as part of the dialogue,
to really draw people
in.
First of all, let me just back up a bit and think of why we are doing this, right?
So you said it's fun.
I think fun is more part of the engaging part for customers.
It is one of the most used skills as well in our skill store.
But that apart, the real goal was essentially, what was happening is, with
a lot of AI research moving to industry. We felt that academia has the risk of not being
able to have the same resources at disposal that we have, which is lots of data, massive
computing power, and clear ways to test these AI advances with real customer benefits.
So we brought all these three together in the Alexa Prize.
That's why it's one of my favorite projects in Amazon.
And with that, the secondary effect is,
yes, it has become engaging for our customers as well.
We're not there in terms of where we want it to be, right?
But it's a huge progress.
But coming back to your question on,
how do the
conversations evolve? Yes, there is some natural attributes of what you said in terms of argument
and some amount of swearing. The way we take care of that is that there is a sensitive filter we have
built that detects keywords. It's a little more than keywords; of course, there's a keyword
base too, but there's more, in terms of these words can be very contextual, as you can see,
and also the topic can be something
that you don't want a conversation to happen,
because this is a communal device as well.
A lot of people use these devices.
So we have put a lot of guardrails
for the conversation to be more useful for advancing AI,
and not so much of these other issues you attributed
to what's happening in the AI field as well.
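As a rough illustration of the kind of two-stage guardrail described above, here is a minimal sketch in Python. The keyword list, topic list, and topic classifier are hypothetical placeholders for illustration, not Amazon's actual filter:

```python
# Hypothetical sketch of a two-stage sensitive-content filter: a fast keyword
# pass followed by a contextual/topic check. Keywords, topics, and the
# classifier are illustrative placeholders, not Amazon's implementation.

BLOCKED_KEYWORDS = {"badword1", "badword2"}        # placeholder keyword list
SENSITIVE_TOPICS = {"violence", "medical_advice"}  # placeholder topic list

def topic_classifier(utterance: str) -> str:
    """Stand-in for a learned topic classifier over the whole utterance."""
    return "violence" if "fight" in utterance.lower() else "other"

def is_allowed(utterance: str) -> bool:
    words = set(utterance.lower().split())
    if words & BLOCKED_KEYWORDS:                          # stage 1: keyword match
        return False
    if topic_classifier(utterance) in SENSITIVE_TOPICS:   # stage 2: contextual topic
        return False
    return True

print(is_allowed("let's chat about the weather"))  # True
print(is_allowed("let's talk about a fight"))      # False
```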
Right, so this is actually a serious opportunity.
I didn't use the right word fun.
I think it's an open opportunity to do some of the best innovation in conversational
agents in the world.
Absolutely.
Why just universities?
Because, as I said, it's really about young minds.
Young minds. It's also, if you think about the other aspect of where the whole
industry is moving with AI, there's a dearth of talent given the demands. So you do want
universities to have a clear place where they
can invent and research and not fall behind, where they can't motivate students. Imagine if
all grad students left to industry like us, or faculty members, which has happened
too. So this is a way that, if you're passionate about the field and you feel industry and academia need to work well together, this is a great example and a great way for universities to participate.
So what do you think it takes to build a system that wins the Alexa Prize?
I think you have to start focusing on
aspects of reasoning.
Right now there are still more lookups of what intents the customer
is asking for, and responding to those, rather than really reasoning about the elements of
the conversation. For instance, if the conversation is about games and it's
about a recent sports event, there's so much context involved and you have to understand
the entities that are being mentioned so that the conversation is coherent rather than
suddenly just switching to some fact you know about a sports entity and just
relaying that rather than understanding the true context of the game.
If you just said, I learned this fun fact about Tom Brady rather than really say how he
played the game the previous night, then the conversation is not really that intelligent.
You have to go to more reasoning elements of understanding the context of the dialogue
and giving more appropriate responses which tells you that we are still quite far because a lot of
times it's more facts being looked up and something that's close enough as an answer but not really
the answer. So that is where the research needs to go more and actual true understanding and reasoning.
And that's why I feel it's a great way to do it
because you have an engaged set of users working
to help make these AI advances happen in this case.
You mentioned customers there quite a bit,
and there's a skill.
What is the experience for the user that is helping?
Just to clarify, as far
as I understand, this skill is a standalone thing for the Alexa Prize.
I mean, it's focused on the Alexa Prize.
It's not you ordering certain things, like, oh, I'm going to check the weather,
or playing Spotify.
It's a separate skill.
Exactly.
And so you're focused on helping that, I don't know, how do people, how do customers think of it?
Are they having fun?
Are they helping to teach the system?
What's the experience like?
I think it's both actually.
And let me tell you how you invoke this skill.
So all you have to say, Alexa, let's chat.
And then the first time you say Alexa, let's chat,
it comes back with a clear message
that you're interacting with one of this,
you know, three social bots. And there's a clear, so you know exactly how you interact, right?
And that is why it's very transparent.
You are being asked to help, right?
And we have a lot of mechanisms where, when we are in the first phase, the feedback phase,
we send a lot of emails to our customers, and then they know that
the team needs a lot of interactions to improve the accuracy of the system. So we
know we have a lot of customers who really want to help these university bots, and they are
conversing with them. And some are just having fun, just saying, Alexa, let's chat.
And also some adversarial behavior, to see how much you understand
as a social bot.
So I think we have a good, healthy mix of all three situations.
So, if we talk about solving the Alexa challenge, the Alexa Prize, what does
the data set of really engaging, pleasant conversations look like?
Because if we think of this as a supervised learning problem,
I don't know if it has to be, but if it does, maybe you can comment on that.
Do you think there needs to be a data set of what it means to be an engaging, successful,
fulfilling conversation? I think that's part of the research question here.
This was, I think, we at least got the first part right,
which is, have a way for universities
to build and test in a real world setting.
Now you're asking in terms of the next phase of questions,
which we are also asking, by the way,
what does success look like from an optimization function?
That's what you're asking in terms of,
we as researchers are used to having a great
corpus of annotated data and then, you know, sort of tuning our algorithms on those,
right? And fortunately and unfortunately,
in this world of Alexa Prize, that is not the way we are going
after it. So you have to focus more on learning based on life feedback.
That is another element that's unique where just now I started with giving you how you ingress
and experience this capability as a customer.
What happens when you're done?
So they ask you a simple question on a scale of one to five:
how likely are you to interact with this social bot again?
That is a good feedback and customers can also leave more open-ended feedback.
And I think partly that to me is one part of the question you're asking,
which I'm saying is a mental model shift that as researchers also,
you have to change your mindset, that this is not a DARPA evaluation or an NSF-funded study where you have a nice corpus.
This is where it's real world. You have real data.
The scale is amazing. That's a beautiful thing. And then the customer, the user, can quit the conversation at any time.
That is also a signal for how good you were at that point.
So, and then on the scale of one to five, one to three, do they say how likely are you, or is it just binary?
One to five. One to five. Wow, okay. That's such a beautifully constructed challenge. Okay.
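To make the live-feedback idea concrete, here is a toy sketch of how the one-to-five rating and the point at which a user quits might be folded into a single score for comparing social bots. The 0.7/0.3 weighting and the turn normalization are assumptions for illustration, not the actual Alexa Prize metric:

```python
# Toy sketch: fold the live 1-5 rating and conversation length into one
# reward for comparing social bots. Weights and normalization are assumed.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Conversation:
    rating: float   # answer to "how likely are you to interact with this social bot again?" (1-5)
    turns: int      # number of turns before the customer quit

def reward(c: Conversation, max_turns: int = 40) -> float:
    rating_part = (c.rating - 1) / 4                  # normalize 1-5 to 0-1
    length_part = min(c.turns, max_turns) / max_turns # staying longer is a positive signal
    return 0.7 * rating_part + 0.3 * length_part

def bot_score(conversations: list[Conversation]) -> float:
    return mean(reward(c) for c in conversations)

print(bot_score([Conversation(4.0, 25), Conversation(2.0, 5)]))
```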
You said the only way to make a smart assistant really smart
is to give it eyes and let it explore the world.
I'm not sure you might have been taken out of context,
but can you comment on that?
Can you elaborate on that idea?
I personally also find that idea super exciting
from a social robotics,
personal robotics perspective.
Yeah, a lot of things do get taken out of context.
This particular one was just a philosophical discussion
we were having on terms of what does intelligence look like.
And the context was in terms of learning,
I think, just we said we as humans
are empowered with many different sensory abilities.
I do believe that eyes are an important aspect
of it in terms of, if you think about how we as humans learn, it is quite complex and
it's also not unimodal that you are fed a ton of text or audio and you just learn that
way. No, you learn by experience, you learn by seeing, you're taught by humans,
and we are very efficient in how we learn. Machines on the contrary are very inefficient
on how they learn, especially these AI's. I think the next wave of research is going
to be with less data, not just with less human-labeled data,
but also with a lot of weak supervision.
And where you can increase the learning rate,
I don't mean less data in terms of not having
a lot of data to learn from,
that we are generating so much data,
but it is more about the aspect of how fast can you learn.
So improving the quality of the data,
the quality of data and the
learning process? I think more on the learning process. I think we as humans learn with
a lot of noisy data, right? And I think that's the part that I don't think should change. What should
change is how we learn. Right. So you mentioned supervised learning; we are making transformative shifts toward more unsupervised learning, more weak supervision. Those are the key aspects
of how to learn. And I think in that setting, I hope you agree with me that having
other senses is very crucial in terms of how you learn.
So absolutely. And from a machine learning perspective, which I hope we'll get a chance to talk about, there are a few aspects that are fascinating there, but to stick on the point of sort of a body, you know, embodiment.
So Alexa has a body, a very minimalistic, beautiful interface, where there's a ring and so on.
I mean, I'm not sure of all the flavors of the devices that Alexa lives on, but there's a minimalistic
basic interface. And nevertheless, we humans, well, I have a room with all kinds of robots all over, everywhere.
So what do you think the future of Alexa looks like if it begins to shift what its body looks like? And maybe
beyond Alexa, what do you think the different devices in the home look like as they start to embody
their intelligence more and more?
What do you think that looks like philosophically,
in the future?
I think let's look at what's happening today.
You mentioned, I think, other devices as in Amazon devices, but I also wanted to point out Alexa is already
integrated into a lot of third-party devices,
which also come in lots of forms and shapes.
Some in robots, right?
Some in microwaves, some in appliances
that you use in everyday life.
So I think it is, it's not just the shape Alexa takes
in terms of form factors, but it's also where all it's available.
It's getting in cars, it's getting in different appliances and homes, even toothbrushes.
So I think you have to think about it as not a physical assistant.
It will be in some embodiment as you said, we already have these nice devices.
But I think it's also important to think of it. It is a virtual assistant. It is superhuman in
the sense that it is in multiple places at the same time. So I think the actual embodiment,
in some sense, to me doesn't matter. I think you have to think of it as not as human-like
and more of what its capabilities are
that derive a lot of benefit for customers
and how there are different ways
to delight customers in different experiences.
And I think I'm a big fan of it
not being just human-like. It should be
human-like in certain situations; the Alexa Prize social bot, in terms of conversation, is a great
way to look at it. But there are other scenarios where human-like, I think, is underselling
the abilities of this AI.
So if I could trivialize what we're talking about.
So if you look at the way Steve Jobs thought
about the interaction with the device that Apple produced,
there was an extreme focus on controlling the experience
by making sure there are only these Apple-produced devices.
You see the voice of Alexa
taking all kinds of forms,
depending on what the customers want.
And that means it can be anywhere.
From the microwave to a vacuum cleaner to the home.
And so on, the voice is the essential element of the interaction.
I think voice is an essence.
It's not all, but it's a key aspect.
I think to your question in terms of you should be able to recognize Alexa.
And that's a huge problem.
I think in terms of a huge scientific problem, I should say, like what are the traits?
What makes it sound like Alexa, especially in different settings, and especially if it's
primarily voice.
But Alexa is not just voice either, right?
I mean, we have devices with a screen.
Now you're seeing just other behaviors of Alexa.
So I think they're in very early stages of what that means.
And this will be an important topic for the following years.
But I do believe that being able to recognize and tell
when it's Alexa versus it's not,
is going to be important from an Alexa perspective.
I'm not speaking for the entire AI community,
but I think attribution,
and as we go into more of understanding who did what, that identity of the AI is crucial
in the coming world.
I think from the broad AI community perspective, that's also a fascinating problem. So basically
if I close my eyes and listen to the voice, what would it take for me to recognize that this is Alexa?
Exactly. Or at least the Alexa that I've come to know from my personal experience in my home through my interactions.
Yeah. And the Alexa here in the US is very different from the Alexa in the UK
and the Alexa in India, even though they are all speaking English, or the Australian version.
So again, now think about when you go into a different culture,
a different community, and you travel there, how do you recognize Alexa? I think these are super hard
questions, actually. So there's a team that works on personality. So we talked about those
different flavors of what it means, culturally speaking, in the UK, the US, and so on.
So the problem that we just stated, which is fascinating, how do we make it purely
recognizable that it's Alexa
assuming that the qualities of the voice are not sufficient?
It's also the content of what is being said. How do we do that? How does the personality come into play?
What does that research look like? I mean, it's such a fascinating question.
It's very fascinating. Folks
from both the UX background and human factors
are looking at these aspects and these exact questions.
But I'll definitely say it's not just how it sounds,
the choice of words, the tone,
not just, I mean, the voice identity of it,
but the tone matters, the speed matters,
how you speak, how you enunciate words, what choice of words you are using, how
terse you are or how lengthy your explanations are, all of these are factors.
And you also mentioned something crucial: you may have personalized Alexa to some extent, in your home or in the devices you're interacting with.
So you as an individual, how you prefer Alexa to sound, can be different from how I prefer it.
And the amount of customizability you want to give is also a key debate we always have.
But I do want to point
out that it's more than the voice actor that recorded it and it sounding like that actor. It is more about
the choices of words, the attributes of tonality, the volume in terms of how you raise your pitch and
so forth, all of that matters. This is such a fascinating problem from a product perspective.
I could
see those debates just happening inside of the Alexa team of how much personalization
do you do for the specific customer because you're taking a risk if you over-personalize.
Because if you create a personality for a million people, you can test that
better, you can create a rich, fulfilling experience
that will do well. But the more you personalize it, the less you can test it, the less you
can know that it's a great experience. So how much personalization, what's the right balance?
I think the right balance depends on the customer. Give them the control. So I'll say, I think
the more control you give customers, the better it is for everyone.
And I'll give you some key personalization features.
I think we have a feature called Remember this, which is where you can tell Alexa to
remember something.
There you have an explicit sort of control in the customer's hands, because they have to
say, Alexa, remember X, Y, Z.
What kind of things would that be used for?
So, anything you like.
I have stored my tire specs for my car,
because it's so hard to go and find and see what they are
when you're having some issues.
I store my mileage plan numbers for all the frequent
flyer programs, where sometimes I'm just looking for them
and they're not handy.
So those are my own personal choices I've made for Alexa to
remember something on my behalf, right?
So again, I think the choice was be explicit about how you provide
that to a customer as a control.
So I think these are the aspects of what you do. Like, think about
where we can use speaker recognition capabilities:
if you taught Alexa that you are Lex
and this person in your household is person two,
then you can personalize the experiences.
Again, the CX, the customer experience patterns,
are very clear and transparent about
when a personalization action is happening.
And then you have other ways, like explicit control right now through your app: among
your multiple service providers, let's say for music, which one is your preferred one. So when you
say, play Sting, depending on whether you have preferred Spotify or Amazon Music or Apple Music,
the decision is made where to play it from.
So what's Alexa's backstory from her perspective?
I remember just asking, as probably a lot of us have, just the basic questions about love and so on
of Alexa, just to see what the answer would be. It feels like there's a little bit of a backstory, like
there's a little bit of personality, but not too much. Does Alexa
have a metaphysical presence in this human universe we live in, or is it something
more ambiguous? Is there a past? Is there a birth? Is there a family kind of idea, even
for joking purposes and so on? I think, well, it does tell you. I should
double check this, but if you said, when were you born, I think we do respond. I need to
double check that, but I'm pretty positive about it, because I think
I've tested that. But that's like saying, I was born, like your brand of champagne,
in whatever year. Good thing. Yeah. So in terms of the metaphysical, I think it's early.
Does it have the historic knowledge about herself to be able to do that? Have we crossed
that boundary? Not yet, right? We have thought about it quite a bit,
but I wouldn't say that we have come to a clear decision
in terms of what it should look like,
but you can imagine though,
and I bring this back to the Alexa Prize social bot one,
there you will start seeing some of that.
Like these bots have their identity,
and in terms of that, you may find,
this is such a great research topic
that some academia team may think of these problems and start solving them too.
So let me ask a question. It's kind of difficult, I think, but it feels fascinating to me because
I'm fascinated with psychology. It feels that the more personality you have,
the more dangerous it is from a customer perspective, for a product. If you want to create
a product that's useful, by dangerous, I mean creating an experience that upsets me.
And so, how do you get that right? Because if you look at the relationships, maybe I'm just a screwed up Russian, but if
you look at the human to human relations, some of our deepest relationships have fights,
have tension, have the push and pull, have a little flavor in them.
Do you want to have such flavor in an interaction with Alexa?
How do you think about that?
So there's one other common thing that you didn't say, but we think of it as paramount
for any deep relationship.
That's trust.
Trust.
So I think if you take every attribute you said, a fight, some tension, it's all healthy.
But what is sort of
non-negotiable in this instance is trust.
And I think the bar to earn customer trust for AI is very high
in some sense more than a human.
It's not just about personal information or your data.
It's also about your actions on a daily basis.
How trustworthy are you in terms of consistency,
in terms of how accurate are you in understanding me? Like if you're talking to a person on the phone,
if you have a problem with, let's say, your internet or something, and the person is not understanding,
you lose trust right away. You don't want to talk to that person. That whole example gets amplified
by a factor of 10 because when you're a human interacting with an AI,
you have a certain expectation.
Either you expect it to be very intelligent
and then you get upset: why is it behaving this way?
Or you expect it to be not so intelligent,
and when it surprises you, you're like,
really, you're trying to be too smart?
So I think we grapple with these hard questions as well.
But I think the key is,
actions need
to be trustworthy from these AI's, not just about data protection, your personal information
protection, but also from how accurately it accomplishes all commands or all interactions.
Well, it's tough to hear, because trust, yeah, you're absolutely right, but trust is such a high
bar with AI systems, because, and I see this because I work with autonomous vehicles, the bar that's placed on an AI system is
unreasonably high. Yeah, and I agree with you. I think of it as a challenge.
It's a challenge, and it also keeps my job.
Right. So from that perspective, I think of it from both sides, as a customer and as a researcher.
I think as a researcher, yes, occasionally it will frustrate me that why is the bar
so high for these AI's?
And as a customer, then I say, absolutely, it has to be that high.
So I think that's the trade-off we have to balance, but it doesn't change the fundamentals that trust has to be
earned. And the question then becomes, are we holding the AIs to a different bar in accuracy
and mistakes than we hold humans to? That's going to be a great societal question for years to come,
I think, for us. Well, one of the questions that we grapple with as a society now, that I think about
a lot, and I think a lot of people in AI think about a lot,
and that Alexa is taking on head-on, is privacy. The reality is, us giving over data to any AI system
can be used to enrich our lives in profound ways. So for maybe basically any product that does
anything awesome for you, the more data it has, the more awesome things it can do.
And yet, on the other side, people imagine the worst-case possible scenario of
what you can possibly do with that data. It goes back to the trust, as you
said before. There's a fundamental distrust of certain groups, of governments and so on, depending on
the government, depending on who's in power, depending on all these kinds of factors.
And so here's Alexa in the middle of all of it in the home, trying to do good things
for the customers.
So how do you think about privacy in this context, the smart assistance in the home?
How do you maintain, how do you earn trust?
Absolutely. So, as you said, trust is the key here.
So, you start with trust and then privacy is a key aspect of it.
It has to be designed from the very beginning with that in mind.
And we believe in two fundamental principles.
One is transparency and second is control.
So by transparency, I mean, when we built
what is now called a smart speaker, or the first Echo, we were quite judicious about making these
right trade-offs on customers' behalf, so that it is pretty clear when the audio is being sent to
the cloud. The light ring comes on when it has heard you say the wake word, and then the streaming happens, right?
So when the light ring comes up, we also had,
we put a physical mute button on it.
Just so if you didn't want it to be listening,
even for the wake word, then you turn the mute button on
and that disables the microphones.
That's just the first decision, essentially on transparency and control.
Then, even when we launched, we gave the control in the hands of the customers: you can go and look at any of your individual
utterances that are recorded and delete them anytime, and we have kept to that promise, right? So that is again a great instance of showing
how you have the control.
Then we made it even easier.
You can say, Alexa, delete what I said today.
So that is now putting even more control in your hands, using what's most convenient
about this technology, which is voice.
You can delete it with your voice now.
So these are the types of decisions we continually make. We just recently launched
this feature, what we think of it as, if you wanted humans not to review your data,
because you mentioned supervised learning, right? So in supervised learning, humans have
to give some annotation. And that also is now a feature where you can essentially opt out of having humans review your data.
So the control, the ability to delete, because we collect, we have studies here running
at MIT that collect huge amounts of data, and people consent and so on.
The ability to delete that data is really empowering.
And almost nobody ever asked to delete it, but the ability to have that control is really
powerful.
But still, you know, there's this popular anecdotal evidence that people like
to tell, that them and a friend were talking about something, I don't know, uh, sweaters for
cats, and all of a sudden they'll have advertisements for cat sweaters on Amazon. That's
a popular anecdote, as if something is always listening. Can you explain that anecdote, that experience that people have?
What's the psychology of that?
What's that experience?
And can you, you've answered it, but let me just ask, is Alexa listening?
No, Alexa listens only for the wake word on the device, right?
And the wake word is the words like Alexa, Amazon, Echo, but you only choose one at a time.
So you choose one and it listens only for that on our devices.
So that's first.
From a listening perspective, we have to be very clear that it's just the wake word.
So you said, why is there this anxiety, if you may?
Yeah, exactly.
It's because there's a lot of confusion.
What it really listens to, right?
And I think it's partly on us to keep educating our customers
and the general media more in terms of like,
what really happens, and we've done a lot of it.
And our pages on information are clear,
but still people have to have more,
there's always a hunger for information and clarity.
And we'll constantly look at how best to communicate.
If you go back and read everything, yes,
it states exactly that.
And then people could still question it.
And I think that's absolutely okay to question.
What we have to make sure is that we are,
because our fundamental philosophy is customer first,
customer obsession is our leadership principle.
Even as researchers, I put myself in the shoes of the customer, and all decisions
in Amazon are made with that lens. And trust has to be earned, and we have to keep
earning the trust of our customers in this setting.
And to your other point on like, is there something showing up based on your conversations?
No. I think the answer
is, a lot of times when those experiences happen, you have to also know that, okay, it may
be the winter season. People are looking for sweaters, right, and it shows up on your amazon.com
because it is popular. So there are many of these. You mentioned that personality, or personalization;
it turns out we are not that unique either.
Yeah.
So those things we as humans start thinking,
oh, must be because something was heard
and that's why this other thing showed up.
The answer is no.
Probably it is just the season for sweaters.
I'm not gonna ask you this question
because, it's just,
because people have so much paranoia.
But let me just say, from my perspective, I hope there's a day when a customer can ask Alexa to listen all the time
to improve the experience, because I personally don't see the negative
because if you have the control and if you have the trust, there's no reason why
I shouldn't be listening all the time to the conversations to learn more about you. Because ultimately, as long as you have control and trust, every
data you provide to the device that it wants is going to be useful. And so to me,
as a machine learning person, I think it worries me how sensitive people are about their
data relative to how empowering it could be for the devices around them, enriching it
could be for their own life to improve the product.
So it's something I think about sort of a lot, and something
that obviously Alexa thinks about a lot as well. I don't know if you want
to comment on that. So have you seen, let me ask you in the form of a question. Okay.
Have you seen evolution in the way people think about their private data in the previous
several years, as we as a society get more and more comfortable
with the benefits we get by sharing more data? First let me answer that part, and then I'll
want to go back to the other aspect you were mentioning. So as a society, in general,
we are getting more comfortable. It doesn't mean that everyone is, and I think
we have to respect that. I don't think one size fits all is always
going to be the answer for all, right, by definition.
So I think that's something to keep in mind in these.
Going back to your point on what more magical experiences
can be launched in these kind of AI settings.
I think again, if you give the control, it's
possible, certain parts of it. So we have a feature called Follow-Up Mode where, if
you turn it on, Alexa, after you've spoken to it, will open the mics again, thinking
you'll ask something again, like if you're adding items to your shopping list or to-do list.
You're not done. You want to keep going, so in that setting, it's awesome that it opens the mic for you to say,
add eggs and milk and then bread, right? So these are the kinds of things which you can empower.
And then another feature we have, which is called Alexa Guard. I said it only listens for the wake word, right?
But let's say
you leave your home and you want Alexa
to listen for a couple of sound events,
like a smoke alarm going off, or someone breaking your glass, right?
So it's just to keep your peace of mind.
So you can say, Alexa, on guard, or I'm away,
and then it can be listening for these sound events.
And when you're home,
you come out of that mode, right?
So this is another one where you again,
gave controls in the hands of the user or the customer
to enable some experience that is high utility
and maybe even more delightful in the certain settings
like follow up mode and so forth.
And again, this general principle is the same control in the hands of the customer.
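As a rough sketch of the guard idea just described, the snippet below only raises alerts for a small set of sound events, and only while an away flag is set. The event classifier is a random placeholder, not the production model:

```python
# Rough sketch of an "away" guard mode that only alerts on a few sound events
# (e.g., smoke alarm, glass break). classify_sound is a random placeholder,
# not a real audio-event model.
import numpy as np

SOUND_EVENTS = ("smoke_alarm", "glass_break")

def classify_sound(clip: np.ndarray) -> dict[str, float]:
    """Stand-in for a learned audio-event classifier returning probabilities."""
    rng = np.random.default_rng(abs(int(clip.sum() * 1e6)) % (2**32))
    return {event: float(rng.random()) for event in SOUND_EVENTS}

def monitor(clips: list[np.ndarray], away: bool, threshold: float = 0.9) -> list[str]:
    alerts = []
    if not away:                      # when you're home, guard mode does nothing
        return alerts
    for clip in clips:
        for event, prob in classify_sound(clip).items():
            if prob >= threshold:     # only notify on confident detections
                alerts.append(event)
    return alerts

clips = [np.random.rand(16000) for _ in range(3)]   # three one-second fake clips
print(monitor(clips, away=True))
```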
So I know we kind of started with a lot of philosophy and a lot of interesting topics and
we're just jumping all over the place, but really some of the fascinating things that
the Alexa team and Amazon are doing are on the algorithm side, the data side, the technology,
the deep learning, machine learning, and so on.
So can you give a brief history of Alexa from the perspective of just innovation, the
algorithms, the data, of how it was born, how it came to be, how it has grown, where it
is today?
Yeah, it starts with Amazon.
Everything starts with the customer, and we have a process called
working backwards. For Alexa, and more specifically the product Echo, there was a working backwards
document, essentially, that reflected what it would be. It started with a very simple
vision statement, for instance, that morphed into a full-fledged document; along the way it changed
into what all it can do.
But the inspiration was the Star Trek computer.
So when you think of it that way, everything is possible, but when you launch a product
you have to start some place.
And when I joined, the product was already in conception, and we started working on
far-field speech recognition,
because that was the first thing to solve. By that we mean that you should be able to speak to the
device from a distance, and in those days that wasn't a common practice, and even in the previous
research world I was in, it was considered an unsolvable problem in terms of whether you can converse from a distance.
And here I'm still talking about the first part of the problem
where you say, get the attention of the device,
as in by saying what we call the wake word, which
means the word Alexa has to be detected with a very high
accuracy, because it is a very common word.
It has sound units that map with words like I like you
or Alec, Alex.
Right, so it's an undoubtedly hard problem to detect.
The right mentions of Alexa's address to the device
versus I like Alexa.
So you have to pick up that signal
when there's a lot of noise.
Not only noise, but a lot of conversation in the house.
Right? Remember, on the device, you're simply listening for the wake word, Alexa.
And there's a lot of words being spoken in the house. How do you know it's Alexa
and directed at Alexa? Because I could say, I love my Alexa, I hate my Alexa,
I want Alexa to do this. And in all these three sentences, I said Alexa,
I didn't want it to wake up.
Yeah.
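For intuition, here is a minimal sketch of streaming wake-word detection: a model scores each audio window for the wake word, and a threshold plus a short refractory period turns those scores into wake events. The scoring function is a stand-in, not the on-device detector Amazon uses:

```python
# Illustrative sketch of streaming wake-word detection: score each audio
# window for "Alexa", then apply a threshold and a refractory period so one
# utterance doesn't trigger twice. keyword_score is a placeholder model.
import numpy as np

def keyword_score(window: np.ndarray) -> float:
    """Placeholder returning a fake P(wake word | audio window)."""
    return float(np.clip(window.mean(), 0.0, 1.0))

def detect_wake_word(stream: np.ndarray, win: int = 1600,
                     threshold: float = 0.85, refractory: int = 10) -> list[int]:
    wakes, cooldown = [], 0
    for start in range(0, len(stream) - win, win):
        if cooldown:                              # skip windows right after a wake
            cooldown -= 1
            continue
        if keyword_score(stream[start:start + win]) >= threshold:
            wakes.append(start)                   # here the device would light up and stream
            cooldown = refractory
    return wakes

audio = np.random.rand(16000 * 5)                 # five seconds of fake 16 kHz audio
print(detect_wake_word(audio))
```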
So, can I just pause on that for a second?
What would be your advice, that I should probably
give to people in the introduction of this conversation,
in terms of them turning off their Alexa device
if they're listening to this podcast conversation out loud?
Like, what's the probability that an Alexa device will go off because we mentioned Alexa
like a million times.
So we have done a lot of different things where we can figure out whether
the speech is coming from a human near the device versus over the air; also, think about ads,
so we also launched technology for filtering out those marketing kinds of approaches.
But yes, if this kind of a podcast is happening, it's possible your device
will wake up a few times, right? It's a not-solved problem, but it is definitely something we care very much about. But the point is, you want to detect Alexa meant for the device.
So the device, first of all, just even hearing Alexa versus I like something, I mean, that's
a fascinating part. So that was really the first part, the wake
word detector for Alexa. Yeah. And the second part is then recognizing the speech that follows,
the actual request addressed to the device. Of course, you're going to issue many different requests.
Some may be simple, some may be extremely hard,
but it's a large vocabulary speech recognition problem
essentially, where the audio is now not coming
onto your phone or a handheld mic like this
or a close talking mic,
but it's from 20 feet away,
where if you're in a busy household,
your son may be listening to music, your daughter
may be running around with something and asking your mom something and so forth.
So this is like a common household setting where the words you're speaking to Alexa need
to be recognized with very high accuracy.
Now we're still just in the recognition problem.
We haven't yet come to the understanding one, right?
And at that time, so, once again, what year was this? Is this before
neural networks began to start to seriously prove themselves in the audio space?
Yeah, this is around then. So I joined in April 2013, right.
The early research on neural networks coming back
and showing promising results
in the speech recognition space had started happening,
but it was very early.
And we built on that as the very first thing
we did when I joined the team.
And remember, it was very much a startup environment,
which is what's great about Amazon.
And we doubled down on deep learning right away,
and we knew we would have to improve accuracy fast.
And because of that, we worked on scale,
because the scale of data, once you have a device like this,
if it is successful, will grow big time.
You'll suddenly have large volumes of data
to learn from, to make the customer experience better.
So how do you scale deep learning?
So we did one of the first works in training
with distributed GPUs, where the training time
was linear in the amount of data.
That was quite important work;
it was algorithmic improvements
as well as a lot of engineering improvements
to be able to train on thousands and thousands of hours of speech.
And that was an important factor.
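As a rough illustration of the data-parallel idea behind that distributed-GPU work (a toy sketch, not the actual Alexa training stack): each worker computes gradients on its own shard of a minibatch, the gradients are averaged, which is an all-reduce on a real GPU cluster, and every worker applies the identical update, so adding workers lets you chew through more data in the same wall-clock time.

```python
# Toy illustration of synchronous data-parallel training (not Alexa's stack):
# each "worker" computes gradients on its own shard of the minibatch, the
# gradients are averaged (an all-reduce in a real cluster), and all workers
# apply the identical update. Throughput scales with the number of workers.
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression stand-in for an acoustic model: y = X @ w_true + noise
w_true = rng.normal(size=4)
X = rng.normal(size=(4096, 4))
y = X @ w_true + 0.01 * rng.normal(size=4096)

def grad(w, Xs, ys):
    """Mean-squared-error gradient on one shard of data."""
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

def train(num_workers=4, steps=200, lr=0.1, batch=256):
    w = np.zeros(4)
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch, replace=False)
        shards = np.array_split(idx, num_workers)
        # Each worker computes a gradient on its shard...
        grads = [grad(w, X[s], y[s]) for s in shards]
        # ...then an all-reduce averages them and everyone takes the same step.
        w -= lr * np.mean(grads, axis=0)
    return w

print("recovered weights:", np.round(train(), 3))
print("true weights:     ", np.round(w_true, 3))
```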
So if you asked me like back in 2013 and 2014,
when we launched Echo,
the combination of large-scale data,
deep learning progress,
near infinite GPUs we had available on AWS,
even then, all came together for us to be able
to solve the far field speech recognition
to the extent it could be useful to the customers.
It's still not solved, like I mean,
it's not that we're perfect at recognizing speech,
but we're great at it in terms of the settings
that are in homes, right?
And that was important even in the early stages.
So, first of all, just even, I'm trying to look back at that time.
If I remember correctly, it was, it seems like the task will be pretty daunting.
So, like, so we kind of take it for granted that it works now.
Yes.
Right. So, let me, like, how, first of all, you mentioned startup.
I wasn't familiar how big the team was.
I kind of, because I know there's a lot of really smart people working on it.
So now it's very, very large team.
How big was the team?
How likely were you to fail, in the eyes of everyone else?
And in yourself?
So I'll give you a very interesting anecdote on that.
When I joined, the speech recognition team
was six people.
By my first meeting we had a few more people;
it was 10 people.
Nine out of 10 people thought it can't be done.
Right? Who is the one?
The one was me.
Actually, I should say, I was the one who was, let's say, optimistic.
Yeah.
And eight were trying to convince me: let's go to the management and say,
let's not work on this problem.
Let's work on some other problem, like telephony speech for customer service calls
and so forth.
But this was the kind of belief you must have
and I had experience with far field speech recognition
and my eyes lit up when I saw a problem like that saying,
okay, we have been in speech recognition
always looking for that killer app.
And this was a killer use case
to bring something delightful in the hands of customers
You mentioned the way you kind of think of it in the product way: in the future, you have a press release and FAQ, and you think backwards.
Did you, did the team have the Echo in mind?
So this far-field speech recognition,
actually putting a thing in the home that works, that you're able to interact with,
was that the press release?
What was the...
Very close, I would say. In terms of,
as I said, the vision was the Star Trek computer, right?
So, or the inspiration.
And from there, I can't divulge all the exact specifications,
but one of the first things that was magical on Alexa
was music. It brought me back to music, because my taste is still stuck in when I was in undergrad, so I still listen to those songs. And it was too hard for me to be a music fan with a phone, right? I hate things in my ear. So from that perspective it was quite hard, and music was part of, at least, the documents I have seen, right? So from that perspective, I think, yes. In terms of how far we are from the original vision, I can't reveal that, but that's why I have so much fun at work, because every day we go in thinking, these are the new set of challenges to solve.
That's a great way to do great engineering as you think of the process.
I like that idea actually.
Maybe we'll talk about it a bit later.
It was just a super nice way to have a focus.
I'll tell you this, you're a scientist.
A lot of my scientists have adopted that.
They have now, they love it as a process because it was very, as scientists,
you're trained to write great papers, but they are all after you've done the research or
you've proven like, and your PhD dissertation proposal is something that comes closest
or a DARPA proposal or a NSF proposal is the closest that comes to a press release.
But that process is now ingrained in our scientists, which is delightful for me to see.
You write the paper first and then make it happen.
That's right.
I mean, in fact, you don't have the state-of-the-art results yet. Or you leave the results section open, but you have a thesis about, here's what I expect, and here's what would change.
I think it is a great thing; it works for researchers as well.
Yeah. So, far-field recognition. What was the big leap? What were the breakthroughs? And what was that journey like, to today?
Yeah, I think, as you said, first there was
a lot of skepticism on whether far field speech recognition will ever work to be good enough, right?
And what we first did was got a lot of training data in a far-field setting.
That was extremely hard to get because none of it existed.
How do you collect data in far-field setup with no customer base at the time?
There's no customer base.
That was first innovation. Once we had that, the next thing was, okay, if you have the data, first of all, we
didn't talk about like what would magical mean in this kind of a setting?
What is good enough for customers, right?
That's always, since you've never done this before, what would be magical?
So it wasn't just a research problem.
You had to put some in terms of accuracy and customer experience features, some stakes
on the ground saying, here's where I think it should get to.
So you established a bar, and then how do you measure progress towards it, given you have no customers?
That's right.
So from that perspective, we went: first was the data, without customers. Second was doubling down on deep learning as a way to learn.
And I can just tell you that the combination of the two
brought our error rates down by a factor of five
from where we were when I started. Within six months of having that data,
at that point, I got the conviction that this will work,
right?
So because that was magical in terms of when it started working.
And that reached the magical bar?
It came close to the magical bar, right?
To the bar that we felt would be where people would use it, which was critical.
Because you really have one chance at this.
If we had launched in November, 2014,
as when we launched, if it was below the bar,
I don't think this category exists
if you don't meet the bar.
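For reference, the "error rate" being tracked in speech recognition is typically word error rate: the word-level edit distance between the recognizer's output and a human reference transcript, divided by the length of the reference. The snippet below computes that generic metric; the example sentences are illustrative, not Alexa's numbers.

```python
# Word error rate (WER), the standard metric behind statements like
# "error rates came down by a factor of five": word-level edit distance
# between the recognized text and a human reference, divided by the
# reference length. Generic metric code, not Alexa-specific.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(1, len(ref))

print(word_error_rate("play the rolling stones", "play the rolling stones"))        # 0.0
print(word_error_rate("play the rolling stones", "play the stone temple pilots"))   # 0.75
```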
Yeah, and just having looked at voice-based interactions
like in the car, or earlier systems,
it's a source of huge frustration for people.
In fact, we use voice-based interaction
for collecting data on subjects to measure frustration.
As training data, like for computer vision,
for face data, so we can get a data set of frustrated people.
That's the best way to get frustrated people
is having them interact with the voice-based system
in the car.
So that bar I imagine is pretty high.
It was very high.
And we talked about how also errors are perceived
from AI's versus errors by humans.
But we are not done with the problems we ended up
having to solve to get it to launch.
So do you want the next one?
Yeah, the next one.
So the next one was what I think of as multi domain natural language understanding.
I wouldn't say it was easy, but in those days, solving understanding
in one domain, a narrow domain, was doable.
But for these multiple domains like music, like information, other kinds of household
productivity alarms, timers, even though it wasn't as big as it is in terms of the number of skills
Alexa has and the confusion space has like grown by three orders of magnitude, it was still daunting
even those days and again, no customer base yet. Again, no customer base. So now you're looking at meaning
understanding and intent understanding and taking actions on behalf of
customers based on their requests. And that is the next hard problem,
even if you have gotten the words recognized, how do you make sense of
them? In those days, there was still a lot of emphasis on rule-based systems for writing grammar
patterns to understand the intent, but we had a statistical-first approach even then, where
for language understanding we had, even in those starting days, an entity recognizer
and an intent classifier, which were all trained statistically.
In fact, we had to build the deterministic matching
as a follow up to fix bugs that statistical models have, right?
So it was just a different mindset
where we focused on data-driven statistical understanding.
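A minimal sketch of those two statistical pieces, purely illustrative and not Alexa's models: an intent classifier over the whole utterance plus an entity (slot) recognizer over its tokens. The tiny training set is invented, and the gazetteer-based tagger stands in for what would really be a statistically trained sequence model.

```python
# Illustrative sketch only: an intent classifier over the utterance plus a
# very naive entity tagger. The training data and gazetteer are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_utterances = [
    ("play songs by the rolling stones", "PlayMusic"),
    ("play some jazz",                   "PlayMusic"),
    ("what is the weather in boston",    "GetWeather"),
    ("will it rain tomorrow",            "GetWeather"),
    ("set an alarm for 7 am",            "SetAlarm"),
    ("wake me up at six",                "SetAlarm"),
]
texts, intents = zip(*train_utterances)

# Statistical intent classifier: bag-of-words/bigrams + logistic regression.
intent_clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                           LogisticRegression(max_iter=1000))
intent_clf.fit(texts, intents)

ARTIST_GAZETTEER = {"the rolling stones", "stone temple pilots", "led zeppelin"}

def tag_entities(utterance: str) -> dict:
    """Naive span matcher standing in for a trained slot tagger."""
    found = {}
    for artist in ARTIST_GAZETTEER:
        if artist in utterance.lower():
            found["ArtistName"] = artist
    return found

query = "play songs by the rolling stones"
print(intent_clf.predict([query])[0])   # -> PlayMusic (on this toy data)
print(tag_entities(query))              # -> {'ArtistName': 'the rolling stones'}
```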
Which works in the end if you have a huge data set?
Yes, it is contingent on that.
And that's why it came back to how do you get the data?
before customers. The fact is, this is why data becomes crucial,
to get to the point that you have the understanding system
built up.
And notice that, for you, we were talking about human-machine
dialogue. Even in those early days, even though it
was very much transactional, do one thing, one-shot utterances,
there was a lot of debate on how much Alexa should talk back if it misunderstood you. Or you said, play songs by the Stones, and let's say it doesn't know, in the early days knowledge can be sparse, who are the Stones, right?
I... the Rolling Stones.
And you don't want the match to be Stone Temple Pilots when you wanted the Rolling Stones, right? So you don't know
which one it is. So these kinds of other signals... now there we had great assets from Amazon in terms of...
Yeah, how do you solve that problem?
We think of it as an entity resolution problem, right?
Because which one is it?
I mean, even if you figured out "the stones" as an entity, you have to resolve it to whether
it's the Rolling Stones or Stone Temple Pilots or some other Stones.
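One way to picture entity resolution is as candidate scoring; the toy sketch below, with invented signals and weights rather than Amazon's, blends how well the spoken mention matches a catalog name with global popularity and the customer's own history.

```python
# Toy sketch of entity resolution (invented numbers, not real signals):
# given the mention "the stones", score candidate catalog entries by
# name match, global popularity, and this customer's own play history.
CANDIDATES = [
    {"name": "The Rolling Stones",  "popularity": 0.95, "user_plays": 40},
    {"name": "Stone Temple Pilots", "popularity": 0.60, "user_plays": 2},
    {"name": "Stone Sour",          "popularity": 0.40, "user_plays": 0},
]

def name_match(mention: str, name: str) -> float:
    """Crude lexical overlap between the spoken mention and a catalog name."""
    m, n = set(mention.lower().split()), set(name.lower().split())
    return len(m & n) / len(m | n)

def resolve(mention: str, candidates, w_match=0.5, w_pop=0.3, w_user=0.2):
    def score(c):
        user_affinity = min(1.0, c["user_plays"] / 20.0)
        return (w_match * name_match(mention, c["name"])
                + w_pop * c["popularity"]
                + w_user * user_affinity)
    return max(candidates, key=score)

print(resolve("the stones", CANDIDATES)["name"])  # -> The Rolling Stones
```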
Maybe I misunderstood.
Is the resolution the job of the, or is the job of UX
communicating with the human to help the resolution?
Well, there is both, right?
It is, you want 90% or high 90s to be done without any further questioning or UX, right?
But that is absolutely okay; just like as humans, we ask the question, I didn't understand you, Lex.
It's fine for Alexa to occasionally
say, I did not understand you, right?
And that's an important way to learn.
And I'll talk about where we have come
with more self-learning with these kind of feedback signals.
But in those days, just solving the ability
of understanding the intent and resolving it to an action,
where the action could be playing a particular artist or a particular song, was super hard. Again, the bar was high, as we were talking about, right? So we launched it with, I would say, 13 big domains, or I think
we think of it as 13 big skills, like music, which is a massive one, when we launched. And now we have 90,000-plus skills on Alexa.
So what are the big skills?
Can you just go over the, the only thing I use it for is music,
weather and shopping.
So we think of it as music, information, right?
Weather is a part of information, right?
So when we launch, we didn't have smart home,
but within this, by smart home, I mean,
you connect your smart devices,
you control them with voice.
If you haven't done it, it's worth,
it'll change your life.
I'm turning on the lights and so on.
Yeah, turning on your lights, or anything that's connected.
And it's just, that's your favorite smart device for you?
Right, lights.
Right. And now you have the smart plug, and we also have this Echo plug, which, oh yeah, you can make anything connected through it. And it's getting more proactive, where it even has hunches now, hunches like, you left your light on.
Let's say you've gone to your bed and you left
the garage light on, so it will help you out in these settings, right?
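The "hunch" idea can be pictured as simple anomaly detection over device state: if a device is in an unusual state for this hour, given the household's history, offer to fix it. The sketch below uses invented data and thresholds and is not Amazon's implementation.

```python
# Toy sketch of a "hunch" as anomaly detection over device state
# (invented data): if a light is on at an hour when it has almost always
# been off historically, offer to turn it off.
from collections import defaultdict

# history[device][hour] = list of observed states (1 = on, 0 = off)
history = defaultdict(lambda: defaultdict(list))
for night in range(30):                      # 30 nights of made-up history
    history["garage_light"][23].append(0)    # garage light normally off at 11 pm
history["garage_light"][23].append(1)        # ...except one forgetful night

def hunch(device: str, hour: int, current_state: int, threshold: float = 0.1):
    observations = history[device][hour]
    if not observations:
        return None
    p_on = sum(observations) / len(observations)
    # If the device is on but is almost never on at this hour, raise a hunch.
    if current_state == 1 and p_on < threshold:
        return f"It looks like the {device.replace('_', ' ')} is on. Want me to turn it off?"
    return None

print(hunch("garage_light", hour=23, current_state=1))
```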
So that's smart devices. Information, smart devices,
you said music.
So I don't remember everything we had
at launch, but timers were one of the big ones.
Timers were very popular right away.
Music also, like you could play by song, artist, album, everything.
And so that was like a clear win in terms of the customer experience.
So that's again, this is language understanding.
Now things have evolved, right?
So where we want Alexa, definitely to be more accurate, competent, trustworthy based on how well it does these core things.
But we have evolved in many different dimensions. First is what I think of it as being more
conversational for high utility, not just for chat, right? And at re:MARS this year,
which is our AI conference, we launched what is called Alexa Conversations.
That is providing the ability for developers to author multi-turn experiences on Alexa with essentially no code for the dialogue flow. Initially, it was like all these IVR systems: you have to fully author, if the customer says this, do that, so the whole dialogue flow is hand-authored.
With Alexa Conversations, the way it works is
that you just provide sample interaction data
with your service or API.
Let's say you're Atom Tickets, which provides a service
for buying movie tickets.
You provide a few examples of how your customers will
interact with your APIs, and then the dialogue flow is automatically constructed using a
recurrent neural network trained on that data.
So that simplifies the developer experience.
We just launched our preview for the developers to try this capability out.
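A heavily simplified sketch of the underlying idea, learning the dialogue policy from sample interactions instead of hand-authoring every branch: a model maps the current dialogue state to the next best action. In the real feature a recurrent network over the whole dialogue plays this role; here a tiny decision tree stands in, and all action names and sample dialogues are invented.

```python
# Heavily simplified sketch of learning a dialogue policy from sample
# conversations rather than hand-authoring every branch. A classifier maps
# the dialogue state to the next system action; in the real feature a
# recurrent network over the whole dialogue plays this role. Action names
# and samples below are invented for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Each example: (dialogue state) -> next best system action, distilled from
# a developer's sample interactions with a hypothetical ticketing API.
samples = [
    ({"intent": "BuyMovieTickets", "has_movie": False, "has_count": False}, "ask_which_movie"),
    ({"intent": "BuyMovieTickets", "has_movie": True,  "has_count": False}, "ask_how_many_tickets"),
    ({"intent": "BuyMovieTickets", "has_movie": True,  "has_count": True},  "call_purchase_api"),
    ({"intent": "FindShowtimes",   "has_movie": True,  "has_count": False}, "call_showtimes_api"),
    ({"intent": "FindShowtimes",   "has_movie": False, "has_count": False}, "ask_which_movie"),
]
states, actions = zip(*samples)

policy = make_pipeline(DictVectorizer(sparse=False),
                       DecisionTreeClassifier(random_state=0))
policy.fit(states, actions)

state = {"intent": "BuyMovieTickets", "has_movie": True, "has_count": False}
print(policy.predict([state])[0])  # -> ask_how_many_tickets
```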
Then the second part of it which shows
even increased utility for customers is you and I when we interact with Alexa or any customer
as I'm coming back to our initial part of the conversation the goal is often unclear
or unknown to the AI. If I say Alexa what movies are playing nearby
am I trying to just buy movie tickets?
Am I actually even,
do you think I'm looking for just movies for curiosity,
whether the Avengers is still in theaters or not?
Maybe it's gone, and maybe it will come on Prime;
I missed it, so I may watch it on Prime,
which happened to me.
So from that perspective now,
you're looking into what is my goal?
And let's say I now complete the movie ticket purchase.
Maybe I would like to get dinner nearby.
So what is really the goal here?
Is it night out or is it movies?
As in just go watch a movie.
The answer is, we don't know. So can Alexa now
have the intelligence to figure out that this meta-goal is really a night out, or at
least say to the customer, when you've completed the purchase of movie tickets
from Atom Tickets or Fandango or pick anyone, then the next thing is: do you
want to get an Uber to the theater, right? Or do you want
to book a restaurant next to it, and then not ask the same information over and over
again? What time? How many people in your party? So this is where you shift the cognitive burden from the customer to the AI,
where it's thinking of what is your, it anticipates your goal and takes the next best action
to complete it. Now that's the machine learning problem. But essentially, the way we saw this
first instance, and we have a long way to go to make it scale
to everything possible in the world, but at least for this situation, it is from, at every
instance, Alexa is making the determination of whether it should stick with the experience
with Atom Tickets or, based on what you say, whether you have completed
the interaction, or you said, no, get me an Uber now.
So it will shift context into another experience or skill
on another service.
So that's a dynamic decision making.
That's making Alexa, you can say more conversational
for the benefit of the customer,
rather than simply completing transactions,
which are well thought through; you as a customer have fully specified what you want to be accomplished,
and it's accomplishing that.
So it's kind of, as we do this with pedestrians,
like intent modeling, it's predicting what your possible goals are,
what's the most likely goal, and then switching that depending on the things you say.
So my question is, it seems, maybe it's a dumb question, but it would help a lot if Alexa remembered
what I said previously, right? Is it trying to use some memory of the customer?
Yeah, it is using a lot of memory within that.
So right now, not so much in terms of, okay, which restaurant you prefer, right? That is more long-term memory.
But within the short term memory, within the session,
it is remembering how many people; so if you said buy four tickets,
now it has made an implicit assumption
that you are going to need at least four seats at a restaurant, right?
So these are the kinds of context it's preserving
between these skills within that session.
But you're asking the right question: in terms of it being more and more useful,
it has to have more long-term memory, and that's also an open question, and again,
this is still early days.
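A minimal sketch of that within-session memory, with invented structure rather than Alexa's internals: a slot captured while buying movie tickets is carried into the restaurant step as a default, and everything is dropped when the session ends.

```python
# Minimal sketch of within-session context carryover (invented structure):
# a slot captured while buying movie tickets ("four tickets") is reused as a
# default when the conversation shifts to booking a restaurant, and nothing
# survives past the session.
class SessionContext:
    def __init__(self):
        self.slots = {}

    def remember(self, **slots):
        self.slots.update(slots)

    def get(self, name, default=None):
        return self.slots.get(name, default)

    def end_session(self):
        self.slots.clear()  # short-term memory only

ctx = SessionContext()

# Turn 1: "Buy four tickets for the 7 pm show"
ctx.remember(party_size=4, showtime="19:00")

# Turn 2: "Book a restaurant nearby" -- party size is carried over implicitly
party = ctx.get("party_size", default=2)
print(f"Looking for a table for {party} near the theater after {ctx.get('showtime')}.")

ctx.end_session()
print(ctx.slots)  # {} -- nothing survives past the session
```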
So for me, I mean, everybody's different, but yeah, I'm definitely not
representative of the general population in the sense that I do the same thing every day.
Like I eat the same thing, I do everything the same, wear the same thing, clearly, this or the black shirt.
So it's frustrating when Alexa doesn't get what I'm saying, because I have to correct her every
time in the exact same way. This has to do with certain songs: she doesn't know certain weird songs,
and, I've complained to Spotify about this, talked to the head of R&D at Spotify, it's "Stairway to Heaven."
I have to correct it every time.
It doesn't play Led Zeppelin correctly;
it plays a cover of Led Zeppelin.
So...
You should, the next time it fails, feel free to send it to me. We will take care of it.
Okay.
Well, you know what, it's one of my favorite things. It works for me, so I'm shocked it doesn't work for you.
This is an official bug report. I'll make it public, I'll retweet it.
We're going to fix this.
There you go.
Anyway, but the point is, you know, I'm pretty boring and do the same thing.
But I'm sure most people do the same set of things. Do you see Alexa sort of utilizing that in the future for improving the experience?
Yes, and not only utilizing it, it's already doing some of it.
We call it where Alexa is becoming more self-learning.
So Alexa is now auto-correcting millions and millions of utterances in the US without
any human supervision involved.
The way it does it is, let's take an example of a particular song that didn't work for
you.
What do you do next?
Either it played the wrong song and you said Alexa, no, that's not the song I want or
you say Alexa, play that, you try it again.
And that is a signal to Alexa that she may have done something wrong.
And from that perspective, we can learn if there's that failure pattern or that action of
song A was played when song B was requested. And it's very common with station names because
with "play NPR," you can have the N confused as an M. And for a certain accent
like mine, people confuse my N and M all the time; because I have an Indian accent, it's
confusing for humans, and it is for Alexa too. But it starts auto-correcting there,
and we correct a lot of these automatically, without a human looking at the failures.
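A sketch of that implicit-feedback loop, with invented data and thresholds rather than the production system: when a request is quickly followed by a barge-in or a rephrase that succeeds, log the pair, and once the same failed-to-successful rewrite is seen often enough, apply it automatically.

```python
# Sketch of learning rewrites from implicit feedback (invented data): when a
# request is quickly followed by a barge-in or rephrase that succeeds, log the
# (failed, successful) pair; once the same rewrite has enough support, apply
# it automatically the next time the failing utterance shows up.
from collections import Counter, defaultdict

rewrite_counts = defaultdict(Counter)

def log_interaction(failed_utterance: str, successful_followup: str):
    """Called when a user barges in / rephrases and the second attempt works."""
    rewrite_counts[failed_utterance][successful_followup] += 1

def rewrite(utterance: str, min_support: int = 3):
    """Return a learned rewrite if one is well supported, else the original."""
    candidates = rewrite_counts.get(utterance)
    if candidates:
        best, count = candidates.most_common(1)[0]
        if count >= min_support:
            return best
    return utterance

# Simulated traffic: "play m p r" keeps being actioned wrong and users
# immediately rephrase to "play npr", which works.
for _ in range(5):
    log_interaction("play m p r", "play npr")

print(rewrite("play m p r"))   # -> "play npr"
print(rewrite("play jazz"))    # -> unchanged, no evidence of failure
```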
So, one of the things that's missing for me in Alexa, I don't know if I'm a representative customer,
but every time I correct it, it would be nice to know that that made a difference.
Yes.
You know what I mean? Like a sort of, "I heard you," a sort of...
Some acknowledgement of that.
We work a lot with Tesla, we study the autopilot and so on.
And a large amount of the customers that use Tesla autopilot,
they feel like they're always teaching the system.
They're almost excited by the possibility that they're teaching.
I don't know if Alexa customers generally think of it as they're teaching to improve
the system.
And that's a really powerful thing.
Again, I would say it's a spectrum.
Some customers do think that way and some would be annoyed by Alexa acknowledging that
or so there's a, again, no one, you know, while there are certain patterns, not everyone
is the same in this way.
But we believe that again, customers helping Alexa is a talent for us in terms of improving it. And more self-learning is by, again, this is like fully unsupervised, right? There is no
human in the loop and no labeling happening. And based on your actions as a customer Alexa becomes smarter. Again it's early days, but
I think this whole area of teachable AI is going to get bigger and bigger in the whole space,
especially in the AI assistant space. So that's the second part where I mentioned more conversational,
this is more self-learning. The third is more natural.
And the way I think of more natural is we talked about how Alexa sounds.
And we have done a lot of advances in our text to speech by using, again, neural network
technology for it to sound very human-like.
And the individual texture of the sound, the timing, the timbre, and the tone
of everything?
I would think, in terms of, there are a lot of controls in each of these places for, I mean,
the speed of the voice, the prosodic patterns, the actual smoothness of how it sounds.
All of those are factored in, and we do a ton of listening tests to make sure it sounds natural. But beyond how it sounds
being very natural, how it understands requests
is also very important.
And in terms of understanding requests, we have 95,000 skills,
and imagine that in many of these skills,
you have to remember the skill name
and say, Alexa, ask the Tide skill to tell me X, right?
Now, if you have to remember the skill name,
that means the discovery and the interaction is unnatural.
And we are trying to solve that by what we think of as, again,
you don't have to have the app metaphor here.
These are not individual apps, right? Even though they're, so you're not sort of opening one
at a time and interacting.
So it should be seamless because it's voice.
And when it's voice, you have to be able to understand
these requests independent of this specificity,
like a skill name.
And to do that, what we have done is again,
built a deep learning base capability
where we shortlist a bunch of skills
when you say something like, get me a car. And then we figure out, okay, was it meant for an Uber skill versus a Lyft,
based on your preferences, and then we can rank the responses from the skills and choose
the best response for the customer. So that's on the more natural front. Another example of more natural
is lists, for instance: you don't want to say
Alexa, add milk; Alexa, add eggs; Alexa, add cookies. No: Alexa, add cookies,
milk, and eggs, in one shot. So that works now, and that helps with the naturalness. We talked
about memory: you can say, Alexa, remember I have to go to mom's house, or you may have entered a calendar event
through a calendar that's linked to Alexa. You don't remember whether it's in my calendar, or did I tell you to remember something,
or is it some other reminder, right? So now, independent of how customers create these events,
it should just work: Alexa, when do I have to go to mom's house?
And it tells you when you have to go to mom's house.
That's a fascinating problem.
Whose problem is that?
So there's people who create skills.
Who's tasked with integrating all of that knowledge together
so the skills become seamless?
Is it the creators of the skills,
or is it the infrastructure that Alexa provides?
It's both. I think the large problem, in terms of making sure your skill quality is high,
that has to be done by our tools, because, just to put it in context,
these skills are built through the Alexa Skills Kit, which is a self-serve way of building an
experience on Alexa. Any developer in the world could go to the Alexa Skills Kit and build an
experience on Alexa. If you're Domino's, you can build a Domino's skill, for instance, that
does pizza ordering. When you have authored that, if people say, Alexa, open Domino's, or Alexa,
ask Domino's to get a particular type of pizza, that will work. But discovery is hard;
you can't just say, Alexa, get me a pizza, and then Alexa figures out what to do.
That latter part is definitely our responsibility in terms of when the request is not fully specific.
How do you figure out what's the best skill or a service that can fulfill the customer's
request?
And it can keep evolving.
Imagine going to the situation I said, which was the night out planning that the goal could
be more than that individual request that came up.
A pizza ordering could mean a nighting. Yeah. You're having an event with
your kids in their house and you're so this is welcome to the world of conversationally AI.
This is super exciting, because it's not the academic problem of NLP, of natural
language processing, understanding dialogue. This is real world, and the stakes are
high in the sense that customers get frustrated quickly, people
get frustrated quickly.
So you have to get it right, you have to get that interaction right.
So I love it.
But so from that perspective, what are the challenges today?
What are the problems that really need to be solved in the next few years?
I think, first and foremost, as I mentioned, getting
the basics right is still crucial. Basically, even the one-shot requests, which we think of as
transactional requests, need to work magically. No question about that. If it doesn't turn your
light on and off, you'll be super frustrated. Even if I can complete the night out for you and not do that,
that is unacceptable for you as a customer, right? So you have to get the foundational understanding
going very well. The second aspect, when I said more conversational, as you can imagine, is more
about reasoning. It is really about figuring out what the latent goal of the customer is, based on
the information I have now and the history,
and what's the next best thing to do. So that's a complete reasoning and decision-making problem,
just like a self-driving car, but there the goal is more finite; here it evolves. In self-driving, your
environment is super hard and the cost of a mistake is huge. Here, there are certain similarities,
but if you think about how many decisions Alexa is making
or evaluating at any given time,
it's a huge hypothesis space.
And we've only talked so far
about what I think of as reactive decisions,
in terms of you asked for something
and Alexa is reacting to it.
If you bring in the proactive part, which
is Alexa having hunches, then at any given instance, it's really a decision: at any given
point, based on the information it has, Alexa has to determine what's the best thing it needs
to do. So these are the ultimate AI problems: what decisions to make based on the information you
have.
Do you think, just in my perspective, I work a lot with sensing of the human face.
Do you think, and we touched this topic a little bit earlier, but do you think it'll be
a day soon when Alexa can also look at you to help improve the quality of the hunch it
has, or at least detect frustration, or detect, you know, improve the quality of its perception
of what you're trying to do.
I mean, let me bring it back to what it already does.
We talked about how, based on you barging in on Alexa, clearly there's a very high probability
it must have done something wrong.
That's why you barged in.
The next extension of whether frustration is a signal or not,
of course, is a natural thought in terms of how that
should be in a signal to it.
You can get that from voice.
You can get from voice, but it's very hard.
I mean, frustration as a signal historically,
if you think about emotions of different kinds,
you know, there's a whole field of affective computing, something that MIT has also done a lot of
research in, and it is super hard. And you're now talking about a far-field device, as in you're talking
at a distance, in a noisy environment, and in that environment, it needs to have a good sense for your
emotions. This is a very, very hard problem.
Very hard problem, but you haven't shied away from hard problems.
So deep learning has been at the core of a lot of this technology.
Are you optimistic about the current deep learning approaches to solving the hardest aspects
of what we're talking about? Or do you think there will come a time when new ideas need to be invented?
You know, if you look at reasoning,
OpenAI, DeepMind, a lot of folks are now starting to work on reasoning, trying to
see how you can make neural networks reason.
Do you see that new approaches need to be invented to take the next big leap?
Absolutely.
I think there has to be a lot more investment in, I think, in many different
ways.
And there are these, I would say, nuggets of research forming in a good way, like learning
with less data or like zero-shot learning, one-shot learning.
And the active learning stuff you've talked about is incredible stuff.
So transfer learning is also super critical, especially when you're thinking about applying knowledge from one task to another or one language to another, right?
It's really ripe. So these are great pieces. Deep learning has been useful, and now we are sort of marrying deep learning with
transfer learning and active learning. Of course, that's most straightforward in terms of applying deep learning in an active learning setup.
But I do think, in terms of now looking into more reasoning-based approaches, that is going to be key
for our next wave of the technology. But there is good news. The good news is that I think,
for continuing to delight customers, a lot of it can be done by prediction tasks.
Yeah.
So, and so we haven't exhausted that.
So, yeah, so it's, we don't need to give up on the deep learning approaches for that.
So if I just wanted to sort of create a rich, fulfilling, amazing experience that makes
Amazon a lot of money, and everybody a lot of money, because it does awesome things,
deep learning is enough?
No, I mean, I wouldn't say deep learning is enough.
I think for the purposes of Alexa accomplishing the task for customers,
I'm saying there's still a lot of things we can do with prediction-based approaches that
do not reason, right?
And we haven't exhausted those,
but for the kind of high-utility experiences that I'm personally passionate about, of what Alexa needs to do, reasoning has to be solved. To the same extent as, you can think of,
natural language understanding and speech recognition, to the extent of understanding intents, how accurate
it has become. But with reasoning, we are in very, very early days.
To ask it another way, how hard of a problem do you think that is?
Hardest of them. I would say the hardest of them, because, again,
the hypothesis space is really, really large.
And when you go back to what you were saying,
I want Alexa to remember more things.
Once you go beyond a session of interaction,
and by session I mean a time span,
which is today, versus remembering
which restaurant I like,
and then when I'm planning a night out, saying,
do you want to go to the same restaurant?
Now you've upped the stakes big time, and this is where the reasoning dimension also gets
very, very big.
So, can you elaborate on that a little bit?
Just philosophically speaking, do you think, when you reason about trying to model what
the goal of a person is in the context of interacting with Alexa,
that space is huge?
It's huge, absolutely.
You think so? Like, another sort of devil's-advocate view would be that we human beings are really
simple and we all want just a small set of things.
And we're not talking about fulfilling general conversation;
perhaps the Alexa Prize is a little bit more about that.
So many of the interactions, it feels like, are clustered in groups that
don't require general reasoning.
I think yeah, you're right in terms of the head of the distribution of all the possible
things customers may want to accomplish.
The tail is long and it's diverse.
Right?
There are many, many long tails.
So, from that perspective, I think you have to solve that problem.
And everyone's very different.
Like, I mean, we see this already in terms of the skills, right?
I mean, if you're an avid surfer, which I am not, right?
But somebody is asking Alexa about surfing conditions, right?
And there's a skill that is there for them to get to, right?
That tells you that the tail is massive.
Like in terms of like what kind of skills people have created,
it's humongous in terms of it,
and which means there are these diverse needs.
And when you start looking at the combinations of these, right,
even if you just look at pairs of skills, 90,000 choose 2
is still a big number of combinations.
So I'm saying there's a huge to-do here.
And customers get wonderfully frustrated with things;
they're going to keep expecting us to do better things for them,
and they're not known to be super patient.
So you have to do it fast.
You have to do it fast.
So, you've mentioned the idea of a press release. In research and development
at Amazon Alexa, and Amazon in general, you kind of think of what the future product will look like, and then you make it happen, you work backwards.
So can you draft for me,
you probably already have one, but can you make up one for 10, 20, 30, 40 years out that you see the Alexa team putting out, just in broad strokes, something that you dream about?
I think let's start with the five years first,
and I'll get to the 40 years too.
I'm pretty sure you have a real five-year one that you didn't want to share.
But yeah, in broad strokes, start with five years.
I think in these spaces it's hard, especially if you're in the thick of things, to think beyond the five-year space, because a lot of things change. I mean, if you ask me five years back, will Alexa
be here? I wouldn't have... I think it has surpassed my imagination of that time, right?
So I think, from the next-five-years perspective, from an AI perspective, what we're
going to see is that the notion you mentioned, goal-oriented dialogues versus open domain,
like the Alexa Prize, I think that bridge is going to get closed. They won't be different. And I'll give you why that's the case.
You mentioned shopping.
How do you shop?
Do you shop in one shot? Sure, for your double-A batteries.
Paper towels, yes.
How long does it take for you to buy a camera?
You do ton of research.
Then you make a decision.
So is that a goal-oriented dialogue, when somebody says, Alexa, find me a camera?
Or is it simply transactional?
Right.
So even in something that you think of as shopping, which you said you yourself
use a lot, if you go beyond where it's reorders, or items where you're not brand conscious,
and so forth.
Yeah, just to comment quickly, I've never bought anything through Alexa that I haven't bought before on Amazon on a
desktop, after I clicked around a bunch and read a bunch of reviews, that kind of stuff. So it's repurchasing.
So now, you think even for something that you felt like
is a finite goal, I think the space is huge
because even products, the attributes are many
like, and you want to look at reviews,
some on Amazon, some outside; you want to look at
what CNET is saying, or what another consumer forum is saying,
about a product, for instance, right?
So that's just a,
that's just shopping where you could argue
that the ultimate goal is sort of known.
And we haven't talked about,
Alexa, what's the weather in Cape Cod this weekend?
Why am I asking that weather question, right?
So I think of it as: how do you complete goals with minimum steps for our customers?
And when you think of it that way, the distinction between goal-oriented dialogues and conversations in the open domain sort of goes away.
I may want to know what happened in the presidential debate,
and am I seeking just information, or am I looking at who's winning,
winning the debates, right? So these are all quite
hard problems.
So even on the five-year horizon, I sure hope we'll solve these. You're optimistic, because that's the hard problem.
Which part?
The reasoning, you know, enough to be able to help explore complex goals that are beyond something simplistic. That feels like it could be, well, five years is a nice...
It's a nice bar.
Yeah, right, I think it's a nice ambition. And do we have press releases for that? Absolutely.
Can I tell you specifically what the roadmap will be? No.
And will we solve all of it in the five-year space? No, we will work on this forever.
This is the hardest of the AI problems. And I don't see that being solved even in a 40
year horizon, because even if you limit it to human intelligence, we know we are quite far from that.
In fact, every aspect, from our sensing to neural processing to how the brain stores information
and how it processes it, we don't yet know how to represent knowledge.
So we are still in those early stages.
That's why I wanted to start at the five years, because the five-year success would look like solving these complex goals,
and the 40-year would be where it's just natural to talk to these agents in terms of more of these complex goals.
Right now we've already come to the point where these transactions you mentioned of asking for weather
or reordering something or listening to your favorite tune, it's natural
for you to ask. It's now unnatural to pick up your phone, right? And that, I think, is the first five-year
transformation. The next five-year transformation would be: okay, I can plan my weekend with Alexa,
or I can plan my next meal with Alexa, or my next night out, with seamless effort.
So just to pause and look back at the big picture of it all,
you're a part of a large team
that's creating a system that's in the home,
that's not human, that gets to interact with human beings.
So we human beings, we these descendants of apes,
have created an artificial intelligence system that's able to have conversations.
I mean, that to me, the two most transformative robots of this century, I think, will be autonomous vehicles,
but they're a little bit transformative in a more boring way.
It's like a tool.
I think conversational agents in the home is like an experience.
How does that make you feel that you're at the center of creating that?
Do you sit back in awe sometimes? What is your feeling about the whole thing?
Can you even believe that we're able to create something like this?
I think it's a privilege.
I'm so fortunate where I ended up.
And it's been a long journey.
Like I've been in the space for a long time in Cambridge.
And it's so heartwarming to see the kind of adoption conversational
agents are having now. Five years back, it was almost like, should I move out of this
because we are unable to find the skill or application that customers would love, that
would not simply be a good-to-have thing in research labs. And it's so fulfilling to see it make a difference
to millions and billions of people worldwide. The good thing is that's still very early.
So I have another 20 years of job security doing what I love. So I think from that perspective,
I feel, and I tell every researcher that joins, every member of my team: this is
a unique privilege.
And I would say not just launching Alexa in 2014, which was first of its kind;
along the way, when we launched the Alexa Skills Kit, it became about democratizing AI,
when before that there was no good evidence of an SDK for speech and language.
Now we are coming to the point where you and I are having this conversation, where I'm not saying,
oh, Lex, planning a night out with an AI agent is impossible.
I'm saying it's in the realm of possibility.
And not only possible, we'll be launching this, right?
Some elements of that. And it will keep getting better.
We know that is a universal truth.
Once you have these kinds
of agents out there being used, they get better for your customers. And I think that's
where the amount of research topics we are throwing out at our budding researchers
is just going to be exponentially harder. And the great thing is you can now get immense
satisfaction by having customers use it, not just from a paper at NeurIPS
or another conference.
I think everyone, myself included, is deeply excited about that future.
So I don't think there's a better place to end it. Rohit,
thank you so much.
Thank you so much.
It was fun.
Thank you.
Same here.
Thanks for listening to this conversation with Rohit Prasad.
And thank you to our presenting sponsor, Cash App.
Download it, use code LexPodcast, you'll get $10, and $10 will go to FIRST,
a STEM education nonprofit that inspires hundreds of thousands of young minds to learn and
to dream of engineering our future.
If you enjoy this podcast, subscribe on YouTube, give it 5 stars on Apple Podcasts, support
it on Patreon, or connect with me on Twitter.
And now, let me leave you with some words of wisdom from the great Alan Turing.
Sometimes, it is the people no one can imagine anything of who do the things no one can imagine.
Thank you for listening, and hope to see you next time.