Embedded - 178: Alexa Stop
Episode Date: December 7, 2016

We spoke with Chris Maury (@CMaury) about using speech recognition to interact with devices. Note: Please turn off your Echo and Dots as we invoke Alexa a lot.

Chris is the founder of Conversant Labs. They created TinCan.ai, which can help you wireframe or prototype a conversational user interface. They can also help you build Alexa Skills, though if you are so inclined, you might try it for yourself: Alexa Skills Kit.

Chris will be speaking at the O'Reilly Design Conference in San Francisco, CA in March 2017, giving a tutorial on building voice-based user interfaces. You can read more from Chris on his Medium posts: medium.com/@CMaury.

CMU PocketSphinx

Some of the embedded devices Elecia mentioned:

Audeme (as heard on The Amp Hour #258)

Grove Speech Recognizer from Seeed

EasyVR

We haven't gotten embedded.fm (or any podcast) to work with Alexa but we aren't sure why. Have you?
Transcript
Hello. Before we even start the show, I have to warn you that if you have an Amazon Echo or Dot device nearby, please unplug it.
Or unplug it after the first 10 minutes, because we're going to say her name a lot.
And yeah, if you don't want her to tell you jokes or shark facts or whatever it is we ask her to do during the show, now is the time to
unplug your Amazon Echo device.
Hello and welcome to Embedded.
I'm Elecia White alongside Christopher White.
Our guest this week is Chris Maury, who is going to tell us about speech recognition.
Before we get started, we will be announcing the Tinker Kit winner next week.
If you enter by December 9th, it might be you.
Hi, Chris. Welcome to the show.
Hi, Chris. Hi, Elecia. Thanks for having me.
Could you tell us about yourself?
Yeah. So I'm the founder of a company called Conversant Labs, which we build tools for designers and developers who are interested in building voice-based applications, whether that's for smartphones or the Amazon Echo, or really in general.
We're really excited about voice and just making it as easy as possible to get that done. Excellent.
And of course, we have many, many questions about that.
But before that, we usually do this thing called lightning round where we ask you short questions and we hope for short answers.
And if we're behaving ourselves, we don't ask you why and how.
And could you tell us more?
So you ready?
I'll do my best.
Donuts, bagels, or other ring-shaped breakfast treats?
Bagels.
When you do a design, should it be for new users and ease of use,
or should it be for great flexibility for experienced users?
Oh, oh man.
This is sort of Notepad versus Emacs.
I see.
Well, I would definitely choose Notepad over Emacs any day.
I think for the first step, it should be whatever gets your idea recorded as easily and quickly as possible,
where you don't lose the thought or the design.
Related: form or functionality?
Oh, functionality.
100%.
Least favorite planet?
Least favorite planet?
I don't have one.
I know that's sad.
I did go to space camp in fourth grade,
so I should have an answer to this.
We'll say, um, Venus.
Should we bring back the dinosaurs in any context?
Yes.
Favorite fictional robot?
Oh, this is the question I couldn't come up with an answer to. I know, I'm just ruining the premise of lightning round.
It's such a good question, though,
and I'm very upset.
Oh, Data from Star Trek,
Next Generation.
Do you listen to podcasts?
Yes.
What is your favorite,
this show excluded?
I don't think he listens.
I know.
I just feel like I have to
disclaim that every time.
There are a lot of very good ones.
I think the one that I most eagerly wait for
and am most disappointed when it doesn't come out regularly
is the Exponent podcast.
They have really good conversations
about the state of technology and technology strategy.
It's very nerdy techno business stuff,
which is kind of trite at this point,
but it's really, really helpful when you're running a technology company.
Cool.
I haven't heard that one.
Yeah,
that sounds cool.
I highly,
highly recommend it.
What speed do you listen to podcasts at?
As fast as the Overcast app will let me,
which ranges from like 2.5 to 2.8.
But if you guys know of a different podcasting app
that could let me listen to it more quickly,
I would be very grateful.
You have probably a better vocal processing front end
in your cranium than I do.
2.8 is pushing it.
So you build yourself up to it.
And this might get into a longer conversation,
but I use text-to-speech to read
most of the content that I read, whether it's blog posts or books, because my vision is going bad.
And I've worked up from normal reading speeds to 700 words a minute.
That's great. And, you know, after I heard Chris say on a different show, one of the O'Reilly podcasts, Solid or Design, that he listened so fast, I started to bump up my Overcast speed every time I listened this week.
And it got, it really, it worked.
I mean, I was surprised. I was up at 1.8 from about 1.25, which is what I had been listening at.
It was, yeah, it worked as long as I didn't jump.
Yeah. Yeah.
Okay, so now we should talk about why you're here, which is actually one more lightning round question.
Okay.
Voice recognition or speech recognition?
I would say speech. I think that's the more standard word choice.
Because that's what the academics use.
Because we don't really care about recognizing voices
unless you're doing passcodes.
And what you really care about is speech.
Right.
So I think when you're talking about taking audio
and turning it into text,
that would be speech recognition.
I think voice, like you said, would be more closely associated
with identifying who is doing the speaking.
So like speaker identification.
And that's also really, really helpful and useful,
but we're not really there yet.
Well, from a more mundane perspective,
VR is already an overloaded acronym.
Yes.
Yes.
Okay, so speech recognition.
And that leads to Siri and Alexa.
Should we not use that word?
Alexa?
Or Siri.
Alexa, tell me a joke.
No, don't do this to people.
Sorry, everybody.
Every phone call I have
someone's echo goes off
without fail
so there's Siri and
Alex something
and these
are the two most well known
speech
recognition things
and Google
Google Voice? Google Now? I don't know what the keyword is.
Google Now, I think, or something.
It's now Google Assistant.
Okay. Will it tell me a joke too?
Probably not a good one.
Well, Alex-something's are also terrible, so...
Oh, actually, Google hired screenwriters and comedians to do their script writing.
And so the jokes on Google Assistant are significantly better than Amazon Echo.
And this is actually, for everyone listening, this is what I use my Echo for.
It has to tell me jokes.
It sets timers.
It occasionally plays music, but it mostly tells me jokes. And if you think that is worth the $50 entry price, I have to say it was for me because the jokes are awful.
But, okay, so Alexa seems better than Siri.
And Alexa, stop.
And so why is that?
So this gets to kind of what the latest kind of technological innovation in the last couple of years is in speech recognition.
So Siri has been around for a handful of years now, and she's finally gotten good enough, as with all the other services, at understanding
what we say. So she can recognize the words that we're saying with a very high percentage of
accuracy, no matter the environment and no matter the accent. What she still struggles with and what
Alexa does a better job of is understanding what we mean. So the vast majority of the time, Siri will just punt your query or command
to a Bing web search
where Alexa is much more focused
on application-specific actions
and enabling developers to build apps
that can respond to your specific requests. So the difference is this
ability to understand the meaning of the words that you are using. And Alexa has done a much
better job of that and Google Assistant as well. But they all go to the internet.
They do all go to the internet. Though starting with iOS 10, Apple has said that speech recognition can be done on device for certain devices. And they haven't been very specific about that.
Probably the newest.
Yeah.
Right.
So you said they all get pretty good at recognizing. That is not my experience.
Is it because I talk wrong? Is it because I have a higher voice and they're trained for male voices? Is it because I tend to do the stupid pauses and then sort of sing for the rest of my sentence?
If I had to pick one, it would probably be the last one that you said.
I'm sure that the data... So the way that all of these systems work is by collecting tons and tons and tons of data,
of audio recordings of different people speaking with different accents and cadences
and frequencies in different
environments, and then using that to train machine learning to recognize the next time
someone says something.
So I would imagine that the data is very or fairly representative of the broader market.
So men and women of all age groups.
But I think being more sing-songy in your elocution
might throw Siri for a loop in the same way
that Siri and speech recognition in general
has a really hard time with kids
because their voices are significantly higher pitched.
And it is true that shouting at the devices works better.
Of course, that's going to be more clipped.
So I think that gets into another factor,
which is the quality of the microphones on the device that you're using. So if you're using Siri very close to your mouth, she's going to do a better job of understanding you. Longer distances, in my experience with Siri, she is nowhere near as good at.
And not even like understanding what I'm saying, but just even hearing, hey, Siri.
Well, that's going to be definitely true because Siri has one microphone, maybe two, and Alexa's got five or something.
And so she can do beamforming and be able to hear things far more clearly by augmenting the signal from the different microphones.
Right.
That's why the Echo Dots can put a light on that shows where it thinks you're coming from.
Oh, I've not noticed that, but that's really cool.
Are the underlying technologies between these really different?
I mean, as far as I understand, at the very lowest level,
you come in through the microphone, you do some signal processing,
you grade it into pauses, which is where I probably fail,
and then you try to build up the phonemes
from the frequency information.
Once you have phonemes, then you can start to build words
and then words and then sentences and you do some matching.
Is that all the same for all of them or do they do special stuff?
I would say at this point,
it's going to be very similar across all the platforms.
And the only thing that's going to vary is the data set that they're using.
Google had their Google Voice product that they used to collect all of this data,
speech data to train so that they could transcribe voicemails.
And then that's now powering their speech recognition.
I think for the speech recognition, it really is going to be very similar
and there aren't going to be major differences. Moving forward, where the differences are going to emerge, or the differences in quality, I should say, is in the understanding of what you mean with your words. So the voice applications that we use, whether it's Siri or Alexa,
there's the speech recognition piece, and then there's the natural language understanding piece.
And that's taking the text that it's transcribed from the speech recognition and trying to assign
meaning or intent to those phrases. And like the speech recognition, the underlying technology relies on machine
learning and is similar, but it relies much more on the data, which we haven't really built up that
data set because with speech data, you just need people talking. But with the natural language
understanding, you need data from people booking flights, from interacting with calendars, from
whatever that action is. You need data of people talking and expressing the intent to complete that action.
So those are kind of, I think, in terms of the differences in experiences moving forward
or across Alexa and Google Assistant and Google Home and Siri,
it's going to be because of the amount of data they have for those specific intents.
And so that's like, if I want to make a meeting on January 12th, 2017, I can phrase that a
thousand different ways, whether it's meeting 9am January 20, whatever I said, or whether
it's book an appointment Jan 12 or Monday after next or
whatever, I have all of these different ways I can phrase that. They may all mean the same thing.
Is that the sort of thing you're talking about? That's exactly right. So with the calendar example,
there's the different kind of words you choose to use. So like, make a, you know, I make an appointment, make a calendar event, I need to go do this thing, I have a meeting. So there's that piece, and then
there's the pieces of information you need to complete that task of creating a calendar event.
So you need the date, and you need the time, and then you need like a label. And so not only can
you express the desire to create that calendar event differently,
you can not provide all the information you need.
So then you have to collect that data.
So the app would have to ask you,
what time on Saturday would you like your event
to be scheduled for?
But you can also put those in all different orders.
So you could say the time before the date,
before the label, or the label,
then the time, and then the date.
And so all that complexity has to be solved for.
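To make that slot-collection complexity concrete, here is a minimal sketch in Python. The intent, slot names, and prompt wording are all hypothetical, not taken from any real skill; the point is just that the handler checks which pieces of information arrived, in whatever order, and asks for whatever is still missing.

```python
# Hypothetical "create calendar event" intent: the user may supply the
# date, time, and label in any order, or leave some of them out.
REQUIRED_SLOTS = {
    "date": "What day should the event be on?",
    "time": "What time should it start?",
    "label": "What should I call the event?",
}

def handle_create_event(filled_slots):
    """filled_slots maps slot name to value for whatever the user said."""
    for name, prompt in REQUIRED_SLOTS.items():
        if name not in filled_slots:
            # Ask for the first missing piece and wait for another turn.
            return {"speech": prompt, "done": False}
    speech = "Okay, I created {label} on {date} at {time}.".format(**filled_slots)
    return {"speech": speech, "done": True}

# "Make a meeting at 9 a.m." supplies only the time; the skill asks for the date next.
print(handle_create_event({"time": "9 a.m."}))
```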
And luckily, calendaring is one of these common tasks that they've been working on for years now,
so it's gotten better.
But that same complexity is there whether you're making a calendar event
or you're ordering a pizza or, um, you know, playing Jeopardy on the Jeopardy skill.
And that gets to another, kind of a third piece of this, beyond recognition
and natural language processing is kind of a mental model or state machine because it's one
thing to say take an action and have it respond and take that action but it's another to keep
context and keep a conversation going, right?
And that's sort of a new piece, I think,
that's still being explored.
Yeah, absolutely right.
So the amount of state that is managed
in current voice applications is very limited.
We're limited to these question and answer
or question and answer with a
clarification so like if you don't provide all the information you need the app will ask you for it
the next step is to be able to maintain your state in kind of a multi-step process so not
necessarily a longer conversation, but being able to, for example, cook through a recipe.
So there's the original, like you open your voice application and you search for a recipe. So
there's the search for a recipe state. Then there's the search results for that recipe
state. Then there's once you've selected the recipe, there's that state. And then there's
the step you are in that recipe as you're cooking through it
and the ingredients in the list. And so being able to track each of those states is definitely doable, but not really supported, or we haven't seen that yet in voice applications on, you know, the Amazon Echo.
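A rough sketch of the kind of state tracking Chris describes, with hypothetical state names; a real skill would typically keep something like this in per-session attributes rather than local variables, but the idea is the same.

```python
from enum import Enum, auto

# Hypothetical states for a voice recipe skill: searching, looking at
# results, having selected a recipe, and stepping through the cooking.
class RecipeState(Enum):
    SEARCHING = auto()
    RESULTS = auto()
    SELECTED = auto()
    COOKING = auto()

def handle_next(session, steps):
    """'Next' means different things depending on which state the user is in."""
    if session["state"] is RecipeState.RESULTS:
        return "Here is the next search result."
    if session["state"] is RecipeState.COOKING:
        session["step"] += 1
        if session["step"] < len(steps):
            return steps[session["step"]]
        return "That was the last step. Enjoy!"
    return "Try searching for a recipe first."

# The user is mid-recipe, so "next" advances the cooking step.
session = {"state": RecipeState.COOKING, "step": 0}
print(handle_next(session, ["Preheat the oven.", "Mix the dry ingredients.", "Bake for 30 minutes."]))
```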
And then beyond that is, like, a robust full conversation, where you can just go back and forth on whatever topic. And Amazon just announced a $2.5 million prize that 12 different academic teams are competing for, to try and create an app using Alexa's APIs that can hold a conversation for 20 minutes.
For 20 minutes?
Yeah, which I think is a pretty hard problem.
That's a lot of state.
I mean, there are some Alexa games that you can play
that have some state.
There's one that you battle
robots and it's sort of
fun, but
it is definitely one
of those, you could play
it on a board and it would be pretty trivially
easy to see where your
robot is and what it's done and
what the other robots are doing.
But it feels
a little conversational.
Like it's the dungeon master
and you're playing Dungeons and Dragons sort of thing.
I don't know how you'd get beyond that.
So yeah, I think there's a lot of challenges
as the user of these experiences
where with a visual application, like you said, if the board is in front of you, it's a lot easier for us to understand where we are and what all the actions we can take are.
And that discoverability of what you can do is, I think, one of the biggest problems with Siri on the iPhone, right?
Or Google Assistant or Cortana, these general virtual assistants
where you don't know what she can do
and you don't know how to say the right thing
to get her to do what you want her to do.
So the discoverability of those interactions is really, really hard.
I think Amazon has done a really good
job with this skill model, which people are used to applications on an iPhone or on websites on
the web. And so skills for voice, like you are limiting what the user can do to the very specific
context of that skill. And then because it's so limited, it's easier to kind of intuit
what you can do. So in like searching for recipes, if you're in a recipe app, chances are you can
search for a recipe. If you're in a shopping app, chances are you can search for a product,
get reviews for that product and purchase that product. So by limiting the context,
it makes it a lot easier, not only on the technology side and understanding what the user is saying, but on the user side for them to know what they can say.
And that is such a huge problem.
I mean, in Siri, I can ask what the tides are, which is something we go to the beach and I always want to know what the tides are now.
Is it going out? Is it coming in?
Christopher doesn't always like to get his feet wet, so I need to know how far we can go.
This is false. It's sort of false.
And so, if I go to Siri, there is one particular phrasing that
will lead me to the tides. All the other phrasings,
I mean, Siri, what are the tides? Nope. What are the tides today? Nope.
Is the tide coming in?
None of those work.
I have to find exactly the right phrasing.
And it changes.
And it does change, which makes me crazy.
Yeah.
So I don't know exactly how Siri's internal architecture is that
leads to that specific
outcome. But if that same
problem were to happen on
Alexa or in a specific
skill, so if you had the Tide skill for
Alexa and you said, Alexa
ask Tides
when the high tide is,
the whole
primary responsibility of the app developer for that skill
is to collect the thousand different ways
that you are going to ask for tide information
so that it will recognize what you say in your natural language
without having to think about the right way to say it
and then be able to respond.
Because like you just talked about,
if you say it and it doesn't understand,
you're going to have this expectation
that it doesn't work
or you're going to get really frustrated
because you know it does.
You know it can tell you the tide,
but you just can't communicate effectively.
And so that being able to understand
what the user said
is like the core problem
in voice computing right now
and is the number one responsibility
of the app developer
when they're building their app.
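As a concrete picture of that data-collection responsibility, here is what a small slice of it might look like for a hypothetical tides skill, shown as plain Python data roughly in the shape of an interaction model's sample utterances. Every phrasing and slot name here is invented for illustration.

```python
# A tiny slice of the "thousand different ways" a user might ask one
# question, all mapped to a single hypothetical intent. A real skill's
# interaction model would list many more, gathered from user testing.
TIDE_UTTERANCES = {
    "GetHighTideIntent": [
        "when is high tide",
        "when is the next high tide",
        "what time is high tide today",
        "is the tide coming in",
        "how high is the tide right now",
        "high tide at {beach} on {date}",  # {beach} and {date} would be slots
    ],
}
```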
And it's a real adoption problem too
because on paper,
if something works 90% of the time,
that sounds pretty good.
Oh, I can talk to my computer and it responds
and does what I say 90% of the time.
But the truth is, if something doesn't work
10% of the time, I think most people are just going to stop trying.
Yeah, it's a big problem. And I think it's things like Amazon and Google are trying to work on,
which is for that 10%, when it doesn't understand, there's kind of two possible reasons for that.
One is it just completely doesn't understand what you're saying.
And so like there's nothing it can do.
But then there's also the case where it does understand what you're saying.
It just doesn't have that functionality, right?
So for the tide example, if you said,
when's the high tide, you know, next month, next Tuesday at 8 p.m., it could understand that request but not be able to respond. So instead of not responding with that information, it could say, sorry, I don't know when the high tide is next week, or something like that.
So by the application responding with saying, hey, I hear you, I just can't help you, it helps to teach the user to be more experimental in the requests that they're making.
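A small sketch of the two failure modes Chris distinguishes, with invented intent names and wording: one branch for a request the skill understands but cannot fulfill yet, and a generic fallback for a request it cannot interpret at all.

```python
def respond(intent, slots):
    # Hypothetical tide skill that only knows today's tides.
    if intent == "GetHighTideIntent":
        if slots.get("date") not in (None, "today"):
            # Understood, but unsupported: say so specifically, so the user
            # learns the skill heard them and simply lacks the feature.
            return "Sorry, I can only look up today's tides so far."
        return "High tide today is at 3:42 p.m."
    # Not understood at all: a generic fallback that hints at what does work.
    return "Sorry, I didn't catch that. You can ask me about today's tides."
```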
How important is it to have that be a human interaction? I mean, you were saying,
we want the app to be able to say,
I can't do that.
And I was thinking, I'm sorry, Dave.
I'm afraid I can't do that.
Would be hilarious to me once, maybe twice.
But as you design these interactions,
how much do you have it be efficient
and how much do you have it be personable?
I have very strong opinions on this,
which is to say you should avoid personality
as much as possible
unless you have a really, really, really good reason
to have personality.
Because it gets really annoying. Personality is cute when they, like, did something right for you, but personality makes you want to throw your phone against the wall when it doesn't do what you want it to.
And it even gets annoying, like, and I hate to bash on Siri, but every single thing I ask her to do,
there's some cute turn of phrase in the response.
If it's cold outside, she'll go,
or when it's raining outside, she'll be like,
don't forget to bring an umbrella.
And it just gets really tedious after a while where we don't expect our computers to have personality
when we go do a Google search.
We just want the information that we're asking for.
So until we go from 90% accurate to 98% accurate or higher,
I think personality is only going to get in the way
of a good user experience.
I can see that.
It would be fun for a little while, but I just want it to work.
Spend less time messing around, making neat features, and more time.
Right.
And this kind of gets back to the jokes, which is like, it's great that you can tell me jokes,
but I'd much rather you be able to tell me what my latest email is
or what my latest text message was.
I can see that,
although that's not how I use it
because if I'm in the kitchen,
I don't really want to play with my email.
I do that all the other times.
And you're not the only one.
I think the majority of people
who have the Amazon Echo or the Google Home have them in or near their kitchen.
So if not email, it would be great if she could help you cook through recipes or order groceries or order food from Uber Eats or Grubhub or something or make a restaurant reservation.
And I do think that we haven't used the shopping list yet, but I do think that could be useful.
Because you're standing there and you just ran out of oatmeal.
Okay, Alexa, add oatmeal to my shopping list would be useful.
Christopher's shaking his head like you haven't all unplugged them by now.
So that's what you do.
You actually help people design these conversational methodologies.
Yeah. So we have a tool called TinCan.ai, T-I-N-C-A-N dot A-I, which does two things. One is it helps you
prototype a voice application. So right now it's really hard to go through the design process with voice apps.
You know, wireframing doesn't really work for non-visual experiences. Like you can draw the
outline of what your mobile app is going to look like, and you get a good sense of how that app is
going to behave. But you can't do that with a voice app. You can say, like, create sample scripts,
like the app says this, and the user says this,
and the app says this, but that doesn't,
it's only so helpful,
and you can't do user testing with that.
So that's the point of this prototyping tool
is for designers and developers
to very quickly be able to mock up
how their voice app is going to behave
and then put that in front of users
to do user testing.
And the reason that's so important, which is what we've been talking about this whole
time, is so you can collect the data of how people are going to interact with your skill.
What are the ways that they're going to phrase the questions that you expect them to ask?
Because there's just so many different ways.
And so we help you to collect that data so that when you do release your skill, it's going
to be more responsive to all the different ways that people are going to talk to it.
So does it end up being a natural language processing problem where you have this giant
tree of options and it's one of those and you just have to build the tree?
So, yes and no.
So it is very much a natural language processing problem.
But the way that, the limitations of voice computing right now
is there's not really a tree.
There's a set of actions a user can take called intents,
and all of those intents are active at the same time.
So if a user is using your skill and says something to activate that intent, that intent
will be activated. And then it'll call a function for your app to take an action against that.
And that action could be to look up the low tide time and speak it back to the user. Or it could be
to look up the low tide time, oh, they
didn't give me a location. So to ask the user for that location and then speak them the time. And so
you need all of the data to be able to recognize when a user has expressed that intent. And then you also have to specify the different types of data that you're going to receive from that voice request.
So like a date and time, a physical location, a proper noun, a phone number, and things like that.
And so there's some overlap between applications, because anything having to do with a date and time can come in all of the ways a calendar meeting can be requested. So services like Microsoft's LUIS or Facebook's Wit.ai or Google's API.ai will have not only built-in intents for common interactions,
like confirmations, like yes, no, cancel, go back,
those types of things,
but also the common data types like calendars
and phone numbers and things.
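Roughly what that intent-plus-slot-type declaration looks like, shown as Python data. The custom intent and slot names are hypothetical; built-in types and intents along the lines of AMAZON.DATE, AMAZON.TIME, and AMAZON.YesIntent are the sort of thing the Alexa Skills Kit and similar platforms provide, though the exact formats vary by platform and have changed since this episode.

```python
# Sketch of an interaction-model fragment: each intent declares the typed
# pieces of data (slots) it expects.
INTENT_SCHEMA = {
    "intents": [
        {
            "intent": "CreateEventIntent",
            "slots": [
                {"name": "date", "type": "AMAZON.DATE"},
                {"name": "time", "type": "AMAZON.TIME"},
                {"name": "label", "type": "EVENT_LABEL"},  # custom slot type
            ],
        },
        {"intent": "AMAZON.YesIntent"},     # built-in confirmation intent
        {"intent": "AMAZON.CancelIntent"},  # built-in cancel intent
    ],
}
```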
It feels like you're building a whole new programming language.
It is very, it is a very new way of doing things.
That wasn't a yes or no, but okay.
So you're relying on the same programming languages. Like, you can build an Alexa skill in whatever language you want,
JavaScript, Java, anything that can run on a server and you can connect to Amazon over HTTP.
And it's just that the architecture is different
and relies so heavily on speech recognition
and on natural language processing.
So the requirements of implementing natural language processing
have a lot of nuances very specific to that process
that we haven't really had to deal with before.
But it's not really coding.
It's just data collection and design.
Okay.
I often see many problems as coding.
So the grammar generation seems like the same sort of grammar generation you would use if you were building another language. But yes, I see many things as coding problems.
So what sort of tactical advice,
like things that if I was building an interface,
what are the first five things you tell people to look out for?
This variability has got to be one of them, but what else?
So is it okay if I answer the steps that I would take in thinking through designing a voice app?
Sure.
Yeah.
So I would be...
The first step is to clearly define the actions that a user can take.
And so let's go with this example of the tide app. So if you want to be able to report to users
when the high and low tides are for a given day
and for a given beach,
that means they're going to have to,
they're going to ask for the high tide,
they're going to ask for the low tide,
they're going to ask for the tide table for that day,
they're going to ask for a specific beach,
they're going to ask for a specific location,
and they may even ask qualitative questions like, is today a good day to go surfing? And so in the first step,
it's to think of all the different actions a user is going to want to take. And the second step is
to limit what you are going to support to a very concrete set of actions.
So in the first version, you may want to limit it to just the tide tables and put off qualitative questions on what that means for different water activities.
Then step two is to think through all of the ways that people are going to express those different actions and the different ways they're going to form those questions.
And then step three is to, once you've come up with everything that you can think of, then you have to go ask other people or do testing with other people to see how they're going to express those same
questions.
Because no matter how many you can think of on your own,
what other people are going to say is going to be dramatically different.
So you have to be able to,
to account for that.
Like,
like is the tide coming in would have been one of my questions.
Right.
Exactly.
Which is totally solvable with the data set
you have already.
But it wasn't in your list.
So yeah, you have to ask what other people think
because you're not going to get all the answers yourself.
Go on.
Yeah, that's exactly right.
And then once you have that, it's a very
pretty trivial
coding problem.
Alexa skills as they exist today
are just that when you talk to
Alexa, the Echo will tell you
when an intent is activated, it'll
call a specific function
and that function will just need to
access the data that you have
and then you
speak that data back to the user.
And you generate
the response.
And that is just a complicated way of saying,
writing out the text that the app is going to say.
And then that's really it.
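A minimal sketch of that flow as an AWS Lambda-style handler in Python, with an invented intent name and a placeholder data lookup. The real Alexa request and response JSON has more fields than shown here, but the shape is roughly this: the platform tells you which intent fired, you fetch your data, and you hand back the text to speak.

```python
def lambda_handler(event, context):
    """Entry point Alexa calls (via AWS Lambda) with the recognized intent."""
    request = event["request"]
    if (request["type"] == "IntentRequest"
            and request["intent"]["name"] == "GetHighTideIntent"):
        speech = look_up_high_tide()  # your own data access goes here
    else:
        speech = "Sorry, I can only report tide times."
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }

def look_up_high_tide():
    # Placeholder for hitting a tide API or a local table.
    return "High tide is at 3:42 p.m. today."
```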
That sounds so easy.
And that's the thing,
is it really isn't rocket science.
It's just, I think, where mobile and web development is, you know, there's
considerable amounts of software development involved. And then design kind of goes hand
in hand with that. And it makes the product better. Voice is significantly less complicated
on the software engineering side, but relies significantly more on good design
and the design process and doing user testing to collect the data that's going to power
that voice application. That makes a lot of sense. If you're getting to 90%
success rate and all you need to do is listen to a few more things your users are asking for
and figure out how to map them into the functions you already have to get to 96 percent, that seems so worth it, right?
Right. It absolutely is. And it kind of is similar to traditional application development
with analytics and seeing how people are interacting
with your application
and then focusing on the parts of the application
where people aren't doing well.
But where traditionally you have to come up
with your own ways of making that process easier
for the user.
With voice, you know what they're saying.
You know what they're trying to do that you don't understand.
So all you have to do is retrain on that expression
you didn't understand or add the features to support
that expression that the user requested
that you couldn't respond to before.
So just by listening to the user
becomes so much more important,
but it's also so much easier with voice.
So before we get off the topic of general voice,
speech, natural language processing,
and application design,
one thing you haven't mentioned is localization
and supporting multiple languages.
That seems like a really big problem.
It's the same problem, though.
You have to go from what's input.
Is it, though?
Because you have to repeat what we're saying about collecting ways people could ask questions
or signify their intent.
That can be quite different in different languages.
You can have idioms.
You're exactly right.
You're exactly right.
Where, you know, historically,
localization is just translating your website
or your app into another language.
With voice, not only do you have to do that,
but you have to do this data collection
in those other languages.
So, you know, Amazon Alexa is available in the US, the UK, and Germany. And so for the German version,
you have to translate your skill to speak back in German. But then you also have to get the
user utterances or the expressions that people are going to use and speak to your app in German.
And then also in British English, because idioms, like you said, are going to be different.
So the localization problem is greater than it has been before, but is the same process of building the voice app in the first place. You just have to repeat it for every
language that you're going to be supporting.
That's sort of like how you have to do that, but for numbers. When you have an embedded system and you want to output numbers, they're all different.
Yeah, it's, you know, it's a repeat of a large data set collection. It's just not quite the same as shoving an Excel spreadsheet to a translation company and having it come back.
That never works as well as people think anyway.
Well.
Okay, so I want to move on to a different topic.
We've been talking about these applications that go to the net.
And I have to admit, I don't really like it.
One of the reasons I didn't get an Echo sooner
was because I really don't like the fact that it
goes out to the net. I have privacy issues with it and I have issues with both Siri and the Echo
failing to get to the network for some stupid reason, even though everything else in the house
is fine. How long are we going to be stuck going out to the net to do things? When can we do it here?
That is a good question. I think the limiting factor right now is the speech recognition.
It's still very processor-intensive to convert audio into text
and running the machine learning algorithms and classifiers on that data
to the point where it requires offloading to the internet.
But once you have that text, the natural language processing, on the other hand,
is much less processor-intensive.
The training itself is very processor intensive, and that takes lots of GPUs running for hours to get right.
But once that model is trained, the actual classification is much more straightforward. So it's not...
Speech recognition is one thing,
and I think we're getting there.
So hopefully in the next couple of years,
we'll be able to do on-device speech recognition
over an open dictionary.
So right now you can do it with a fixed number of words
in the tens to hundreds, so less than 500.
But hopefully in the next couple of years, we'll get to the place where you can do open speech recognition, where you can just listen for anything and get the text of that. That's kind of the limiting factor, at least with respect to speech input.
I don't really know.
I don't have a sense of what the text-to-speech requirements would be.
Because all of those, oh no, you can do that on device now.
I'm sorry.
Yes, definitely.
Yes.
Yeah.
So, you know, text-to-speech you can do on device.
And the natural language processing you can do on device,
but there's no way,
there's no services that let you do that right now.
So you would have to build your own, essentially,
which is a lot more complicated than it should be.
So technically it's not a problem,
it's just that it isn't available right now.
And then I still think it's a couple years away before we can do speech recognition on device.
There are some dev kits out there.
There's the EasyVR, which is easy voice recognition on SparkFun.
It's a $50 Arduino shield.
And it's got a number of commands that are built in.
There are things like robot stop, robot go, robot turn.
And it's supposed to work pretty well.
And you can add things to it.
It also supports five languages, including US, so that would be six languages.
But it only supports, as you said, a very small dictionary. You can train it for a few more,
but it's not going to be long sentences. It's not going to be, tell me a joke. It's going to be,
joke now, joke cat, joke shark. Yeah, really, this is all I use her for. And then there's another one,
a $20 one from Seeed Studio, the Grove Speech Recognizer. It's Cortex M0 Plus based,
and it has 22 commands, which is pretty cool. But again, that's not a dictionary.
And if you think about the Cortex M0 Plus and how small it is and how efficient it is,
that has a lot of goodness, but I suspect they're pushing that as hard as it can go.
And they're still only getting 22 commands. If they put, if they added flash or whatever they
needed to, or RAM probably to do more, they would need to have a bigger processor because
you have to compare all of these phonemes against your, your signal processor gives
you phonemes and then the phonemes you have to compare to your dictionary of phonemes.
And that process is very intensive and takes a while, and you want your device to be able to respond very quickly.
It's one of the things that all the ones we've been talking about do: respond to you almost immediately, unless they've gone off to the net and fallen down. But when they work, it's a snappy back and forth.
Let's see what else we have.
Oh, I have... Radio Shack has been doing voice recognition or speech recognition chips since 1980.
Well, I don't think they're still doing it.
And how well is that working out?
I don't remember.
I think it worked.
How many commands did it have?
You know, a dozen or something.
I don't remember what it was,
but it was a little masked ROM with a tiny microprocessor.
It was like 10 bucks in an IC.
It was for the robot kind of thing, like you were saying.
But they have to be pretty separate words.
They have to sound really different, you mean?
On and off are terrible.
Start and stop are not great because those are
both words that sort of sound like each other. If you say... well, there's... yeah, you want to do things that sound different. Open and closed sound very different. That's a good thing.
But for natural language processing, it's not a good thing.
Right.
I was going to say, I'm not as familiar with the boards that are out there,
but on the software side, there's an open source on-device speech recognition library
from Carnegie Mellon called PocketSphinx.
And it is an on-device speech recognition engine
that, again, has limited dictionaries,
but is much higher numbers that it will support,
I think in the hundreds, but it's going to require a full
Linux computer on a chip.
So maybe you could run it on the new chip, like $8 PC or something like that.
Yes.
But yeah, that's another thing
if people are interested in
that might be worth exploring.
Their links are all broken, darn it.
I will put a link in the show notes
that actually works instead of inside.
Never mind.
I'll send you good links on that
if you want to check the show notes
for PocketSphinx.
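If you want to try CMU PocketSphinx from Python, something along these lines is a common starting point, assuming the pocketsphinx package, its default acoustic model, and a working microphone are installed. The exact API differs between package versions, so treat this as a sketch rather than a recipe.

```python
# Continuous on-device recognition with CMU PocketSphinx (a sketch; it
# assumes the pocketsphinx Python package, its bundled English model, and
# a microphone, and the API has changed between package versions).
from pocketsphinx import LiveSpeech

# With the default language model this attempts open-dictionary recognition,
# but accuracy is far below the cloud services discussed above; an embedded
# project would usually restrict it to a small keyword list instead.
for phrase in LiveSpeech():
    print(phrase)
```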
And there was another group, Audeme.
They were on the Amp Hour, which is sort of our...
Brothers, I almost said sister show, but that doesn't work exactly.
We'll go with brother, cousin, sworn enemy show.
Second cousin show.
And they talked a lot about the technology of putting it together. But that's a $70 Linux-based part that, wow, if you want to... And you know that's retail, I mean, it's maker retail, so assume that you could actually get those parts for 10 to 20 dollars and put them in your device and it would all work. But that means your cost has to go way up, and it starts to make sense why we aren't doing these locally.
But someday, I mean, Moore's law works for us, then that will be cheaper soon.
And in 10 years, we'll just be walking around talking to ourselves without cell phones.
Yeah, I mean, I think we're definitely going to be walking around talking to ourselves very soon.
It's just definitely going to rely on the internet, and a company is going to be able to record all of that.
You're not helping my privacy concerns here. So I think the privacy concerns are very valid
and it's a very,
probably the technology issues aside,
privacy concerns are the biggest issue
facing a wider adoption
of really, really interesting things
that we can do with voice
because of just companies using that data and knowing everything
about you and then governments doing the same in this country and others.
And I think people are going to have a very justified negative response if we push that
too far forward. So I think in order to kind of see the full potential
of what we can do with voice computing,
we're going to need more privacy protections
like written into law.
And I don't know when that's going to ever happen.
Well, let's not go further down that path today.
Okay, okay.
But I agree, we're going to have to sort this out because it's all there.
Different companies have different attitudes towards it.
I mean, Apple, at least consistently up to now, has been very, you know, privacy is one of our product features.
And so, you know, I think they would be pushing hard
to get stuff processing locally,
and then it's just a matter of,
okay, you're doing a web search,
or you have text going up to the web.
Or text going to your apps.
But it's not consistently listening
to everything you're saying
and potentially having a privacy hole there.
But it's going to be interesting to see
if consumers make that something that they want, I guess.
Because if they don't,
then it's kind of incumbent on companies to force it on them.
That's, no, we're going to do this.
It's going to take us a little longer to make this truly safe for you.
Can you be patient?
Right?
All right, let's move on. We were connected through the O'Reilly Design Conference. You'll be speaking there next spring. What are you going to talk about?
Yeah, I'm going to be running a workshop on designing and prototyping voice-based applications.
So I think we're going to go in more depth
into the topics that we covered today,
like what are the underlying technologies
that are making this wave of voice computing possible
and what does that mean for what you can do
as a voice app developer and designer?
And then what are the design considerations
and best practices in voice user interface design?
And then we're actually going to sit down and prototype a voice app together using the prototyping tool that we've created.
And then we'll do user testing with other people in the workshop.
And hopefully, I'm really excited to see what people come up with.
Do you think people will bring their apps and say,
how do I do this?
I hope so.
I think that'll make for a much more interesting workshop.
I think I'm curious to see,
like one of the big things that I want to talk about is,
does it make sense for your use case
for there to be a voice application?
Because not everything, at least today, should have a voice app because of the limitations and
just what makes sense, right? I don't think people are going to be doing a lot of clothing shopping
on their Amazon Echo. And yet she tries. What apps are really good candidates? I mean, jokes aside, what should we be looking for soon? that should be asked is given the context that the user is in, like their physical environment
and what they're doing, does voice provide a more efficient means of completing that task
than other possibilities? And that's kind of where Echo has done a really good job
is in the home, most people don't have their phones on them. So having a device that is just
there that you can talk to is going to be a better experience than having to go find your phone
and then pick it up and complete that task. Especially for interacting with your connected
home. It's a lot easier to say, Alexa, turn on the lights than it is to either get up off the
couch and go hit the light switch, or pull out your phone from your pocket and open up your smart home app.
You should totally have seen Chris
try to turn on the lights recently. It took him like
12 tries. It would have been
so much easier to go stand over there.
But yes, sorry.
No, I think there's an
entire other episode that can
be done on just how frustrating
the smart home is
as a consumer.
But yeah, so that's the overarching question is,
does voice provide a more efficient means
of completing a task than other alternatives?
And then you have to ask that for your specific use case,
but also for the device you're building for.
And right now, most people who are thinking about voice,
I think are thinking about it on the Amazon Echo and the Google Home, when that opens up to developers later this month. And things that are related to food or cooking or kind of the family group experience are going to lend
themselves much better in the short term to voice computing. So anything with recipes,
with food ordering, with restaurant reservations and food search, those types of tasks I think are
very well suited right now. It could change channels on the TV too. That would be nice. It can if you have a Logitech Harmony.
And that's one of my next purchases
because I would really like to set that up.
Which clearly we do not.
Behind the time on technology.
It's clearly a business expense.
Okay, so what apps don't make good candidates
for voice recognition?
You mentioned clothing
and I could see how like some of the picture only
like Pinterest probably wouldn't be that useful.
Right.
Any things that are very visual,
at least right now are not going to make sense.
You know, there are whispers that Amazon's coming out
with an Echo with a touchscreen.
So that might be better, but still probably not going to be better
than just using it on your phone.
So anything heavily visual, anything where you are kind of communicating
very sensitive information, so not necessarily secretive,
but that also counts.
So like credit card information or like your address, things
that if you get that piece of information wrong, just the experience of trying to do that through
speech is going to be very poor and not great. So I would avoid anything that has to rely on that.
Yeah, there's some phone tree applications that want you to say your social security number.
And I always like, you have no idea where I am. I could be in a coffee shop. This isn't the sort
of thing you want me to do out loud. Right, right. I think that kind of gets to one of the other
kind of things to keep in mind with the Echo is it's a public device in your home. And unless
you live by yourself, other people are going to either overhear the conversation that you're
having or be able to access that same data. And so we talked about email a little earlier,
but that's, I think, an example of something that people might not be comfortable doing on this public or shared device.
So anything with very personal information
like mental health or email or journaling
or other things of that nature
I don't think are good fits
for voice computing as it exists today.
I think that'll change very quickly,
but right now, probably not.
That makes sense.
There are a lot of things that I could see
in that whole stuff you don't tell everybody all the time bucket.
Right.
Cool.
Well, it does sound like it's going to be an interesting tutorial. Let's see.
You care about voice recognition a lot, clearly, and you, or speech recognition,
and how it's designed. And you mentioned that you use a screen reader. What is that like?
Extremely, extremely frustrating.
So the reason I got into this whole space is because five years ago, I found out that I was going blind.
I was diagnosed with a genetic disorder where I'm losing my central vision.
And as my vision has gotten worse over that period of time, I've gotten very familiar with the assistive technology that's out there and specifically the screen reader. And I'm very disappointed with
how they work and the experience of using that product. And for those of you who aren't familiar,
a screen reader is a device for the visually impaired or a software for the visually impaired where there's a cursor on the screen that you can move using keyboard
commands on a laptop or swipe gestures on a smartphone. And whatever that cursor highlights,
that text is read aloud or the metadata for that text is read aloud. So you're navigating this two-dimensional visual experience
in a one-dimensional audio stream using keyboard commands.
And it's very cumbersome.
And that's only when it works well.
A lot of websites and a lot of mobile applications
don't take the steps necessary to make their products accessible.
And so this creates a really poor experience for the vast majority of the blind community,
not to mention the fact that most people are losing their vision from aging-related disorders.
So they're very unlikely to know how to do email, let alone use email with the screen
reader and other assistive technology.
So voice computing represents this much better way of doing things where these applications are designed for audio first, and they're going to be a much more intuitive experience
for someone who's less tech savvy.
And conversation is something that we've all been doing since we've been able to
talk. So it's much more comfortable. So I'm really optimistic and really hopeful that in the near
future, we're going to be able to do everything we can do on our smartphones through voice instead.
So you didn't, as we were talking about apps that weren't necessarily good candidates,
you didn't mention things you want to do fast. And as I think about screen readers and what,
how hard it is to navigate a website, I imagine that that is incredibly slow and that these
speech conversational things are also pretty slow. And you said you listen to
podcasts really fast. How do we get these to not be so slow? I've always found in meetings, I'm like,
can't we just do this over IM where we can all type faster?
Yeah. So I think a big part is just letting you increase the speech rate of your voice service.
And so that's a standard feature in screen readers that I would love to see on Alexa or Siri or Google Assistant.
And that would help make these interactions more efficient.
Alexa, talk faster.
Will that work? No, it won't.
It doesn't work. I was waiting to see what she would say back to you.
Oh no, she's in a different room.
Okay. No, I wish. I wish.
So that's one thing that we can do, but again, that's more for power users who are used to it.
I think other things that we can do to make the experiences more efficient are just being
more responsive to more natural language commands, rather than having to force you to go through a
sequence of very specific commands. One, and then two, allow you to make multiple requests in the same kind of
interaction.
I'm trying to think of a good example
of that.
Set two
timers, one for five minutes and one for
ten minutes. That's not a great one, but
an idea of that would be more efficient
than saying Alexa set a timer for five minutes,
Alexa set a timer for ten minutes.
And there's also UI things that you can do
to make the experience more efficient
that I think will help kind of move us in that direction.
But that's difficult.
I mean, the timers is not too difficult
because you're going to the same app in the end.
But if I wanted to say,
put eggs on my grocery list
and remind me to go to the grocery
before I go home,
those are two separate apps.
And so you have to have enough
upper level natural language processing
to split it between them.
Yeah, yeah.
You're exactly right. And I think that's
going to be a challenge for these platforms
and the ones who can
do it the best are
going to be the ones that succeed.
And hopefully we see
that sooner rather than later.
But you're right.
It's navigating between applications, but
it's also carrying
on a conversation with the same application.
Right now in Alexa, your skill is only active for like 16 seconds.
So if you're not carrying on that conversation continuously,
you have to re-invoke that application and start over every time you talk to it.
So like with the recipe example,
if you're stepping through the steps in a recipe,
you'd have to say, Alexa, ask the recipe app
what the next step in the recipe is every time.
And that's less efficient than just saying,
Alexa, what's the next step?
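In the Alexa response format, whether the skill stays listening for a follow-up turn comes down to a flag on the response. A hedged sketch, reusing the hypothetical recipe skill from earlier:

```python
def next_step_response(step_text, more_steps_remaining):
    # shouldEndSession=False keeps the skill's session open so the user can
    # just say "next" again (within the platform's timeout) instead of
    # re-invoking the skill with "Alexa, ask the recipe app..." every time.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": step_text},
            "shouldEndSession": not more_steps_remaining,
        },
    }
```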
Well, there's, I mean, that robot game
lets you stay in it until you exit it or die.
So that's coming, right?
You don't know what the robot game is and I don't remember the name.
So that's just not good information, is it?
I hope, I would imagine that they're improving that.
Like it just makes so much sense. So hopefully that's something that's happened
and we'll continue to see improvements in that direction.
And your recipe app is exactly what we need here.
Because once you can do that,
once you can step through something
that's a fairly linear process,
then you can start stepping through things that aren't as linear. And even the linear process of having a recipe, at some point you're going to want to have a timer, and it builds on itself. And it does seem like someday we will get to conversation.
This whole thing reminds me of those old Infocom games.
Yes.
From the 80s, those text adventures.
Zork!
And yeah, it's the same kind of problem of anticipating any kind of input.
Where for years we've had, okay, there's a mouse pointer and a keyboard.
And the keyboard has a fixed set of key commands and the mouse points exactly where you want it to.
And now it's all, it's a completely different way of thinking about applications,
and I think that's what's hard for people.
Yeah, we're definitely in a transition phase.
And as we learn best practices, not only in the design,
but in the architecture side of things,
we're going to start seeing better and better applications.
Well, that sounds like it's a good place to start wrapping things up.
Chris, do you have any last questions?
Yeah, I wanted to ask, since a lot of people who listen to the show are tinkerers
and electrical engineering people and software people who make projects of their own
or might be exploring little Internet of Things projects.
What would you suggest if somebody said,
hey, you know, I'd really like to hook voice recognition,
speech recognition into my project or product.
I don't know where to start.
It wouldn't be making their own voice recognition system, but it would probably be hooking into Amazon Echo's ecosystem or something.
So where would somebody start to look at that?
Yeah, I think the Echo is a great place to start.
I think it's the most open right now,
and there's a lot of support for kind of hooking it up
to more project-type things with If This Then That and other services
that you can use to link up voice commands in your skill
to different devices in your home
or projects you may be working on.
So I would definitely go to Alexa's developer pages,
which I don't remember the URL off the top of my head.
There's also a pretty active developer forum there,
which has a lot of tinkerers
and kind of more hardware maker type communities of people.
And there's also a Slack channel that they have.
So there's definitely a large developer community there.
And that's probably where I would start.
Alexa Skills Kit.
It's developer.amazon.com.
And there will be a link in the show notes.
All right.
That sounds like fun, actually.
Think of all the things you can make people do
if you can interact with them voice-wise.
Make people do?
You're going to take over people now?
Well, think of all the things you can make them say.
It would just be funny to have a game where you had to guess the word
and then I made you say some quote or some stupid thing.
Reverse double plus upside down social engineering people.
Yeah, you know.
Got it.
You're going to write the code to make the people do things, not the other way around.
We're going to have voice recognition change how people speak instead of just recognizing things differently.
It's going to have to change how I speak.
Well, Chris, thank you so much for being with us. Do you have any last thoughts you'd like to leave us with?
No, just thank you so much for having me. If you're interested in voice computing, especially voice design, definitely check out our prototyping tool. It's a great place to start playing around with designing voice experiences. And again, that's at TinCan.ai.
Our guest has been Chris Maury, the founder of Conversant Labs, a company providing design
and development tools to help create fully conversational applications for iOS and the
Amazon Echo.
I'd like to send a special thank you out to O'Reilly's Nina Cavanis for hooking me up
with Chris.
Their design conference is in San Francisco in March, March 2017.
Wow, that's coming fast.
Thank you also to Christopher for producing and co-hosting.
And of course, thank you for listening.
Hit the contact link on Embedded FM if you'd like to say hello,
sign up for the newsletter, subscribe to the YouTube channel,
enter the Tinkercad contest,
and don't forget, Alexa, play Embedded FM.
All right, final thought, final thought.
Let's see.
How about one from Helen Keller?
Be of good cheer.
Do not think of today's failures, but of the successes that may come tomorrow.
You've set yourself a difficult task, but you will succeed if you persevere.
And you will find a joy in overcoming obstacles.
Remember, no effort that we make to attain something beautiful is ever lost.
Embedded is an independently produced radio show that focuses on the many aspects of engineering.
It is a production of Logical Elegance, an embedded software consulting company in California.
If there are advertisements in the show, we did not put them there and do not receive money from them.
At this time, our sponsors are Logical Elegance and listeners like you.