Embedded - 178: Alexa Stop
Episode Date: December 7, 2016

We spoke with Chris Maury (@CMaury) about using speech recognition to interact with devices. Note: Please turn off your Echo and Dots, as we invoke Alexa a lot.

Chris is the founder of Conversant Labs. They created TinCan.ai, which can help you wireframe or prototype a conversational user interface. They can also help you build Alexa Skills, though if you are so inclined, you might try it for yourself: Alexa Skills Kit.

Chris will be speaking at the O'Reilly Design Conference in San Francisco, CA in March 2017, giving a tutorial on building voice-based user interfaces. You can read more from Chris on his Medium posts: medium.com/@CMaury.

CMU PocketSphinx

Some of the embedded devices Elecia mentioned: Audeme (as heard on The Amp Hour #258), Grove Speech Recognizer from Seeed, EasyVR.

We haven't gotten embedded.fm (or any podcast) to work with Alexa, but we aren't sure why. Have you?
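For anyone who wants to try the Alexa Skills Kit themselves, here is a minimal sketch of a custom skill's request handler. It follows the general shape of ASK's JSON request/response format, but the intent name and the joke text are invented for illustration; a real skill would be registered through the Alexa developer console.

```python
def handle_request(event):
    """Build a response dict for a simplified Alexa-style request.

    The dict shapes loosely follow the Alexa Skills Kit JSON
    interface; "TellJokeIntent" is a hypothetical intent name.
    """
    req = event.get("request", {})
    if req.get("type") == "LaunchRequest":
        text = "Welcome to the example skill."
    elif req.get("type") == "IntentRequest":
        intent = req.get("intent", {}).get("name", "")
        if intent == "TellJokeIntent":
            text = "What do you call a fake noodle? An impasta."
        else:
            text = "Sorry, I don't know that one."
    else:
        text = "Goodbye."
    # Speak the text and end the session.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }
```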
 Transcript
                                         Hello. Before we even start the show, I have to warn you that if you have an Amazon Echo or Dot device nearby, please unplug it.
                                         
                                         Or unplug it after the first 10 minutes, because we're going to say her name a lot.
                                         
                                         And yeah, if you don't want her to tell you jokes or shark facts or whatever it is we ask her to do during the show, now is the time to
                                         
                                         unplug your Amazon Echo device.
                                         
                                         Hello and welcome to Embedded.
                                         
I'm Elecia White alongside Christopher White.
                                         
Our guest this week is Chris Maury, who is going to tell us about speech recognition.
                                         
                                         Before we get started, we will be announcing the Tinker Kit winner next week.
                                         
    
                                         If you enter by December 9th, it might be you.
                                         
                                         Hi, Chris. Welcome to the show.
                                         
Hi, Chris. Hi, Elecia. Thanks for having me.
                                         
                                         Could you tell us about yourself?
                                         
Yeah. So I'm the founder of a company called Conversant Labs, where we build tools for designers and developers who are interested in building voice-based applications, whether that's for smartphones or the Amazon Echo, or really in general.
                                         
We're really excited about voice and just making it as easy as possible to get that done.

Excellent.
                                         
                                         And of course, we have many, many questions about that.
                                         
                                         But before that, we usually do this thing called lightning round where we ask you short questions and we hope for short answers.
                                         
    
                                         And if we're behaving ourselves, we don't ask you why and how.
                                         
Or, "Could you tell us more?"
                                         
                                         So you ready?
                                         
                                         I'll do my best.
                                         
                                         Donuts, bagels, or other ring-shaped breakfast treats?
                                         
                                         Bagels.
                                         
                                         When you do a design, should it be for new users and ease of use,
                                         
                                         or should it be for great flexibility for experienced users?
                                         
    
                                         Oh, oh man.
                                         
                                         This is sort of Notepad versus Emacs.
                                         
I see.
                                         
                                         Well, I would definitely choose Notepad over Emacs any day.
                                         
                                         I think for the first step, it should be whatever gets your idea recorded as easily and quickly as possible,
                                         
                                         where you don't lose the thought or the design.
                                         
Related: form or functionality?
                                         
                                         Oh, functionality.
                                         
    
                                         100%.
                                         
                                         Least favorite planet?
                                         
                                         Least favorite planet?
                                         
                                         I don't have one.
                                         
                                         I know that's sad.
                                         
                                         I did go to space camp in fourth grade,
                                         
                                         so I should have an answer to this.
                                         
We'll say, um, Venus.

Should we bring back the dinosaurs?

In any context, yes.

Favorite fictional robot?

Oh, this is the question I couldn't come up with an answer to. I know, I'm just ruining the premise of lightning round.

It's such a good question, though, and I'm very upset.
                                         
Oh, Data from Star Trek: The Next Generation.
                                         
                                         Do you listen to podcasts?
                                         
                                         Yes.
                                         
    
                                         What is your favorite,
                                         
                                         this show excluded?
                                         
                                         I don't think he listens.
                                         
                                         I know.
                                         
                                         I just feel like I have to
                                         
                                         disclaim that every time.
                                         
                                         There are a lot of very good ones.
                                         
I think the one that I most eagerly wait for,

and am most disappointed by when it doesn't come out regularly,

is the Exponent podcast.
                                         
                                         They have really good conversations
                                         
                                         about the state of technology and technology strategy.
                                         
                                         It's very nerdy techno business stuff,
                                         
                                         which is kind of trite at this point,
                                         
                                         but it's really, really helpful when you're running a technology company.
                                         
                                         Cool.
                                         
    
                                         I haven't heard that one.
                                         
                                         Yeah,
                                         
                                         that sounds cool.
                                         
                                         I highly,
                                         
                                         highly recommend it.
                                         
                                         What speed do you listen to podcasts at?
                                         
                                         As fast as the Overcast app will let me,
                                         
which ranges from like 2.5 to 2.8.
                                         
    
                                         But if you guys know of a different podcasting app
                                         
                                         that could let me listen to it more quickly,
                                         
                                         I would be very grateful.
                                         
                                         You have probably a better vocal processing front end
                                         
                                         in your cranium than I do.
                                         
                                         2.8 is pushing it.
                                         
                                         So you build yourself up to it.
                                         
                                         And this might get into a longer conversation,
                                         
    
                                         but I use text-to-speech to read
                                         
                                         most of the content that I read, whether it's blog posts or books, because my vision is going bad.
                                         
                                         And I've worked up from normal reading speeds to 700 words a minute.
                                         
That's great. And, you know, on a different show, on the O'Reilly Solid or Design, one of the O'Reilly podcasts, I heard Chris say that he listened so fast, and I started to bump up my Overcast every time I listened this week.
                                         
                                         And it got, it really, it worked.
                                         
I mean, I was surprised. I was up at 1.8 from about 1.25, which is what I had been listening at.

Yeah, it worked, as long as I didn't jump around.

Okay, so now we should talk about why you're here, which is actually one more lightning round question. Okay: voice recognition or speech recognition?

I would say speech. I think that's the more standard word choice.
                                         
                                         Because that's what the academics use.
                                         
                                         Because we don't really care about recognizing voices
                                         
                                         unless you're doing passcodes.
                                         
                                         And what you really care about is speech.
                                         
                                         Right.
                                         
                                         So I think when you're talking about taking audio
                                         
                                         and turning it into text,
                                         
    
                                         that would be speech recognition.
                                         
                                         I think voice, like you said, would be more closely associated
                                         
                                         with identifying who is doing the speaking.
                                         
                                         So like speaker identification.
                                         
                                         And that's also really, really helpful and useful,
                                         
                                         but we're not really there yet.
                                         
                                         Well, from a more mundane perspective,
                                         
                                         VR is already an overloaded acronym.
                                         
    
                                         Yes.
                                         
                                         Yes.
                                         
                                         Okay, so speech recognition.
                                         
                                         And that leads to Siri and Alexa.
                                         
                                         Should we not use that word?
                                         
                                         Alexa?
                                         
                                         Or Siri.
                                         
                                         Alexa, tell me a joke.
                                         
    
                                         No, don't do this to people.
                                         
                                         Sorry, everybody.
                                         
                                         Every phone call I have
                                         
                                         someone's echo goes off
                                         
                                         without fail
                                         
                                         so there's Siri and
                                         
                                         Alex something
                                         
                                         and these
                                         
    
                                         are the two most well known
                                         
                                         speech
                                         
                                         recognition things
                                         
                                         and Google
                                         
Google Voice? I don't know what the keyword is. Google Now, I think, or something.

It's now Google Assistant.

Okay. Will it tell me a joke too?

Probably not a good one.

Well, Alex-something's are also terrible, so...

Oh, actually, Google hired screenwriters and comedians to do their script writing.
                                         
    
                                         And so the jokes on Google Assistant are significantly better than Amazon Echo.
                                         
                                         And this is actually, for everyone listening, this is what I use my Echo for.
                                         
                                         It has to tell me jokes.
                                         
                                         It sets timers.
                                         
                                         It occasionally plays music, but it mostly tells me jokes. And if you think that is worth the $50 entry price, I have to say it was for me because the jokes are awful.
                                         
                                         But, okay, so Alexa seems better than Siri.
                                         
                                         And Alexa, stop.
                                         
                                         And so why is that?
                                         
    
So this gets to kind of what the latest technological innovation in speech recognition in the last couple of years is.
                                         
                                         So Siri has been around for a handful of years now, and she's finally gotten good enough, as with all the other services, at understanding
                                         
                                         what we say. So she can recognize the words that we're saying with a very high percentage of
                                         
                                         accuracy, no matter the environment and no matter the accent. What she still struggles with and what
                                         
                                         Alexa does a better job of is understanding what we mean. So the vast majority of the time, Siri will just punt your query or command
                                         
                                         to a Bing web search
                                         
                                         where Alexa is much more focused
                                         
                                         on application-specific actions
                                         
    
                                         and enabling developers to build apps
                                         
                                         that can respond to your specific requests. So the difference is this
                                         
                                         ability to understand the meaning of the words that you are using. And Alexa has done a much
                                         
                                         better job of that and Google Assistant as well. But they all go to the internet.
                                         
They do all go to the internet. Though starting with iOS 10, Apple has said that speech recognition can be done on-device for certain devices. And they haven't been very specific about that.
                                         
                                         Probably the newest.
                                         
                                         Yeah.
                                         
                                         Right.
                                         
    
So you said they all get pretty good at recognizing. That is not my experience.

Is it because I talk wrong? Is it because I have a higher voice and they're trained for male voices? Is it because I tend to do these stupid pauses

and then sort of sing for the rest of my sentence?
                                         
                                         If I had to pick one, it would probably be the last one that you said.
                                         
                                         I'm sure that the data... So the way that all of these systems work is by collecting tons and tons and tons of data,
                                         
                                         of audio recordings of different people speaking with different accents and cadences
                                         
                                         and frequencies in different
                                         
    
                                         environments, and then using that to train machine learning to recognize the next time
                                         
                                         someone says something.
                                         
                                         So I would imagine that the data is very or fairly representative of the broader market.
                                         
                                         So men and women of all age groups.
                                         
                                         But I think being more sing-songy in your elocution
                                         
                                         might throw Siri for a loop in the same way
                                         
                                         that Siri and speech recognition in general
                                         
                                         has a really hard time with kids
                                         
    
                                         because their voices are significantly higher pitched.
                                         
                                         And it is true that shouting at the devices works better.
                                         
                                         Of course, that's going to be more clipped.
                                         
                                         So I think that gets into another factor,
                                         
which is the quality of the microphones on the device that you're using. So if you're using Siri very close to your mouth, she's going to do a better job of understanding you than at far-field distances, which, in my experience with Siri, she is nowhere near as good at.
                                         
                                         And not even like understanding what I'm saying, but just even hearing, hey, Siri.
                                         
                                         Well, that's going to be definitely true because Siri has one microphone, maybe two, and Alexa's got five or something.
                                         
                                         And so she can do beamforming and be able to hear things far more clearly by augmenting the signal from the different microphones.
                                         
    
                                         Right.
                                         
                                         That's why the Echo Dots can put a light on that shows where it thinks you're coming from.
                                         
                                         Oh, I've not noticed that, but that's really cool.
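The beamforming mentioned just above can be sketched as delay-and-sum: advance each microphone's signal to undo its arrival delay for the steered direction, then average, which reinforces sound from that direction and averages down noise from others. This toy version assumes the per-channel integer sample delays have already been estimated (e.g. by cross-correlation); real arrays do this with fractional delays and adaptive weighting.

```python
def delay_and_sum(signals, delays):
    """Toy delay-and-sum beamformer.

    signals: list of per-microphone sample lists, all the same length.
    delays:  integer sample delays by which each channel lags the
             steered direction; advancing channel k by delays[k]
             samples lines the channels up before averaging.
    """
    n = len(signals[0])
    out = []
    for i in range(n):
        acc = 0.0
        for sig, d in zip(signals, delays):
            j = i + d  # advance this channel to undo its arrival delay
            if 0 <= j < n:
                acc += sig[j]
        out.append(acc / len(signals))
    return out
```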
                                         
                                         Are the underlying technologies between these really different?
                                         
                                         I mean, as far as I understand, at the very lowest level,
                                         
                                         you come in through the microphone, you do some signal processing,
                                         
you segment it at pauses, which is where I probably fail,
                                         
                                         and then you try to build up the phonemes
                                         
    
                                         from the frequency information.
                                         
                                         Once you have phonemes, then you can start to build words
                                         
and then sentences, and you do some matching.
                                         
                                         Is that all the same for all of them or do they do special stuff?
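The pipeline Elecia sketches, frames of audio into phonemes and phonemes into words, can be illustrated with a toy decoder. Real recognizers score phoneme sequences probabilistically with acoustic and language models; this greedy dictionary lookup, with a made-up pronunciation table, only shows the shape of the final step.

```python
# Toy sketch of the classic ASR pipeline's last stage:
# phoneme labels -> pronunciation-dictionary lookup -> words.
# The phoneme symbols and dictionary entries are illustrative.
PRONUNCIATIONS = {
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def phonemes_to_words(phonemes):
    """Greedy longest-match decode of a phoneme stream into words."""
    words, i = [], 0
    while i < len(phonemes):
        # Try the longest remaining chunk first, shrinking until a
        # dictionary entry matches.
        for length in range(len(phonemes) - i, 0, -1):
            chunk = tuple(phonemes[i:i + length])
            if chunk in PRONUNCIATIONS:
                words.append(PRONUNCIATIONS[chunk])
                i += length
                break
        else:
            i += 1  # unknown phoneme: skip it
    return words
```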
                                         
                                         I would say at this point,
                                         
                                         it's going to be very similar across all the platforms.
                                         
And the only thing that's going to vary is the data set that they're using.
                                         
                                         Google had their Google Voice product that they used to collect all of this data,
                                         
    
                                         speech data to train so that they could transcribe voicemails.
                                         
                                         And then that's now powering their speech recognition.
                                         
I think for the speech recognition, it really is going to be very similar, and there aren't going to be major differences. Moving forward, where the differences are going to emerge, or the differences in quality, I should say, is in understanding
                                         
                                         what you mean with your words. So the voice applications that we use, whether it's Siri or Alexa,
                                         
                                         there's the speech recognition piece, and then there's the natural language understanding piece.
                                         
                                         And that's taking the text that it's transcribed from the speech recognition and trying to assign
                                         
    
                                         meaning or intent to those phrases. And like the speech recognition, the underlying technology relies on machine
                                         
                                         learning and is similar, but it relies much more on the data, which we haven't really built up that
                                         
                                         data set because with speech data, you just need people talking. But with the natural language
                                         
                                         understanding, you need data from people booking flights, from interacting with calendars, from
                                         
                                         whatever that action is. You need data of people talking and expressing the intent to complete that action.
                                         
                                         So those are kind of, I think, in terms of the differences in experiences moving forward
                                         
                                         or across Alexa and Google Assistant and Google Home and Siri,
                                         
                                         it's going to be because of the amount of data they have for those specific intents.
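The natural language understanding step Chris describes, mapping many phrasings onto one action, can be caricatured as keyword scoring. Real NLU models are trained on large sets of labeled utterances; the intent names and keyword sets here are invented for the sketch.

```python
# Minimal sketch of intent classification: score each candidate
# intent by how many of its keywords appear in the utterance.
INTENT_KEYWORDS = {
    "CreateCalendarEvent": {"meeting", "appointment", "calendar", "schedule", "book"},
    "OrderPizza": {"pizza", "order", "delivery"},
}

def classify_intent(utterance):
    """Pick the intent whose keyword set best overlaps the utterance,
    or None if nothing matches."""
    words = set(utterance.lower().split())
    best, best_score = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best, best_score = intent, score
    return best
```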
                                         
    
                                         And so that's like, if I want to make a meeting on January 12th, 2017, I can phrase that a
                                         
                                         thousand different ways, whether it's meeting 9am January 20, whatever I said, or whether
                                         
                                         it's book an appointment Jan 12 or Monday after next or
                                         
                                         whatever, I have all of these different ways I can phrase that. They may all mean the same thing.
                                         
                                         Is that the sort of thing you're talking about? That's exactly right. So with the calendar example,
                                         
there are the different kinds of words you choose to use. So, like, "make an appointment," "make

a calendar event," uh, "I need to go do this thing," "I have a meeting." So there's that piece, and then

there's the, um, the pieces of information you need to complete that task of creating a calendar event.

So you need the date, and you need the time, and then you need, like, a label. And so not only can
                                         
                                         you express the desire to create that calendar event differently,
                                         
                                         you can not provide all the information you need.
                                         
                                         So then you have to collect that data.
                                         
                                         So the app would have to ask you,
                                         
                                         what time on Saturday would you like your event
                                         
                                         to be scheduled for?
                                         
                                         But you can also put those in all different orders.
                                         
    
                                         So you could say the time before the date,
                                         
                                         before the label, or the label,
                                         
                                         then the time, and then the date.
                                         
                                         And so all that complexity has to be solved for.
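The "slots in any order, ask for what's missing" loop described here can be sketched in a few lines. Everything below (the function names and the toy regex parser) is invented for illustration, not any particular platform's API:

```python
import re

REQUIRED_SLOTS = ("date", "time", "label")

def parse_slots(utterance):
    """Toy parser: pull out whichever slots appear, in any order."""
    slots = {}
    m = re.search(r"\b(jan(?:uary)?\s+\d{1,2})", utterance, re.I)
    if m:
        slots["date"] = m.group(1)
    m = re.search(r"\b(\d{1,2}\s*(?:am|pm))", utterance, re.I)
    if m:
        slots["time"] = m.group(1)
    m = re.search(r"(?:called|titled)\s+(.+)$", utterance, re.I)
    if m:
        slots["label"] = m.group(1)
    return slots

def next_prompt(slots):
    """Return a follow-up question for the first missing slot, or None."""
    prompts = {
        "date": "What day is the event?",
        "time": "What time would you like it scheduled for?",
        "label": "What should I call it?",
    }
    for name in REQUIRED_SLOTS:
        if name not in slots:
            return prompts[name]
    return None  # all slots filled; ready to create the event
```

So "make an appointment jan 12 at 9am" fills the date and time slots in whatever order they were spoken, and the app's next turn is to ask for the label.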
                                         
                                         And luckily, calendaring is one of these common tasks that they've been working on for years now,
                                         
                                         so it's gotten better.
                                         
                                         But that same complexity is there whether you're making a calendar event
                                         
                                         or you're ordering a pizza or, um, you know, playing Jeopardy through the Jeopardy skill.

                                         And that gets to another, kind of a third piece of this, beyond recognition and natural language processing, which is kind of a mental model or state machine. Because it's one thing to say, take an action, and have it respond and take that action, but it's another to keep context and keep a conversation going, right?
                                         
                                         And that's sort of a new piece, I think,
                                         
                                         that's still being explored.
                                         
                                         Yeah, absolutely right.
                                         
                                         So the amount of state that is managed
                                         
    
                                         in current voice applications is very limited.
                                         
                                         We're limited to these question and answer
                                         
                                         or question-and-answer with a clarification. So, like, if you don't provide all the information it needs, the app will ask you for it. The next step is to be able to maintain your state in kind of a multi-step process. So not necessarily a longer conversation, but being able to, for example, cook through a recipe.
                                         
                                         So there's the original, like you open your voice application and you search for a recipe. So
                                         
                                         there's the search for a recipe state. Then there's the search results for that recipe
                                         
    
                                         state. Then there's once you've selected the recipe, there's that state. And then there's
                                         
                                         the step you are in that recipe as you're cooking through it
                                         
                                         and/or the ingredients in the list. And so being able to track each of those states is definitely doable, but not really supported, or we haven't seen that yet in voice applications on, you know, the Amazon Echo. And then beyond that is a robust, full conversation, where you can just go back and forth on whatever topic. And Amazon just announced a $2.5 million prize that 12 different academic teams are competing for, to try and create an app using Alexa's APIs that can hold a conversation for 20 minutes.
                                         
    
                                         For 20 minutes?
                                         
                                         Yeah, which I think is a pretty hard problem.
                                         
                                         That's a lot of state.
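The recipe flow described above (search, results, selected recipe, the step you're on while cooking) is essentially a small state machine, where which utterances make sense depends on where you are. A minimal sketch, with invented state and intent names (not any platform's actual API):

```python
# States in the recipe flow; transitions fire on user intents.
TRANSITIONS = {
    ("searching", "search_recipe"): "results",
    ("results", "select_recipe"): "selected",
    ("selected", "start_cooking"): "cooking",
    ("cooking", "next_step"): "cooking",        # advance within the same state
    ("cooking", "list_ingredients"): "cooking",
}

class RecipeSession:
    def __init__(self):
        self.state = "searching"
        self.step = 0  # which instruction we're on while cooking

    def handle(self, intent):
        """Apply an intent; reject intents that don't fit the current state."""
        new_state = TRANSITIONS.get((self.state, intent))
        if new_state is None:
            return f"Sorry, you can't {intent} right now."
        if intent == "next_step":
            self.step += 1
        self.state = new_state
        return f"OK ({self.state})"
```

Saying "next step" before a recipe has even been selected gets rejected, which is exactly the kind of context tracking that is doable but not yet common in voice applications.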
                                         
                                         I mean, there are some Alexa games that you can play
                                         
                                         that have some state.
                                         
                                         There's one where you battle robots, and it's sort of
                                         
                                         fun, but
                                         
    
                                         it is definitely one
                                         
                                         of those, you could play
                                         
                                         it on a board and it would be pretty trivially
                                         
                                         easy to see where your
                                         
                                         robot is and what it's done and
                                         
                                         what the other robots are doing.
                                         
                                         But it feels
                                         
                                         a little conversational.
                                         
    
                                         Like it's the dungeon master
                                         
                                         and you're playing Dungeons and Dragons sort of thing.
                                         
                                         I don't know how you'd get beyond that.
                                         
                                         So yeah, I think there's a lot of challenges
                                         
                                         as the user of these experiences
                                         
                                         where, with a visual application, like you said, if the board is in front of you, it's a lot easier for us to understand where we are and what all the actions we can take are. And that discoverability of what you can do is, I think, one of the biggest problems with Siri on the iPhone, right?
                                         
                                         Or Google Assistant or Cortana, these general virtual assistants
                                         
                                         where you don't know what she can do
                                         
                                         and you don't know how to say the right thing
                                         
                                         to get her to do what you want her to do.
                                         
                                         So the discoverability of those interactions is really, really hard.
                                         
                                         I think Amazon has done a really good job with this skill model, which parallels how people are used to applications on an iPhone or websites on the web. And so with skills for voice, you are limiting what the user can do to the very specific context of that skill. And then, because it's so limited, it's easier to kind of intuit what you can do. So, like, with searching for recipes: if you're in a recipe app, chances are you can search for a recipe. If you're in a shopping app, chances are you can search for a product,
                                         
                                         get reviews for that product and purchase that product. So by limiting the context,
                                         
                                         it makes it a lot easier, not only on the technology side and understanding what the user is saying, but on the user side for them to know what they can say.
                                         
                                         And that is such a huge problem.
                                         
                                         I mean, in Siri, I can ask what the tides are, because we go to the beach and I always want to know what the tides are.
                                         
    
                                         Is it going out? Is it coming in?
                                         
                                         Christopher doesn't always like to get his feet wet, so I need to know how far we can go.
                                         
                                         This is false. It's sort of false.
                                         
                                         And so, if I go to Siri, there is one particular phrasing that
                                         
                                         will lead me to the tides. All the other phrasings,
                                         
                                         I mean: Siri, what are the tides? Nope. What are the tides today? Nope. Is the tide coming in?
                                         
    
                                         None of those work.
                                         
                                         I have to find exactly the right phrasing.
                                         
                                         And it changes.
                                         
                                         And it does change, which makes me crazy.
                                         
                                         Yeah.
                                         
                                         So I don't know exactly what Siri's internal architecture is that leads to that specific
                                         
                                         outcome. But if that same
                                         
    
                                         problem were to happen on
                                         
                                         Alexa or in a specific
                                         
                                         skill, so if you had the Tide skill for
                                         
                                         Alexa and you said, Alexa
                                         
                                         ask Tides
                                         
                                         when the high tide is,
                                         
                                         the whole
                                         
                                         primary responsibility of the app developer for that skill
                                         
    
                                         is to collect the thousand different ways
                                         
                                         that you are going to ask for tide information
                                         
                                         so that it will recognize what you say in your natural language
                                         
                                         without having to think about the right way to say it
                                         
                                         and then be able to respond.
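On Alexa, collecting those thousand different phrasings means listing sample utterances that all map to the same intent. A rough sketch of the idea, where the utterances, the intent name, and the naive exact-match lookup are all stand-ins for a hypothetical Tides skill (a real platform's matcher is far more forgiving):

```python
# Hypothetical sample utterances for one intent in a Tides skill.
SAMPLE_UTTERANCES = {
    "when is the high tide": "GetHighTideIntent",
    "when is high tide": "GetHighTideIntent",
    "what time is high tide": "GetHighTideIntent",
    "what are the tides": "GetHighTideIntent",
    "what are the tides today": "GetHighTideIntent",
    "is the tide coming in": "GetHighTideIntent",
}

def normalize(utterance):
    """Lowercase, trim punctuation, and collapse whitespace."""
    return " ".join(utterance.lower().strip("?!. ").split())

def match_intent(utterance):
    """Return the matched intent name, or None if no sample covers it."""
    return SAMPLE_UTTERANCES.get(normalize(utterance))
```

The developer's job is making that list exhaustive enough that "What are the tides?" and "Is the tide coming in?" both land on the same intent, so the user never has to hunt for the one magic phrasing.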
                                         
                                         Because like you just talked about,
                                         
                                         if you say it and it doesn't understand,
                                         
                                         you're going to have this expectation
                                         
    
                                         that it doesn't work
                                         
                                         or you're going to get really frustrated
                                         
                                         because you know it does.
                                         
                                         You know it can tell you the tide,
                                         
                                         but you just can't communicate effectively.
                                         
                                         And so being able to understand what the user said
                                         
                                         is like the core problem
                                         
    
                                         in voice computing right now
                                         
                                         and is the number one responsibility
                                         
                                         of the app developer
                                         
                                         when they're building their app.
                                         
                                         And it's a real adoption problem too
                                         
                                         because on paper,
                                         
                                         if something works 90% of the time,
                                         
                                         that sounds pretty good.
                                         
    
                                         Oh, I can talk to my computer and it responds
                                         
                                         and does what I say 90% of the time.
                                         
                                         But the truth is, if something doesn't work
                                         
                                         10% of the time, I think most people are just going to stop trying.
                                         
                                         Yeah, it's a big problem. And I think it's something Amazon and Google are trying to work on, which is, for that 10% when it doesn't understand, there are kind of two possible reasons for that.
                                         
                                         One is it just completely doesn't understand what you're saying.
                                         
                                         And so like there's nothing it can do.
                                         
    
                                         But then there's also the case where it does understand what you're saying.
                                         
                                         It just doesn't have that functionality, right?
                                         
                                         So for the tide example, if you said,
                                         
                                         when's the high tide, you know, next month, or next Tuesday at 8 p.m., it could understand that request but not be able to respond. So instead of not responding with that information, it could say, sorry, I don't know when the high tide is next week, or something like that. So by the application responding with, hey, I hear you, I just can't help you, it helps to teach the user to be more experimental in the requests that they're making.
                                         
                                         How important is it to have that be a human interaction? I mean, you were saying,
                                         
    
                                         we want the app to be able to say,
                                         
                                         I can't do that.
                                         
                                         And I was thinking, I'm sorry, Dave.
                                         
                                         I'm afraid I can't do that.
                                         
                                         Would be hilarious to me once, maybe twice.
                                         
                                         But as you design these interactions,
                                         
                                         how much do you have it be efficient and how much do you have it be personable?
                                         
    
                                         I have very strong opinions on this,
                                         
                                         which is to say you should avoid personality
                                         
                                         as much as possible
                                         
                                         unless you have a really, really, really good reason
                                         
                                         to have personality.
                                         
                                         Because it gets really annoying. Personality is cute when it did something right for you, but personality makes you want to throw your phone against the wall when it doesn't do what you want it to. And it even gets annoying, and I hate to bash on Siri, but with, like, every single thing I ask her to do,
                                         
                                         there's some cute turn of phrase in the response.
                                         
                                         If it's cold outside, or when it's raining outside, she'll be like, don't forget to bring an umbrella.
                                         
                                         And it just gets really tedious after a while. We don't expect our computers to have personality
                                         
                                         when we go do a Google search.
                                         
                                         We just want the information that we're asking for.
                                         
    
                                         So until we go from 90% accurate to 98% accurate or higher,
                                         
                                         I think personality is only going to get in the way
                                         
                                         of a good user experience.
                                         
                                         I can see that.
                                         
                                         It would be fun for a little while, but I just want it to work.
                                         
                                         Spend less time messing around making neat features, and more time making it work.
                                         
                                         Right.
                                         
                                         And this kind of gets back to the jokes, which is like, it's great that you can tell me jokes,
                                         
    
                                         but I'd much rather you be able to tell me what my latest email is
                                         
                                         or what my latest text message was.
                                         
                                         I can see that,
                                         
                                         although that's not how I use it
                                         
                                         because if I'm in the kitchen,
                                         
                                         I don't really want to play with my email.
                                         
                                         I do that all the other times.
                                         
                                         And you're not the only one.
                                         
    
                                         I think the majority of people
                                         
                                         who have the Amazon Echo or the Google Home have them in or near their kitchen.
                                         
                                         So if not email, it would be great if she could help you cook through recipes or order groceries or order food from Uber Eats or Grubhub or something or make a restaurant reservation.
                                         
                                         We haven't used the shopping list yet, but I do think that could be useful.
                                         
                                         Because you're standing there and you just ran out of oatmeal.
                                         
                                         Okay, Alexa, add oatmeal to my shopping list would be useful.
                                         
                                         Christopher's shaking his head like you haven't all unplugged them by now.
                                         
                                         So that's what you do.
                                         
    
                                         You actually help people design these conversational methodologies.

                                         Yeah. So we have a tool called TinCan.ai, T-I-N-C-A-N dot A-I, which does two things. One is it helps you prototype a voice application. So, right now, it's really hard to go through the design process with voice apps.
                                         
                                         You know, wireframing doesn't really work for non-visual experiences. Like you can draw the
                                         
                                         outline of what your mobile app is going to look like, and you get a good sense of how that app is
                                         
                                         going to behave. But you can't do that with a voice app. You can, like, create sample scripts: the app says this, and the user says this, and the app says this. But that's only so helpful,
                                         
                                         and you can't do user testing with that.
                                         
                                         So that's the point of this prototyping tool
                                         
                                         is for designers and developers
                                         
                                         to very quickly be able to mock up
                                         
                                         how their voice app is going to behave
                                         
                                         and then put that in front of users
                                         
                                         to do user testing.
                                         
    
                                         And the reason that's so important, which is what we've been talking about this whole
                                         
                                         time, is so you can collect the data of how people are going to interact with your skill.
                                         
                                         What are the ways that they're going to phrase the questions that you expect them to ask?
                                         
                                         Because there's just so many different ways.
                                         
                                         And so we help you to collect that data so that when you do release your skill, it's going
                                         
                                         to be more responsive to all the different ways that people are going to talk to it.
                                         
                                         So does it end up being a natural language processing problem where you have this giant
                                         
                                         tree of options and it's one of those and you just have to build the tree? So, yes and no.
                                         
    
                                         So it is very much a natural language processing problem.
                                         
                                         But with the limitations of voice computing right now, there's not really a tree.
                                         
                                         There's a set of actions a user can take called intents,
                                         
                                         and all of those intents are active at the same time.
                                         
                                         So if a user is using your skill and says something to activate that intent, that intent
                                         
                                         will be activated. And then it'll call a function for your app to take an action against that.
                                         
                                         And that action could be to look up the low tide time and speak it back to the user. Or it could be
                                         
    
                                         to look up the low tide time, realize, oh, they didn't give me a location, ask the user for that location, and then speak the time to them. And so
                                         
                                         you need all of the data to be able to recognize when a user has expressed that intent. And then you also have to specify the different types of data that you're going to receive from that voice request.
                                         
                                         So like a date and time, a physical location, a proper noun, a phone number, and things like that.
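The flow just described, where an intent fires, its handler runs, and the handler either answers or asks for the missing slot, can be sketched like this. The handler shape, field names, and tide lookup are all invented for illustration; this is not the Alexa Skills Kit API:

```python
def look_up_low_tide(location):
    """Placeholder; a real skill would call a tides data source here."""
    return "3:42 PM"

def handle_low_tide(slots):
    """Handler for a hypothetical LowTideIntent.

    Either speaks the answer or elicits the missing 'location' slot.
    """
    location = slots.get("location")
    if location is None:
        # Missing required slot: ask for it and keep the session open.
        return {"speech": "For what location?", "elicit": "location"}
    tide_time = look_up_low_tide(location)
    return {"speech": f"Low tide in {location} is at {tide_time}.",
            "elicit": None}
```

So the same intent handles both "when's low tide in Santa Cruz" (answer directly) and "when's low tide" (ask for the location first).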
                                         
                                         And so there's some overlap between applications, because anything having to do with a date and time can come in all of the ways a calendar meeting can be requested. So platforms like Microsoft's LUIS or Facebook's Wit.ai or Google's API.ai will have not only built-in intents for common interactions,
                                         
                                         like confirmations, like yes, no, cancel, go back,
                                         
                                         those types of things,
                                         
                                         but also the common data types like calendars
                                         
    
                                         and phone numbers and things.
                                         
                                         It feels like you're building a whole new programming language.

                                         It is a very new way of doing things.

                                         That wasn't a yes or no, but okay.

                                         So you're relying on the same programming languages. Like, you can build an Alexa skill in whatever language you want, JavaScript, Java, anything that can run on a server and connect to Amazon over HTTP.
                                         
                                         And it's just that the architecture is different
                                         
                                         and relies so heavily on speech recognition
                                         
                                         and on natural language processing.
                                         
    
                                         So the requirements of implementing natural language processing

                                         have a lot of nuances very specific to that process

                                         that we haven't really had to deal with before.
                                         
                                         But it's not really coding.
                                         
                                         It's just data collection and design.
                                         
                                         Okay.
                                         
                                         I often see many problems as coding.

                                         The grammar generation

                                         seems like the same sort of grammar generation you would use if you were building

                                         another language. But yes, I see many things

                                         as coding problems.
                                         
                                         So what sort of tactical advice,

                                         like things that if I was building an interface,

                                         what are the first five things you tell people to look out for?
                                         
                                         This variability has got to be one of them, but what else?
                                         
    
                                         So is it okay if I answer the steps that I would take in thinking through designing a voice app?
                                         
                                         Sure.
                                         
                                         Yeah.
                                         
                                         So I would be...
                                         
                                         The first step is to clearly define the actions that a user can take.
                                         
                                         And so let's go with this example of the tide app. So if you want to be able to report to users
                                         
                                         when the high and low tides are for a given day
                                         
                                         and for a given beach,
                                         
    
                                         that means they're going to have to,
                                         
                                         they're going to ask for the high tide,
                                         
                                         they're going to ask for the low tide,
                                         
                                         they're going to ask for the tide table for that day,
                                         
                                         they're going to ask for a specific beach,
                                         
                                         they're going to ask for a specific location,
                                         
                                         and they may even ask qualitative questions like, is today a good day to go surfing? And so in the first step,
                                         
                                         it's to think of all the different actions a user is going to want to take. And the second step is
                                         
    
                                         to limit what you are going to support to a very concrete set of actions.
                                         
                                         So in the first version, you may want to limit it to just the tide tables and put off qualitative questions on what that means for different water activities.

                                         Then step three is to think through all of the ways that people are going to express those different actions and the different ways they're going to form those questions.

                                         And then step four is, once you've come up with everything that you can think of, to go ask other people or do testing with other people to see how they're going to express those same
                                         
                                         questions.
                                         
                                         Because no matter how many you can think of on your own,
                                         
                                         what other people are going to say is going to be dramatically different.
                                         
                                         So you have to be able to account for that.

                                         Like, "Is the tide coming in?" would have been one of my questions.
                                         
                                         Right.
                                         
                                         Exactly.
                                         
                                         Which is totally solvable with the data set
                                         
                                         you have already.
                                         
                                         But it wasn't in your list.
                                         
    
                                         So yeah, you have to ask what other people think
                                         
                                         because you're not going to get all the answers yourself.
                                         
                                         Go on.
                                         
                                         Yeah, that's exactly right.
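That collection step, checking your own list of phrasings against utterances gathered from other people, can be sketched as a toy coverage check. Everything here is invented for illustration, and the exact-match lookup is deliberately naive; real NLU services do statistical matching.

```python
# Phrasings the developer thought of, mapped to intents (invented names).
MY_PHRASINGS = {
    "when is high tide": "GetHighTide",
    "when is low tide": "GetLowTide",
    "what are the tides today": "GetTideTable",
}

def match(utterance):
    # Exact match only -- which is the point: a hand-written list is
    # brittle compared to what real users actually say.
    return MY_PHRASINGS.get(utterance.lower().rstrip("?"))

def uncovered(test_utterances):
    """Utterances from user testing that fall through the list."""
    return [u for u in test_utterances if match(u) is None]
```

Running collected questions through `uncovered` surfaces phrasings like "Is the tide coming in?" that never made the original list.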
                                         
                                         And then once you have that, it's a

                                         pretty trivial

                                         coding problem.
                                         
                                         Alexa skills as they exist today

                                         are straightforward: the Echo will tell you

                                         when an intent is activated, it'll

                                         call a specific function,

                                         and that function will just need to

                                         access the data that you have

                                         and then

                                         speak that data back to the user.
                                         
    
                                         And you generate
                                         
                                         the response.
                                         
                                         And that is just a complicated way of saying,
                                         
                                         writing out the text that the app is going to say.
                                         
                                         And then that's really it.
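A minimal sketch of that flow, assuming invented function and data names: the platform reports which intent fired, a handler looks up the data, and returns the text to speak, or a follow-up question if a slot is missing.

```python
# Invented tide data and handler names; a real skill would fetch this
# from its own backing store.
TIDE_DATA = {"ocean beach": {"low": "11:42 AM", "high": "5:03 PM"}}

def handle_get_low_tide(slots):
    beach = slots.get("Beach")
    if beach is None:
        # Missing slot: ask the user instead of answering.
        return "Which beach would you like the low tide for?"
    low = TIDE_DATA[beach.lower()]["low"]
    return f"Low tide at {beach} is at {low}."

# The platform activates an intent; we dispatch to its handler.
HANDLERS = {"GetLowTide": handle_get_low_tide}

def on_intent(name, slots):
    return HANDLERS[name](slots)
```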
                                         
                                         That sounds so easy.
                                         
                                         And that's the thing,
                                         
                                         is it really isn't rocket science.
                                         
    
                                         It's just, I think, where mobile and web development is, you know, there's
                                         
                                         considerable amounts of software development involved. And then design kind of goes hand

                                         in hand with that. And it makes the product better. Voice is significantly less complicated
                                         
                                         on the software engineering side, but relies significantly more on good design
                                         
                                         and the design process and doing user testing to collect the data that's going to power
                                         
                                         that voice application. That makes a lot of sense. If you're getting to 90%
                                         
                                         success rate and all you need to do is listen to a few more things your users are asking for
                                         
                                         and figure out how to map them into the functions you already have to get to 96 percent, that seems

                                         so worth it.

                                         Right, right. It absolutely is. And it kind of is similar to traditional application development
                                         
                                         with analytics and seeing how people are interacting
                                         
                                         with your application
                                         
                                         and then focusing on the parts of the application
                                         
                                         where people aren't doing well.
                                         
                                         But where traditionally you have to come up
                                         
                                         with your own ways of making that process easier
                                         
                                         for the user.
                                         
    
                                         With voice, you know what they're saying.
                                         
                                         You know what they're trying to do that you don't understand.
                                         
                                         So all you have to do is retrain on that expression
                                         
                                         you didn't understand or add the features to support
                                         
                                         that expression that the user requested
                                         
                                         that you couldn't respond to before.
                                         
                                         So just listening to the user

                                         becomes so much more important,
                                         
    
                                         but it's also so much easier with voice.
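That feedback loop can be sketched in a few lines: keep a count of every utterance the app couldn't map to an intent, then rank them so the most common failures get retrained or supported first. All the names here are invented.

```python
from collections import Counter

# Tally of utterances the NLU failed to map to any intent.
unrecognized = Counter()

def record_miss(utterance):
    unrecognized[utterance.lower()] += 1

def retraining_queue(top_n=5):
    """Most frequent misses first: the best candidates to retrain on
    or to add as new features."""
    return [u for u, _ in unrecognized.most_common(top_n)]
```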
                                         
                                         So before we get off the topic of general voice,
                                         
                                         speech, natural language processing,
                                         
                                         and application design,
                                         
                                         one thing you haven't mentioned is localization
                                         
                                         and supporting multiple languages.
                                         
                                         That seems like a really big problem.
                                         
                                         It's the same problem, though.
                                         
    
                                         You have to go from what's input.
                                         
                                         Is it, though?
                                         
                                         Because you have to repeat what we're saying about collecting ways people could ask questions
                                         
                                         or signify their intent.
                                         
                                         That can be quite different in different languages.
                                         
                                         You can have idioms.
                                         
                                         You're exactly right.
                                         
    
                                         Where, you know, historically,
                                         
                                         localization is just translating your website
                                         
                                         or your app into another language.
                                         
                                         With voice, not only do you have to do that,
                                         
                                         but you have to do this data collection
                                         
                                         in those other languages.
                                         
                                         So, you know, Amazon Alexa is available in the US, the UK, and Germany. And so for the German version,
                                         
                                         you have to translate your skill to speak back in German. But then you also have to get the
                                         
    
                                         user utterances or the expressions that people are going to use and speak to your app in German.
                                         
                                         And then also in British English, because idioms, like you said, are going to be different.
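The shape of that problem, one translated response set but separately collected utterances per locale, can be sketched like this. The locale codes are real; the phrases are invented examples.

```python
# Sample utterances have to be collected per locale, because idioms
# differ even between US and British English.
UTTERANCES = {
    "en-US": ["when is low tide", "what time's low tide"],
    "en-GB": ["when's the tide out"],
    "de-DE": ["wann ist ebbe"],
}

# Responses are closer to classic localization: translate once.
RESPONSES = {
    "en-US": "Low tide is at {time}.",
    "en-GB": "Low tide is at {time}.",
    "de-DE": "Ebbe ist um {time}.",
}

def respond(locale, time):
    return RESPONSES[locale].format(time=time)
```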
                                         
                                         So the localization problem is greater than it has been before, but it is the same process as building the voice app in the first place. You just have to repeat it for every

                                         language that you're going to be supporting.

                                         That's sort of like how you have to do that but for numbers: when you have an embedded system and you want to output numbers, they're all different.

                                         Yeah, you know, it's a repeat of a large data set collection. It's just not

                                         quite the same as shoving an Excel spreadsheet to a translation company and having it come back.

                                         That never works as well as people think anyway.
                                         
    
                                         Well.
                                         
                                         Okay, so I want to move on to a different topic.
                                         
                                         We've been talking about these applications that go to the net.
                                         
                                         And I have to admit, I don't really like it.
                                         
                                         One of the reasons I didn't get an Echo sooner
                                         
                                         was because I really don't like the fact that it
                                         
                                         goes out to the net. I have privacy issues with it and I have issues with both Siri and the Echo
                                         
                                         failing to get to the network for some stupid reason, even though everything else in the house
                                         
    
                                         is fine. How long are we going to be stuck going out to the net to do things? When can we do

                                         it here?

                                         That is a good question. I think the limiting factor right now is the speech recognition.

                                         It's still very processor-intensive to convert audio into text
                                         
                                         and running the machine learning algorithms and classifiers on that data
                                         
                                         to the point where it requires offloading to the internet.
                                         
                                         But once you have that text, the natural language processing, on the other hand,

                                         is much less processor-intensive.

                                         The training itself is very processor-intensive, and that takes lots of GPUs running for hours to get right.
                                         
    
                                         But once that model is trained, the actual classification is much more straightforward. So it's not...
                                         
                                         Speech recognition is one thing,
                                         
                                         and I think we're getting there.
                                         
                                         So hopefully in the next couple of years,
                                         
                                         we'll be able to do on-device speech recognition
                                         
                                         over an open dictionary.
                                         
                                         So right now you can do it with a fixed number of words

                                         in the tens to hundreds, so less than 500.
                                         
    
                                         But hopefully in the next couple of years, we'll get to the place where you can do open speech recognition, where you can just listen for anything and get the text of it. That's kind of the limiting factor, at least with respect to speech input.
                                         
                                         I don't really know.
                                         
                                         I don't have a sense of what the text-to-speech requirements would be.
                                         
                                         Because all of those, oh no, you can do that on device now.
                                         
                                         I'm sorry.
                                         
                                         Yes, definitely.
                                         
                                         Yes.
                                         
                                         Yeah.
                                         
    
                                         So, you know, text-to-speech you can do on device.
                                         
                                         And the natural language processing you can do on device,
                                         
                                         but there's no way,
                                         
                                         there's no services that let you do that right now.
                                         
                                         So you would have to build your own, essentially,
                                         
                                         which is a lot more complicated than it should be.
                                         
                                         So technically it's not a problem,
                                         
    
                                         it's just that it isn't available right now.
                                         
                                         And then I still think it's a couple years away before we can do speech recognition on device.
                                         
                                         There are some dev kits out there.
                                         
                                         There's the EasyVR, which is easy voice recognition on SparkFun.
                                         
                                         It's a $50 Arduino shield.
                                         
                                         And it's got a number of commands that are built in.
                                         
                                         There are things like robot stop, robot go, robot turn.
                                         
                                         And it's supposed to work pretty well.
                                         
    
                                         And you can add things to it.
                                         
                                         It also supports five languages in addition to US English, so that would be six languages.
                                         
                                         But it only supports, as you said, a very small dictionary. You can train it for a few more,
                                         
                                         but it's not going to be long sentences. It's not going to be, tell me a joke. It's going to be,
                                         
                                         joke now, joke cat, joke shark. Yeah, really, this is all I use her for. And then there's another one,
                                         
                                         a $20 one from Seeed Studio, the Grove Speech Recognizer. It's Cortex M0 Plus based,
                                         
                                         and it has 22 commands, which is pretty cool. But again, that's not a dictionary.
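Devices like these behave roughly like a lookup table: the recognizer emits one of a small set of trained command words, and the firmware maps it to an action. A sketch, with invented command and action names in the spirit of the EasyVR's "robot stop" and "robot go":

```python
# Fixed-vocabulary dispatch: anything outside the trained set is
# simply not recognized. Command and action names are invented.
ACTIONS = {
    "robot stop": "motors off",
    "robot go": "motors on",
    "robot turn": "turning",
}

def dispatch(command):
    return ACTIONS.get(command, "unrecognized")
```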
                                         
                                         And if you think about the Cortex M0 Plus and how small it is and how efficient it is,
                                         
    
                                         that has a lot of goodness, but I suspect they're pushing that as hard as it can go.
                                         
                                         And they're still only getting 22 commands. If they added flash, or probably RAM, or whatever they

                                         needed to do more, they would need a bigger processor, because

                                         your signal processor gives you phonemes,

                                         and then you have to compare those phonemes to your dictionary of phonemes.
                                         
                                         And that process is very intensive and takes a while, and you want your device to be able

                                         to respond very quickly. It's one of the things that all the ones we've been talking about do:

                                         they respond to you almost immediately, unless they've gone off to the net and fallen down. But when they

                                         work, it's a snappy back and forth. Let's see what else we have.
                                         
                                         Oh, Radio Shack has been doing voice recognition, or speech recognition, chips since 1980.
                                         
                                         Well, I don't think they're still doing it.
                                         
                                         And how well is that working out?
                                         
                                         I don't remember.
                                         
                                         I think it worked.
                                         
                                         How many commands did it have?
                                         
                                         You know, a dozen or something.
                                         
    
                                         I don't remember what it was,
                                         
                                         but it was a little masked ROM with a tiny microprocessor.
                                         
                                         It was like 10 bucks in an IC.
                                         
                                         It was for the robot kind of thing, like you were saying.
                                         
                                         But they have to be pretty separate words.
                                         
                                         They have to sound really different, you mean?
                                         
                                         On and off are terrible.
                                         
                                         Start and stop are not great because those are

                                         both words that sort of sound like each other.

                                         Yeah, you want to do things that sound different. Open and closed sound very different. That's a good thing,

                                         but for natural language processing it's not a good thing.
                                         
                                         Right.
                                         
                                         I was going to say, I'm not as familiar with the boards that are out there,
                                         
                                         but on the software side, there's an open source on-device speech recognition library
                                         
                                         from Carnegie Mellon called Pocket Sphinx.
                                         
                                         And it is an on-device speech recognition engine
                                         
    
                                         that, again, has limited dictionaries,
                                         
                                         but the numbers it will support are much higher,

                                         I think in the hundreds. But it's going to require a full

                                         Linux computer on a chip.

                                         So maybe you could run it on the new C.H.I.P., the $9 PC, or something like that.
                                         
                                         Yes.
                                         
                                         But yeah, that's another thing
                                         
                                         if people are interested in
                                         
    
                                         that might be worth exploring.
                                         
                                         Their links are all broken, darn it.
                                         
                                         I will put a link in the show notes that actually works instead.
                                         
                                         Never mind.
                                         
                                         I'll send you good links on that
                                         
                                         if you want to check the show notes
                                         
                                         for Pocket Sphinx.
                                         
    
                                         And there was another group, Audeme.
                                         
                                         They were on the Amp Hour, which is sort of our...
                                         
                                         Brothers, I almost said sister show, but that doesn't work exactly.
                                         
                                         We'll go with brother, cousin, sworn enemy show.
                                         
                                         Second cousin show.
                                         
                                         And they talked a lot about the technology of putting it together. But that's a $70 Linux-based part, wow.
                                         
                                         And that's retail, I mean, it's maker retail, so assume that you could actually get those parts for $10 to $20 and put them in your device and it would all work.
                                         
                                         But that means your cost has to go way up, and it starts to make sense why we aren't doing these locally.
                                         
                                         But someday, I mean, if Moore's law works for us, then that will be cheaper.
                                         
                                         And in 10 years, we'll just be walking around talking to ourselves without cell phones.
                                         
                                         Yeah, I mean, I think we're definitely going to be walking around talking to ourselves very soon.
                                         
                                         It's just definitely going to rely on the internet, and a company is going to be able to record all of that.
                                         
                                         You're not helping my privacy concerns here.
                                         
                                         So I think the privacy concerns are very valid. Probably, technology issues aside, privacy concerns are the biggest issue
                                         
                                         facing a wider adoption
                                         
                                         of really, really interesting things
                                         
                                         that we can do with voice
                                         
                                         because of just companies using that data and knowing everything
                                         
                                         about you and then governments doing the same in this country and others.
                                         
                                         And I think people are going to have a very justified negative response if we push that
                                         
    
                                         too far forward. So I think in order to kind of see the full potential
                                         
                                         of what we can do with voice computing,
                                         
                                         we're going to need more privacy protections
                                         
                                         like written into law.
                                         
                                         And I don't know when that's going to ever happen.
                                         
                                         Well, let's not go further down that path today.
                                         
                                         Okay, okay.
                                         
                                         But I agree, we're going to have to sort this out because it's all there.
                                         
    
                                         Different companies have different attitudes towards it.
                                         
                                         I mean, Apple, at least consistently up to now, has been very, you know, privacy is one of our product features.
                                         
                                         And so, you know, I think they would be pushing hard
                                         
                                         to get stuff processing locally,
                                         
                                         and then it's just a matter of,
                                         
                                         okay, you're doing a web search,
                                         
                                         or you have text going up to the web.
                                         
                                         Or text going to your apps.
                                         
    
                                         But it's not consistently listening
                                         
                                         to everything you're saying
                                         
                                         and potentially having a privacy hole there.
                                         
                                         But it's going to be interesting to see
                                         
                                         if consumers make that something that they want, I guess.
                                         
                                         Because if they don't,
                                         
                                         then it's kind of incumbent on companies to force it on them.
                                         
                                         That's, no, we're going to do this.
                                         
    
                                         It's going to take us a little longer to make this truly safe for you.
                                         
                                         Can you be patient?
                                         
                                         Right?
                                         
                                         All right, let's move on. We were connected through the O'Reilly Design Conference.
                                         
                                         You'll be speaking there next spring. What are you going to talk about?
                                         
                                         Yeah, I'm going to be running a workshop on designing and prototyping voice-based applications.
                                         
                                         So I think we're going to go in more depth
                                         
                                         into the topics that we covered today,
                                         
    
                                         like what are the underlying technologies
                                         
                                         that are making this wave of voice computing possible
                                         
                                         and what does that mean for what you can do
                                         
                                         as a voice app developer and designer?
                                         
                                         And then what are the design considerations
                                         
                                         and best practices in voice user interface design?
                                         
                                         And then we're actually going to sit down and prototype a voice app together using the prototyping tool that we've created.
                                         
                                         And then we'll do user testing with other people in the workshop.
                                         
    
                                         And hopefully, I'm really excited to see what people come up with.
                                         
                                         Do you think people will bring their apps and say,
                                         
                                         how do I do this?
                                         
                                         I hope so.
                                         
                                         I think that'll make for a much more interesting workshop.
                                         
                                         I think I'm curious to see,
                                         
                                         like one of the big things that I want to talk about is,
                                         
                                         does it make sense for your use case
                                         
    
                                         for there to be a voice application?
                                         
                                         Because not everything, at least today, should have a voice app because of the limitations and
                                         
                                         just what makes sense, right? I don't think people are going to be doing a lot of clothing shopping
                                         
                                         on their Amazon Echo.
                                         
                                         And yet she tries.
                                         
                                         What apps are really good candidates? I mean, jokes aside, what should we be looking for soon?
                                         
                                         So the question that should be asked is, given the context that the user is in, like their physical environment
                                         
                                         and what they're doing, does voice provide a more efficient means of completing that task
                                         
                                         than other possibilities? And that's kind of where Echo has done a really good job
                                         
                                         is in the home. Most people don't have their phones on them at home. So having a device that is just
                                         
                                         there that you can talk to is going to be a better experience than having to go find your phone
                                         
    
                                         and then pick it up and complete that task. Especially for interacting with your connected
                                         
                                         home. It's a lot easier to say, Alexa, turn on the lights than it is to either get up off the
                                         
                                         couch and go hit the light switch, or pull out your phone from your pocket and open up your smart home app.
                                         
                                         You should totally have seen Chris
                                         
                                         try to turn on the lights recently. It took him like
                                         
                                         12 tries. It would have been
                                         
                                         so much easier to go stand over there.
                                         
                                         But yes, sorry.
                                         
                                         No, I think there's an
                                         
    
                                         entire other episode that can
                                         
                                         be done on just how frustrating
                                         
                                         the smart home is
                                         
                                         as a consumer.
                                         
                                         But yeah, so that's the overarching question is,
                                         
                                         does voice provide a more efficient means
                                         
                                         of completing a task than other alternatives?
                                         
                                         And then you have to ask that for your specific use case,
                                         
    
                                         but also for the device you're building for.
                                         
                                         And right now, most people who are thinking about voice,
                                         
                                         I think are thinking about it on the Amazon Echo and the Google Home, when that opens up to developers later this month. And I think experiences that are related to food or cooking or kind of the family group experience are going to lend
                                         
                                         themselves much better in the short term to voice computing. So anything with recipes,
                                         
                                         with food ordering, with restaurant reservations and food search, those types of tasks I think are
                                         
                                         very well suited right now. It could change channels on the TV too. That would be nice. It can if you have a Logitech Harmony.
                                         
                                         And that's one of my next purchases
                                         
                                         because I would really like to set that up.
                                         
    
                                         Which clearly we do not.
                                         
                                         Behind the time on technology.
                                         
                                         It's clearly a business expense.
                                         
                                         Okay, so what apps don't make good candidates
                                         
                                         for voice recognition?
                                         
                                         You mentioned clothing
                                         
                                         and I could see how like some of the picture only
                                         
                                         like Pinterest probably wouldn't be that useful.
                                         
    
                                         Right.
                                         
                                         Anything that is very visual,
                                         
                                         at least right now, is not going to make sense.
                                         
                                         You know, there are whispers that Amazon's coming out
                                         
                                         with an Echo with a touchscreen.
                                         
                                         So that might be better, but still probably not going to be better
                                         
                                         than just using it on your phone.
                                         
                                         So anything heavily visual, anything where you are kind of communicating
                                         
    
                                         very sensitive information, so not necessarily secretive,
                                         
                                         but that also counts.
                                         
                                         So like credit card information or like your address, things
                                         
                                         that if you get that piece of information wrong, just the experience of trying to do that through
                                         
                                         speech is going to be very poor. So I would avoid anything that has to rely on that.
                                         
                                         Yeah, there's some phone tree applications that want you to say your social security number.
                                         
                                         And I'm always like, you have no idea where I am. I could be in a coffee shop. This isn't the sort
                                         
                                         of thing you want me to do out loud. Right, right. I think that kind of gets to one of the other
                                         
    
                                         kind of things to keep in mind with the Echo is it's a public device in your home. And unless
                                         
                                         you live by yourself, other people are going to either overhear the conversation that you're
                                         
                                         having or be able to access that same data. And so we talked about email a little earlier,
                                         
                                         but that's, I think, an example of something that people might not be comfortable doing on this public or shared device.
                                         
                                         So anything with very personal information
                                         
                                         like mental health or email or journaling
                                         
                                         or other things of that nature
                                         
                                         I don't think are good fits
                                         
    
                                         for voice computing as it exists today.
                                         
                                         I think that'll change very quickly,
                                         
                                         but right now, probably not.
                                         
                                         That makes sense.
                                         
                                         There are a lot of things that I could see
                                         
                                         in that whole stuff you don't tell everybody all the time bucket.
                                         
                                         Right.
                                         
                                         Cool.
                                         
    
                                         Well, it does sound like it's going to be an interesting tutorial. Let's see.
                                         
                                         You care about voice recognition a lot, clearly, and you, or speech recognition,
                                         
                                         and how it's designed. And you mentioned that you use a screen reader. What is that like?
                                         
                                         Extremely, extremely frustrating.
                                         
                                         So the reason I got into this whole space is because five years ago, I found out that I was going blind.
                                         
                                         I was diagnosed with a genetic disorder where I'm losing my central vision.
                                         
                                         And as my vision has gotten worse over that period of time, I've gotten very familiar with the assistive technology that's out there and specifically the screen reader. And I'm very disappointed with
                                         
                                         how they work and the experience of using that product. And for those of you who aren't familiar,
                                         
    
                                         a screen reader is software for the visually impaired where there's a cursor on the screen that you can move using keyboard
                                         
                                         commands on a laptop or swipe gestures on a smartphone. And whatever that cursor highlights,
                                         
                                         that text is read aloud or the metadata for that text is read aloud. So you're navigating this two-dimensional visual experience
                                         
                                         in a one-dimensional audio stream using keyboard commands.
                                         
                                         And it's very cumbersome.
                                         
                                         And that's only when it works well.
                                         
                                         A lot of websites and a lot of mobile applications
                                         
                                         don't take the steps necessary to make their products accessible.
                                         
    
                                         And so this creates a really poor experience for the vast majority of the blind community,
                                         
                                         not to mention the fact that most people are losing their vision from aging-related disorders.
                                         
                                         So they're very unlikely to know how to do email, let alone use email with the screen
                                         
                                         reader and other assistive technology.
                                         
                                         So voice computing represents this much better way of doing things where these applications are designed for audio first, and they're going to be a much more intuitive experience
                                         
                                         for someone who's less tech savvy.
                                         
                                         And conversation is something that we've all been doing since we've been able to
                                         
                                         talk. So it's much more comfortable. So I'm really optimistic and really hopeful that in the near
                                         
    
                                         future, we're going to be able to do everything we can do on our smartphones through voice instead.
                                         
                                         So you didn't, as we were talking about apps that weren't necessarily good candidates,
                                         
                                         you didn't mention things you want to do fast. And as I think about screen readers and
                                         
                                         how hard it is to navigate a website, I imagine that that is incredibly slow and that these
                                         
                                         speech conversational things are also pretty slow. And you said you listen to
                                         
                                         podcasts really fast. How do we get these to be less slow? I've always found in meetings, I'm like,
                                         
                                         can't we just do this over IM where we can all type faster?
                                         
                                         Yeah. So I think a big part is just letting you increase the speech rate of your voice service.
                                         
    
                                         And so that's a standard feature in screen readers that I would love to see on Alexa or Siri or Google Assistant.
                                         
                                         And that would help make these interactions more efficient.
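For what it's worth, a skill developer can already approximate a faster speech rate per response using SSML's prosody element, which Alexa's text-to-speech supports. A sketch (the helper name is made up):

```python
def speed_up(text: str, rate: str = "fast") -> str:
    """Wrap response text in SSML so the assistant reads it faster.
    Alexa's SSML prosody rate accepts values like "slow", "fast",
    "x-fast", or a percentage such as "120%"."""
    return f'<speak><prosody rate="{rate}">{text}</prosody></speak>'

print(speed_up("Here is the next step in the recipe."))
```

This only speeds up one skill's own responses, though; a system-wide rate setting, like screen readers have, would still need platform support.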
                                         
                                         Alexa, talk faster.
                                         
                                         Will that work? No, it won't.
                                         
                                         It doesn't work.
                                         
                                         I was waiting to see what she would say back to you. Oh no, she's in a different room. Okay.
                                         
                                         No, I wish. So that's one thing that we can do, but again, that's more for power users who are used to it.
                                         
                                         I think other things that we can do to make the experiences more efficient are being
                                         
    
                                         more responsive to more natural language commands, rather than having to force you to go through a
                                         
                                         sequence of very specific commands. That's one. And then two, allowing you to make multiple requests in the same kind of
                                         
                                         interaction.
                                         
                                         I'm trying to think of a good example
                                         
                                         of that.
                                         
                                         Set two
                                         
                                         timers, one for five minutes and one for
                                         
                                         ten minutes. That's not a great one, but
                                         
                                         the idea is that that would be more efficient
                                         
                                         than saying Alexa set a timer for five minutes,
                                         
                                         Alexa set a timer for ten minutes.
                                         
                                         And there's also UI things that you can do
                                         
                                         to make the experience more efficient
                                         
                                         that I think will help kind of move us in that direction.
                                         
                                         But that's difficult.
                                         
                                         I mean, the timers is not too difficult
                                         
    
                                         because you're going to the same app in the end.
                                         
                                         But if I wanted to say,
                                         
                                         put eggs on my grocery list
                                         
                                         and remind me to go to the grocery
                                         
                                         before I go home,
                                         
                                         those are two separate apps.
                                         
                                         And so you have to have enough
                                         
                                         upper level natural language processing
                                         
    
                                         to split it between them.
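A toy sketch of that upper-level split: break a compound utterance on the conjunction before routing each part to its own app's intent handling. Real assistants need far more than this, so treat it purely as the shape of the problem (all names are invented):

```python
import re

def split_requests(utterance: str) -> list[str]:
    """Naively split a compound utterance on the word 'and'.
    A real NLU layer would parse clauses properly; this only
    shows where the split has to happen before intent routing."""
    parts = re.split(r"\band\b", utterance)
    return [p.strip() for p in parts if p.strip()]

print(split_requests(
    "put eggs on my grocery list and "
    "remind me to go to the grocery before I go home"))
```

Each resulting fragment would then be dispatched to a separate skill, which is exactly where today's platforms fall short.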
                                         
                                         Yeah, yeah.
                                         
                                         You're exactly right. And I think that's
                                         
                                         going to be a challenge for these platforms
                                         
                                         and the ones who can
                                         
                                         do it the best are
                                         
                                         going to be the ones that succeed.
                                         
                                         And hopefully we see
                                         
    
                                         that sooner rather than later.
                                         
                                         But you're right.
                                         
                                         It's navigating between applications, but
                                         
                                         it's also carrying
                                         
                                         on a conversation with the same application.
                                         
                                         Right now in Alexa, your skill is only active for like 16 seconds.
                                         
                                         So if you're not carrying on that conversation continuously,
                                         
                                         you have to re-invoke that application and start over every time you talk to it.
                                         
    
                                         So like with the recipe example,
                                         
                                         if you're stepping through the steps in a recipe,
                                         
                                         you'd have to say, Alexa, ask the recipe app
                                         
                                         what the next step in the recipe is every time.
                                         
                                         And that's less efficient than just saying,
                                         
                                         Alexa, what's the next step?
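To make the continuous-conversation idea concrete, here is a minimal, hypothetical sketch of how a skill can keep its session open between turns using the Alexa Skills Kit response format: `shouldEndSession` stays false while there are more steps, and the current position is carried in `sessionAttributes`. The recipe steps and function name are invented for illustration.

```python
# Hypothetical sketch: an Alexa custom-skill response that keeps the session
# open between turns, so the user can say "next step" without re-invoking
# the skill every time. The JSON shape follows the Alexa Skills Kit
# response format; the steps and function names are made up.

STEPS = [
    "Preheat the oven to 350 degrees.",
    "Whisk the eggs and sugar together.",
    "Fold in the flour and bake for 30 minutes.",
]

def next_step_response(session_attributes):
    """Build an ASK-style response for a hypothetical 'next step' intent."""
    step_index = session_attributes.get("step", 0)
    if step_index < len(STEPS):
        speech = STEPS[step_index]
        end_session = False          # keep listening for the next turn
    else:
        speech = "That was the last step. Enjoy!"
        end_session = True           # recipe finished, close the session

    return {
        "version": "1.0",
        "sessionAttributes": {"step": step_index + 1},
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": end_session,
        },
    }

# Simulate three consecutive turns of the conversation.
attrs = {}
for _ in range(3):
    reply = next_step_response(attrs)
    attrs = reply["sessionAttributes"]
    print(reply["response"]["outputSpeech"]["text"])
```

Because the session stays open, each "next step" arrives as a new turn of the same conversation instead of a fresh invocation.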
                                         
                                         Well, there's, I mean, that robot game
                                         
                                         lets you stay in it until you exit it or die.
                                         
    
                                         So that's coming, right?
                                         
                                         You don't know what the robot game is and I don't remember the name.
                                         
So that's not very useful information, is it?
                                         
                                         I hope, I would imagine that they're improving that.
                                         
                                         Like it just makes so much sense. So hopefully that's something that's happened
                                         
                                         and we'll continue to see improvements in that direction.
                                         
                                         And your recipe app is exactly what we need here.
                                         
                                         Because once you can do that,
                                         
    
                                         once you can step through something
                                         
                                         that's a fairly linear process,
                                         
then you can start stepping through things that aren't as linear. And even the linear process of having a recipe, at some point you're going to want to have a timer. It builds on itself, and it does seem like someday we will get to conversation.

This whole thing reminds me of those old Infocom games.

Yes.

From the 80s, those text adventures.
                                         
                                         Zork!
                                         
                                         And yeah, it's the same kind of problem of anticipating any kind of input.
                                         
                                         Where for years we've had, okay, there's a mouse pointer and a keyboard.
                                         
    
                                         And the keyboard has a fixed set of key commands and the mouse points exactly where you want it to.
                                         
                                         And now it's all, it's a completely different way of thinking about applications,
                                         
                                         and I think that's what's hard for people.
                                         
                                         Yeah, we're definitely in a transition phase.
                                         
                                         And as we learn best practices, not only in the design,
                                         
                                         but in the architecture side of things,
                                         
                                         we're going to start seeing better and better applications.
                                         
                                         Well, that sounds like it's a good place to start wrapping things up.
                                         
    
                                         Chris, do you have any last questions?
                                         
Yeah, I wanted to ask, since a lot of people who listen to the show are tinkerers
                                         
                                         and electrical engineering people and software people who make projects of their own
                                         
or might be exploring little Internet of Things projects.
                                         
                                         What would you suggest if somebody said,
                                         
                                         hey, you know, I'd really like to hook voice recognition,
                                         
                                         speech recognition into my project or product.
                                         
    
                                         I don't know where to start.
                                         
                                         It wouldn't be making their own voice recognition system, but it would probably be hooking into Amazon Echo's ecosystem or something.
                                         
                                         So where would somebody start to look at that?
                                         
                                         Yeah, I think the Echo is a great place to start.
                                         
                                         I think it's the most open right now,
                                         
                                         and there's a lot of support for kind of hooking it up
                                         
                                         to more project-type things with If This Then That and other services
                                         
                                         that you can use to link up voice commands in your skill
                                         
    
                                         to different devices in your home
                                         
                                         or projects you may be working on.
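The hook between a voice command and a home project is usually just a small request handler. Here is a hypothetical sketch of the Lambda-style entry point for a custom Alexa skill that drives a hobby device; the intent name `TurnOnLampIntent` and the device-control function are invented, while the event and response shapes mirror the Alexa Skills Kit request format.

```python
# Minimal sketch of a Lambda-style entry point for a custom Alexa skill
# that drives a hobby project. The intent name and the device-control
# function are hypothetical.

def turn_on_lamp():
    # Stand-in for whatever your project actually does: an HTTP call to a
    # microcontroller, an MQTT publish, an IFTTT webhook, etc.
    return "Okay, the lamp is on."

def handler(event, context=None):
    request = event["request"]
    if request["type"] == "LaunchRequest":
        speech = "Welcome. What should I do?"
    elif (request["type"] == "IntentRequest"
          and request["intent"]["name"] == "TurnOnLampIntent"):
        speech = turn_on_lamp()
    else:
        speech = "Sorry, I didn't catch that."

    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }

# Example: the reply Alexa would speak for a "turn on the lamp" request.
fake_event = {"request": {"type": "IntentRequest",
                          "intent": {"name": "TurnOnLampIntent"}}}
print(handler(fake_event)["response"]["outputSpeech"]["text"])
```

Everything project-specific lives in `turn_on_lamp`, which is what makes this pattern friendly to tinkerer-scale hardware.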
                                         
                                         So I would definitely go to Alexa's developer pages,
                                         
                                         which I don't remember the URL off the top of my head.
                                         
                                         There's also a pretty active developer forum there,
                                         
                                         which has a lot of tinkerers
                                         
                                         and kind of more hardware maker type communities of people.
                                         
                                         And there's also a Slack channel that they have.
                                         
    
                                         So there's definitely a large developer community there.
                                         
                                         And that's probably where I would start.
                                         
Alexa Skills Kit.
                                         
                                         It's developer.amazon.com.
                                         
                                         And there will be a link in the show notes.
                                         
                                         All right.
                                         
                                         That sounds like fun, actually.
                                         
                                         Think of all the things you can make people do
                                         
    
                                         if you can interact with them voice-wise.
                                         
                                         Make people do?
                                         
                                         You're going to take over people now?
                                         
                                         Well, think of all the things you can make them say.
                                         
                                         It would just be funny to have a game where you had to guess the word
                                         
                                         and then I made you say some quote or some stupid thing.
                                         
                                         Reverse double plus upside down social engineering people.
                                         
                                         Yeah, you know.
                                         
    
                                         Got it.
                                         
You're going to write the code to make the people do things, not the other way around.

We're going to have voice recognition change how people speak instead of just recognizing things differently. It's going to have to change how I speak.

Well, Chris, thank you so much for being with us. Do you have any last thoughts you'd like to leave us with?

No, just thank you so much for having me. If you're interested in voice computing, especially voice design, definitely check out our prototyping tool. It's a great place to start playing around with designing voice experiences. And again, that's at tincan.ai.

Our guest has been Chris Morey, the founder of Conversant Labs, a company providing design
                                         
                                         and development tools to help create fully conversational applications for iOS and the
                                         
                                         Amazon Echo.
                                         
    
                                         I'd like to send a special thank you out to O'Reilly's Nina Cavanis for hooking me up
                                         
                                         with Chris.
                                         
Their design conference is in San Francisco in March 2017.
                                         
                                         Wow, that's coming fast.
                                         
                                         Thank you also to Christopher for producing and co-hosting.
                                         
                                         And of course, thank you for listening.
                                         
                                         Hit the contact link on Embedded FM if you'd like to say hello,
                                         
                                         sign up for the newsletter, subscribe to the YouTube channel,
                                         
    
                                         enter the Tinkercad contest,
                                         
                                         and don't forget, Alexa, play Embedded FM.
                                         
                                         All right, final thought, final thought.
                                         
                                         Let's see.
                                         
                                         How about one from Helen Keller?
                                         
                                         Be of good cheer.
                                         
                                         Do not think of today's failures, but of the successes that may come tomorrow.
                                         
                                         You've set yourself a difficult task, but you will succeed if you persevere.
                                         
    
                                         And you will find a joy in overcoming obstacles.
                                         
                                         Remember, no effort that we make to attain something beautiful is ever lost.
                                         
                                         Embedded is an independently produced radio show that focuses on the many aspects of engineering.
                                         
                                         It is a production of Logical Elegance, an embedded software consulting company in California.
                                         
                                         If there are advertisements in the show, we did not put them there and do not receive money from them.
                                         
                                         At this time, our sponsors are Logical Elegance and listeners like you.
                                         
