Embedded - 178: Alexa Stop

Episode Date: December 7, 2016

We spoke with Chris Maury (@CMaury) about using speech recognition to interact with devices. Note: Please turn off your Echo and Dots as we invoke Alexa a lot.

Chris is the founder of Conversant Labs. They created TinCan.ai, which can help you wireframe or prototype a conversational user interface. They can also help you build Alexa Skills, though if you are so inclined, you might try it for yourself: Alexa Skills Kit.

Chris will be speaking at the O'Reilly Design Conference in San Francisco, CA in March 2017, giving a tutorial on building voice-based user interfaces. You can read more from Chris on his Medium posts: medium.com/@CMaury.

CMU PocketSphinx

Some of the embedded devices Elecia mentioned: Audeme (as heard on The Amp Hour #258), the Grove Speech Recognizer from Seeed, and EasyVR.

We haven't gotten embedded.fm (or any podcast) to work with Alexa but we aren't sure why. Have you?

Transcript
Hello. Before we even start the show, I have to warn you that if you have an Amazon Echo or Dot device nearby, please unplug it. Or unplug it after the first 10 minutes, because we're going to say her name a lot. And yeah, if you don't want her to tell you jokes or shark facts or whatever it is we ask her to do during the show, now is the time to unplug your Amazon Echo device. Hello and welcome to Embedded. I'm Elecia White alongside Christopher White. Our guest this week is Chris Maury, who is going to tell us about speech recognition. Before we get started, we will be announcing the Tinker Kit winner next week.
If you enter by December 9th, it might be you. Hi, Chris. Welcome to the show. Hi, Chris. Hi, Elecia. Thanks for having me. Could you tell us about yourself? Yeah. So I'm the founder of a company called Conversant Labs, where we build tools for designers and developers who are interested in building voice-based applications, whether that's for smartphones or the Amazon Echo, or really in general. We're really excited about voice and just making it as easy as possible to get that done. Excellent. And of course, we have many, many questions about that. But before that, we usually do this thing called lightning round where we ask you short questions and we hope for short answers.
And if we're behaving ourselves, we don't ask you "why" and "how" and "could you tell us more." So, you ready? I'll do my best. Donuts, bagels, or other ring-shaped breakfast treats? Bagels. When you do a design, should it be for new users and ease of use, or should it be for great flexibility for experienced users?
Oh, oh man. This is sort of Notepad versus Emacs. I see. Well, I would definitely choose Notepad over Emacs any day. I think for the first step, it should be whatever gets your idea recorded as easily and quickly as possible, where you don't lose the thought or the design. Related: form or functionality? Oh, functionality.
100%. Least favorite planet? Least favorite planet? I don't have one. I know that's sad. I did go to space camp in fourth grade, so I should have an answer to this. We'll say, um, Venus.
Should we bring back the dinosaurs, in any context? Yes. Favorite fictional robot? Oh, this is the question I couldn't come up with an answer to. I know, I'm just ruining the premise of lightning round. It's such a good question, though, and I'm very upset. Oh, Data from Star Trek: The Next Generation. Do you listen to podcasts? Yes.
What is your favorite, this show excluded? I don't think he listens. I know. I just feel like I have to disclaim that every time. There are a lot of very good ones. I think the one that I most eagerly wait for
and am most disappointed when they don't come out regularly is the Exponent podcast. They have really good conversations about the state of technology and technology strategy. It's very nerdy techno-business stuff, which is kind of trite at this point, but it's really, really helpful when you're running a technology company. Cool.
I haven't heard that one. Yeah, that sounds cool. I highly, highly recommend it. What speed do you listen to podcasts at? As fast as the Overcast app will let me, which ranges from like 2.5 to 2.8.
Starting point is 00:04:01 But if you guys know of a different podcasting app that could let me listen to it more quickly, I would be very grateful. You have probably a better vocal processing front end in your cranium than I do. 2.8 is pushing it. So you build yourself up to it. And this might get into a longer conversation,
but I use text-to-speech to read most of the content that I read, whether it's blog posts or books, because my vision is going bad. And I've worked up from normal reading speeds to 700 words a minute. That's great. And, you know, after I heard him say on a different show, on the O'Reilly Solid or Design, one of the O'Reilly podcasts, I heard Chris say that he listened so fast, and I started to bump up my Overcast speed every time I listened this week. And it really worked. I mean, I was surprised: I was up at 1.8 from about 1.25, which is what I had been listening at. It was, yeah, it worked, as long as I didn't jump. Yeah. Yeah. Okay, so now we should talk about why you're here, um, which is actually one more lightning round question. Okay: voice recognition
or speech recognition? I would say speech. I think that's the more standard word choice. Because that's what the academics use. Because we don't really care about recognizing voices unless you're doing passcodes. And what you really care about is speech. Right. So I think when you're talking about taking audio and turning it into text,
Starting point is 00:05:42 that would be speech recognition. I think voice, like you said, would be more closely associated with identifying who is doing the speaking. So like speaker identification. And that's also really, really helpful and useful, but we're not really there yet. Well, from a more mundane perspective, VR is already an overloaded acronym.
Starting point is 00:06:03 Yes. Yes. Okay, so speech recognition. And that leads to Siri and Alexa. Should we not use that word? Alexa? Or Siri. Alexa, tell me a joke.
No, don't do this to people. Sorry, everybody. Every phone call I have, someone's Echo goes off without fail. So there's Siri and Alex-something, and these
are the two most well-known speech recognition things. And Google. Google Voice? I don't know what the keyword is. Google Now, I think, or something. It's now... it's Google Assistant. Okay. Will it tell me a joke too? Probably not a good one. Well, the Alex-somethings are also terrible, so... Oh, uh, actually Google hired screenwriters and comedians to do their script writing.
Starting point is 00:07:06 And so the jokes on Google Assistant are significantly better than Amazon Echo. And this is actually, for everyone listening, this is what I use my Echo for. It has to tell me jokes. It sets timers. It occasionally plays music, but it mostly tells me jokes. And if you think that is worth the $50 entry price, I have to say it was for me because the jokes are awful. But, okay, so Alexa seems better than Siri. And Alexa, stop. And so why is that?
Starting point is 00:07:45 So this gets to kind of what the latest kind of technological innovation in the last couple of years is in speech recognition. So Siri has been around for a handful of years now, and she's finally gotten good enough, as with all the other services, at understanding what we say. So she can recognize the words that we're saying with a very high percentage of accuracy, no matter the environment and no matter the accent. What she still struggles with and what Alexa does a better job of is understanding what we mean. So the vast majority of the time, Siri will just punt your query or command to a Bing web search where Alexa is much more focused on application-specific actions
and enabling developers to build apps that can respond to your specific requests. So the difference is this ability to understand the meaning of the words that you are using. And Alexa has done a much better job of that, and Google Assistant as well. But they all go to the internet. They do all go to the internet. Though starting with iOS 10, Apple has said that speech recognition can be done on device for certain devices. And they haven't been very specific about that. Probably the newest. Yeah. Right.
So you said they all get pretty good at recognizing. That is not my experience. Is it because I talk wrong? Um, is it because I have a higher voice and they're trained for male voices? Is it because I tend to do the stupid pauses and then sort of sing for the rest of my sentence? If I had to pick one, it would probably be the last one that you said. I'm sure that the data... So the way that all of these systems work is by collecting tons and tons and tons of data, of audio recordings of different people speaking with different accents and cadences and frequencies in different
Starting point is 00:10:05 environments, and then using that to train machine learning to recognize the next time someone says something. So I would imagine that the data is very or fairly representative of the broader market. So men and women of all age groups. But I think being more sing-songy in your elocution might throw Siri for a loop in the same way that Siri and speech recognition in general has a really hard time with kids
because their voices are significantly higher pitched. And it is true that shouting at the devices works better. Of course, that's going to be more clipped. So I think that gets into another factor, which is the quality of the microphones on the device that you're using. So if you're using Siri very close to your mouth, she's going to do a better job of understanding you than at longer distances, which in my experience with Siri, she is nowhere near as good at. And not even like understanding what I'm saying, but just even hearing, hey, Siri. Well, that's going to be definitely true because Siri has one microphone, maybe two, and Alexa's got five or something. And so she can do beamforming and be able to hear things far more clearly by augmenting the signal from the different microphones.
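For a sense of what that beamforming step does, here is a toy delay-and-sum sketch in Python; the sample rate, microphone spacing, and steering angle are made-up illustration values, not anything from the Echo's actual design.

```python
# Toy delay-and-sum beamforming: align the signals from several microphones
# for a chosen arrival direction, then average them. Speech arriving from
# that direction adds up coherently; noise from other directions partially
# cancels. All numbers here are illustrative, not the Echo's actual design.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
SAMPLE_RATE = 16000      # Hz, a common rate for speech capture

def delay_and_sum(signals, mic_positions_m, steer_angle_deg):
    """signals: (num_mics, num_samples) array; mic_positions_m: mic positions
    along a line in meters; steer_angle_deg: direction to listen toward."""
    angle = np.radians(steer_angle_deg)
    out = np.zeros(signals.shape[1])
    for sig, x in zip(signals, mic_positions_m):
        # Extra path length for this mic, converted to a whole-sample delay.
        delay_s = x * np.sin(angle) / SPEED_OF_SOUND
        delay_samples = int(round(delay_s * SAMPLE_RATE))
        out += np.roll(sig, -delay_samples)  # shift so the mics line up
    return out / len(signals)

# Example: two mics 5 cm apart, steered 30 degrees off broadside.
mics = np.random.randn(2, SAMPLE_RATE)          # stand-in for captured audio
enhanced = delay_and_sum(mics, [0.0, 0.05], 30)
```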
Starting point is 00:11:58 Right. That's why the Echo Dots can put a light on that shows where it thinks you're coming from. Oh, I've not noticed that, but that's really cool. Are the underlying technologies between these really different? I mean, as far as I understand, at the very lowest level, you come in through the microphone, you do some signal processing, you grade it into pauses, which is where I probably fail, and then you try to build up the phonemes
from the frequency information. Once you have phonemes, then you can start to build words and then sentences, and you do some matching. Is that all the same for all of them, or do they do special stuff? I would say at this point, it's going to be very similar across all the platforms. And the only thing that's going to vary is the data set that they're using. Google had their Google Voice product that they used to collect all of this data,
Starting point is 00:12:53 speech data to train so that they could transcribe voicemails. And then that's now powering their speech recognition. I think for the speech recognition, it really is going to be very similar and there aren't going to be major differences where, and moving forward, where the differences are going to emerge is, or the differences in quality, I should say, are in the understanding what you mean with your words. So the voice applications that we use, whether it's Siri or Alexa, there's the speech recognition piece, and then there's the natural language understanding piece. And that's taking the text that it's transcribed from the speech recognition and trying to assign
Starting point is 00:13:37 meaning or intent to those phrases. And like the speech recognition, the underlying technology relies on machine learning and is similar, but it relies much more on the data, which we haven't really built up that data set because with speech data, you just need people talking. But with the natural language understanding, you need data from people booking flights, from interacting with calendars, from whatever that action is. You need data of people talking and expressing the intent to complete that action. So those are kind of, I think, in terms of the differences in experiences moving forward or across Alexa and Google Assistant and Google Home and Siri, it's going to be because of the amount of data they have for those specific intents.
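As a deliberately tiny illustration of what "assigning an intent" to a phrase means, here is a toy pattern-matching sketch; real assistants learn this mapping from large sets of example utterances rather than regexes, and the intent name, slots, and phrasings below are hypothetical.

```python
# Toy intent matching: map several phrasings to one intent plus slot values.
# Production natural-language-understanding systems use trained models over
# lots of data; this regex version only shows the shape of the problem.
import re

PATTERNS = [
    # Each pattern is one way a user might express the same SetReminderIntent.
    (r"remind me to (?P<task>.+) at (?P<time>.+)", "SetReminderIntent"),
    (r"at (?P<time>.+), remind me to (?P<task>.+)", "SetReminderIntent"),
    (r"don't let me forget to (?P<task>.+) at (?P<time>.+)", "SetReminderIntent"),
]

def understand(utterance):
    for pattern, intent in PATTERNS:
        match = re.fullmatch(pattern, utterance.strip().lower())
        if match:
            return intent, match.groupdict()   # intent name + slot values
    return None, {}                            # "sorry, I didn't get that"

print(understand("Remind me to call mom at 6 pm"))
print(understand("At 6 pm, remind me to call mom"))
# Both map to SetReminderIntent with task='call mom' and time='6 pm'.
```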
And so that's like, if I want to make a meeting on January 12th, 2017, I can phrase that a thousand different ways, whether it's "meeting 9 a.m. January 20," whatever I said, or whether it's "book an appointment Jan 12" or "Monday after next" or whatever. I have all of these different ways I can phrase that. They may all mean the same thing. Is that the sort of thing you're talking about? That's exactly right. So with the calendar example, there's the different kind of words you choose to use, so like "make an appointment," "make a calendar event," "I need to go do this thing," "I have a meeting." So there's that piece, and then there's the pieces of information you need to complete that task of creating a calendar event.
So you need the date, and you need the time, and then you need like a label. And so not only can you express the desire to create that calendar event differently, you can not provide all the information you need. So then you have to collect that data. So the app would have to ask you, what time on Saturday would you like your event to be scheduled for? But you can also put those in all different orders.
So you could say the time before the date, before the label, or the label, then the time, and then the date. And so all that complexity has to be solved for. And luckily, calendaring is one of these common tasks that they've been working on for years now, so it's gotten better. But that same complexity is there whether you're making a calendar event or you're ordering a pizza or, you know, playing Jeopardy
on the Jeopardy skill. And that gets to another, kind of a third piece of this, beyond recognition and natural language processing, which is kind of a mental model or state machine. Because it's one thing to say, take an action, and have it respond and take that action, but it's another to keep context and keep a conversation going, right? And that's sort of a new piece, I think, that's still being explored. Yeah, absolutely right. So the amount of state that is managed
in current voice applications is very limited. We're limited to these question-and-answer, or question-and-answer with a clarification, so like if you don't provide all the information you need, the app will ask you for it. The next step is to be able to maintain your state in kind of a multi-step process. So not necessarily a longer conversation, but being able to, for example, cook through a recipe. So there's the original, like you open your voice application and you search for a recipe. So there's the search-for-a-recipe state. Then there's the search results for that recipe
state. Then there's, once you've selected the recipe, there's that state. And then there's the step you are in that recipe as you're cooking through it, and/or the ingredients in the list. And so being able to track each of those states is definitely doable, but not really supported, or we haven't seen that yet in voice applications on, you know, the Amazon Echo. And then beyond that is like a robust, full conversation, where you can just go back and forth on whatever topic. And Amazon just announced a $2.5 million prize that 12 different academic teams are competing for, to try and create an app using Alexa's APIs that can hold a conversation for 20 minutes.
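A minimal sketch of that kind of per-conversation state tracking, in the style of the session attributes an Alexa-like skill can round-trip with each response; the state names and recipe steps here are made up for illustration.

```python
# Minimal per-session state tracking for a recipe-style voice app.
# Alexa-style skills can return a small "session attributes" dictionary with
# every response; the platform hands it back on the next request, so the
# skill remembers where in the conversation the user is. States and steps
# here are hypothetical.
RECIPE_STEPS = ["Boil the water.", "Add the pasta.", "Simmer for ten minutes."]

def handle_next_step(session_attributes):
    state = session_attributes.get("state", "SEARCHING")
    if state != "COOKING":
        return "Pick a recipe first.", session_attributes

    step = session_attributes.get("step", 0)
    if step >= len(RECIPE_STEPS):
        return "That was the last step. Enjoy!", {"state": "DONE"}

    speech = RECIPE_STEPS[step]
    # Return the updated attributes so they are stored for the next turn.
    return speech, {"state": "COOKING", "step": step + 1}

# Simulate three "what's the next step?" turns in one session.
attrs = {"state": "COOKING", "step": 0}
for _ in range(3):
    speech, attrs = handle_next_step(attrs)
    print(speech)
```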
Starting point is 00:18:10 For 20 minutes? Yeah, which I think is a pretty hard problem. That's a lot of state. I mean, there are some Alexa games that you can play that have some state. There's one that you battle robots and it's sort of fun, but
Starting point is 00:18:28 it is definitely one of those, you could play it on a board and it would be pretty trivially easy to see where your robot is and what it's done and what the other robots are doing. But it feels a little conversational.
So yeah, I think there's a lot of challenges as the user of these experiences, where with a visual application, like you said, if the board is in front of you, it's a lot easier for us to understand where we are and what all the actions we can take are. Um, and that discoverability of what you can do
is, I think, one of the biggest problems with Siri on the iPhone, right? Or Google Assistant or Cortana, these general virtual assistants, where you don't know what she can do and you don't know how to say the right thing to get her to do what you want her to do. So the discoverability of those interactions is really, really hard. I think Amazon has done a really good job with this skill model, which people are used to, like applications on an iPhone or websites on
the web. And so skills for voice, like, you are limiting what the user can do to the very specific context of that skill. And then because it's so limited, it's easier to kind of intuit what you can do. So in like searching for recipes, if you're in a recipe app, chances are you can search for a recipe. If you're in a shopping app, chances are you can search for a product, get reviews for that product, and purchase that product. So by limiting the context, it makes it a lot easier, not only on the technology side in understanding what the user is saying, but on the user side for them to know what they can say. And that is such a huge problem. I mean, in Siri, I can ask what the tides are, which is something... we go to the beach and I always want to know what the tides are now.
Is it going out? Is it coming in? Christopher doesn't always like to get his feet wet, so I need to know how far we can go. This is false. It's sort of false. And so, if I go to Siri, there is one particular phrasing that will lead me to the tides. All the other phrasings... I mean: Siri, what are the tides? Nope. What are the tides today? Nope. Is the tide coming in?
Starting point is 00:21:08 None of those work. I have to find exactly the right phrasing. And it changes. And it does change, which makes me crazy. Yeah. So I don't know exactly how Siri's internal architecture is that leads to that specific outcome. But if that same
Starting point is 00:21:30 problem were to happen on Alexa or in a specific skill, so if you had the Tide skill for Alexa and you said, Alexa ask Tides when the high tide is, the whole primary responsibility of the app developer for that skill
Starting point is 00:21:48 is to collect the thousand different ways that you are going to ask for tide information so that it will recognize what you say in your natural language without having to think about the right way to say it and then be able to respond. Because like you just talked about, if you say it and it doesn't understand, you're going to have this expectation
Starting point is 00:22:08 that it doesn't work or you're going to get really frustrated because you know it does. You know it can tell you the tide, but you just can't communicate effectively. And so that being able to understand what the user said is like the core problem
Starting point is 00:22:22 in voice computing right now and is the number one responsibility of the app developer when they're building their app. And it's a real adoption problem too because on paper, if something works 90% of the time, that sounds pretty good.
Starting point is 00:22:37 Oh, I can talk to my computer and it responds and does what I say 90% of the time. But the truth is, if something doesn't work 10% of the time, I think most people are just going to stop trying. Yeah, it's a big problem. And I think it's things like Amazon and Google are trying to work on, which is for that 10%, when it doesn't understand, there's kind of two possible reasons for that. One is it just completely doesn't understand what you're saying. And so like there's nothing it can do.
But then there's also the case where it does understand what you're saying, it just doesn't have that functionality, right? So for the tide example, if you said, when's the high tide, you know, next month, next Tuesday at 8 p.m., it could understand that request but not be able to respond. So instead of not responding with that information, it could say, sorry, I don't know when the high tide is next week, or something like that. Um, so by the application responding with, hey, I hear you, I just can't help you, it helps to teach the user to be more experimental in the requests that they're saying. How important is it to have that be a human interaction? I mean, you were saying,
we want the app to be able to say, I can't do that. And I was thinking, "I'm sorry, Dave. I'm afraid I can't do that" would be hilarious to me once, maybe twice. But as you design these interactions, how much do you have it be efficient and how much do you have it be personable?
I have very strong opinions on this, which is to say you should avoid personality as much as possible unless you have a really, really, really good reason to have personality. Because it gets really annoying. Personality is cute when it did something right for you, but personality makes you want to throw your phone against the wall when it doesn't do what you want it to. Um, and it even gets annoying, like, uh, and I hate to bash on
Siri, but like every single thing I ask her to do, there's some cute turn of phrase in the response. If it's cold outside or when it's raining outside, she'll be like, don't forget to bring an umbrella. And it just gets really tedious after a while, where we don't expect our computers to have personality when we go do a Google search. We just want the information that we're asking for.
Starting point is 00:25:32 So until we go from 90% accurate to 98% accurate or higher, I think personality is only going to get in the way of a good user experience. I can see that. It would be fun for a little while, but I just want it to work. Spend less time messing around, making neat features, and more time. Right. And this kind of gets back to the jokes, which is like, it's great that you can tell me jokes,
Starting point is 00:26:01 but I'd much rather you be able to tell me what my latest email is or what my latest text message was. I can see that, although that's not how I use it because if I'm in the kitchen, I don't really want to play with my email. I do that all the other times. And you're not the only one.
Starting point is 00:26:23 I think the majority of people who have the Amazon Echo or the Google Home have them in or near their kitchen. So if not email, it would be great if she could help you cook through recipes or order groceries or order food from Uber Eats or Grubhub or something or make a restaurant reservation. And I do think that we haven't used the shopping list yet, but I do think that could be useful. Because you're standing there and you just ran out of oatmeal. Okay, Alexa, add oatmeal to my shopping list would be useful. Christopher's shaking his head like you haven't all unplugged them by now. So that's what you do.
You actually help people design these conversational methodologies. Yeah, so we have a tool called TinCan.ai, T-I-N-C-A-N dot A-I, which does two things. One is it helps you prototype a voice application. So right now it's really hard to go through the design process with voice apps. You know, wireframing doesn't really work for non-visual experiences. Like you can draw the outline of what your mobile app is going to look like, and you get a good sense of how that app is going to behave. But you can't do that with a voice app. You can, say, create sample scripts, like the app says this, and the user says this, and the app says this, but that doesn't,
Starting point is 00:27:48 it's only so helpful, and you can't do user testing with that. So that's the point of this prototyping tool is for designers and developers to very quickly be able to mock up how their voice app is going to behave and then put that in front of users to do user testing.
Starting point is 00:28:05 And the reason that's so important, which is what we've been talking about this whole time, is so you can collect the data of how people are going to interact with your skill. What are the ways that they're going to phrase the questions that you expect them to ask? Because there's just so many different ways. And so we help you to collect that data so that when you do release your skill, it's going to be more responsive to all the different ways that people are going to talk to it. So does it end up being a natural language processing problem where you have this giant tree of options and it's one of those and you just have to build the tree? So, yes and no.
Starting point is 00:28:46 So it is very much a natural language processing problem. But the way that, the limitations of voice computing right now is there's not really a tree. There's a set of actions a user can take called intents, and all of those intents are active at the same time. So if a user is using your skill and says something to activate that intent, that intent will be activated. And then it'll call a function for your app to take an action against that. And that action could be to look up the low tide time and speak it back to the user. Or it could be
Starting point is 00:29:23 to look up the low tide time, oh, they didn't give me a location. So to ask the user for that location and then speak them the time. And so you need all of the data to be able to recognize when a user has expressed that intent. And then you also have to specify the different types of data that you're going to receive from that voice request. So like a date and time, a physical location, a proper noun, a phone number, and things like that. And so there's some overlap between applications because anything having to do with the date and time can come in all of the ways a calendar meeting can be requested. Microsoft's LUIS or Facebook's Wit.ai or Google's API.ai will have not only built-in intents for common interactions, like confirmations, like yes, no, cancel, go back, those types of things, but also the common data types like calendars
and phone numbers and things. It feels like you're building a whole new programming language. It is a very new way of doing things. That wasn't a yes or no, but okay. So you're relying on the same programming languages. Like you can build an Alexa skill in whatever language you want, JavaScript, Java, anything that can run on a server and connect to Amazon over HTTP. And it's just that the architecture is different and relies so heavily on speech recognition and on natural language processing.
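For a rough sense of what declaring intents and their typed slots looked like for an Alexa skill around this time, here is a sketch of an intent schema written out as a Python dict; the tide intent and the custom slot name are hypothetical, and the exact format has evolved since, so treat it as illustrative rather than definitive.

```python
# Sketch of a 2016-era Alexa Skills Kit intent schema, written as a Python
# dict. Each intent names the typed slots (pieces of data) it expects;
# built-in slot types like AMAZON.DATE and AMAZON.TIME cover the many ways
# people say dates and times. Intent and custom slot names are hypothetical.
intent_schema = {
    "intents": [
        {
            "intent": "GetTideIntent",
            "slots": [
                {"name": "Beach", "type": "LIST_OF_BEACHES"},  # custom type
                {"name": "Date", "type": "AMAZON.DATE"},       # built-in type
                {"name": "Time", "type": "AMAZON.TIME"},       # built-in type
            ],
        },
        {"intent": "AMAZON.HelpIntent"},   # built-in intents: help, stop...
        {"intent": "AMAZON.StopIntent"},
    ]
}
```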
So the requirements of implementing natural language processing have a lot of nuances very specific to that process that we haven't really had to deal with before. But it's not really coding. It's just data collection and design. Okay. I often see many problems as coding, so the grammar generation
seems like the same sort of grammar generation you would use if you were building another language. But yes, I see many things as coding problems. So what sort of tactical advice, like things that if I was building an interface, what are the first five things you tell people to look out for? This variability has got to be one of them, but what else?
So is it okay if I answer with the steps that I would take in thinking through designing a voice app? Sure. Yeah. So the first step is to clearly define the actions that a user can take. And so let's go with this example of the tide app. So if you want to be able to report to users when the high and low tides are for a given day and for a given beach,
Starting point is 00:32:50 that means they're going to have to, they're going to ask for the high tide, they're going to ask for the low tide, they're going to ask for the tide table for that day, they're going to ask for a specific beach, they're going to ask for a specific location, and they may even ask qualitative questions like, is today a good day to go surfing? And so in the first step, it's to think of all the different actions a user is going to want to take. And the second step is
to limit what you are going to support to a very concrete set of actions. So in the first version, you may want to limit it to just the tide tables and put off qualitative questions on what that means for different water activities. Then step two is to think through all of the ways that people are going to express those different actions and the different ways they're going to form those questions. And then step three is, once you've come up with everything that you can think of, then you have to go ask other people or do testing with other people to see how they're going to express those same questions. Because no matter how many you can think of on your own, what other people are going to say is going to be dramatically different. So you have to be able
to account for that. Like, is the tide coming in would have been one of my questions. Right. Exactly. Which is totally solvable with the data set you have already. But it wasn't in your list.
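As a sketch of what that collection step produces, here are a few of the phrasings a hypothetical tide intent would need to recognize, with the variable parts marked as slots; a real list, gathered from user testing, would run far longer.

```python
# A few of the many phrasings that should map to the same tide intent, with
# {Slots} marking the variable parts. Real skills collect far more of these,
# ideally from user testing rather than the developer's imagination.
GET_TIDE_UTTERANCES = [
    "when is high tide at {Beach}",
    "what time is high tide {Date}",
    "when is the next high tide",
    "is the tide coming in",
    "is the tide going out at {Beach}",
    "what are the tides today",
    "give me the tide table for {Beach} {Date}",
]
```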
So yeah, you have to ask what other people think because you're not going to get all the answers yourself. Go on. Yeah, that's exactly right. And then once you have that, it's a pretty trivial coding problem. Alexa skills as they exist today
are just: when you talk to Alexa, the Echo will tell you when an intent is activated, it'll call a specific function, and that function will just need to access the data that you have, and then you speak that data back to the user.
Starting point is 00:35:02 And you generate the response. And that is just a complicated way of saying, writing out the text that the app is going to say. And then that's really it. That sounds so easy. And that's the thing, is it really isn't rocket science.
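Here is a minimal sketch of that intent-fires, function-runs, text-comes-back loop, shaped like a raw Alexa Skills Kit handler (for example, an AWS Lambda function); the intent and slot names are the hypothetical tide ones from above, and the tide lookup is stubbed out.

```python
# Minimal Alexa-style intent handler: the platform does the speech recognition
# and natural language understanding, then calls this with a JSON request that
# names the intent and its slot values. The skill's job is just to look up the
# answer and hand back the text to speak. Intent and slot names are hypothetical.

def lookup_high_tide(beach):
    # Stand-in for a real tide-table lookup.
    return "4:32 PM"

def build_response(text, end_session=True):
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

def lambda_handler(event, context):
    request = event["request"]
    if request["type"] == "LaunchRequest":
        return build_response("Which beach do you want tides for?", end_session=False)
    if request["type"] == "IntentRequest" and request["intent"]["name"] == "GetTideIntent":
        slots = request["intent"].get("slots", {})
        beach = slots.get("Beach", {}).get("value", "your beach")
        return build_response(f"High tide at {beach} is at {lookup_high_tide(beach)}.")
    return build_response("Sorry, I can't help with that yet.")
```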
It's just, I think, where mobile and web development is, you know, there's considerable amounts of software development involved, and then design kind of goes hand in hand with that and makes the product better. Voice is significantly less complicated on the software engineering side, but relies significantly more on good design and the design process and doing user testing to collect the data that's going to power that voice application. That makes a lot of sense. If you're getting to a 90% success rate and all you need to do is listen to a few more things your users are asking for and figure out how to map them into the functions you already have to get to 96 percent, that seems
so worth it. Right, right, it absolutely is. And it kind of is similar to traditional application development with analytics and seeing how people are interacting with your application and then focusing on the parts of the application where people aren't doing well. But where traditionally you have to come up with your own ways of making that process easier for the user.
Starting point is 00:36:42 With voice, you know what they're saying. You know what they're trying to do that you don't understand. So all you have to do is retrain on that expression you didn't understand or add the features to support that expression that the user requested that you couldn't respond to before. So just by listening to the user becomes so much more important,
Starting point is 00:37:07 but it's also so much easier with voice. So before we get off the topic of general voice, speech, natural language processing, and application design, one thing you haven't mentioned is localization and supporting multiple languages. That seems like a really big problem. It's the same problem, though.
Starting point is 00:37:32 You have to go from what's input. Is it, though? Because you have to repeat what we're saying about collecting ways people could ask questions or signify their intent. That can be quite different in different languages. You can have idioms. You're exactly right. You're exactly right.
Where, you know, historically, localization is just translating your website or your app into another language. With voice, not only do you have to do that, but you have to do this data collection in those other languages. So, you know, Amazon Alexa is available in the US, the UK, and Germany. And so for the German version, you have to translate your skill to speak back in German. But then you also have to get the
user utterances, or the expressions that people are going to use to speak to your app, in German. And then also in British English, because idioms, like you said, are going to be different. So the localization problem is greater than it has been before, but it is the same process as building the voice app in the first place. You just have to repeat it for every language that you're going to be supporting. That's sort of like how you have to do that, but for numbers, when you have an embedded system and you want to output numbers: they're all different. Yeah, it's, you know, it's a repeat of a large data set collection. It's just, it's not quite the same as shoving an Excel spreadsheet to a translation company and having it come back. That never works as well as people think anyway.
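A small sketch of what that per-language repetition can look like in code: both the spoken responses and the expected utterances get a copy per locale. The locale codes were Alexa's locales at the time; the strings themselves are just illustrative.

```python
# Localization for a voice skill means more than translating the answers:
# the example phrasings users will say also have to be collected per locale.
# Locale codes match the Alexa locales of the time; strings are illustrative.
UTTERANCES = {
    "en-US": ["when is high tide at {Beach}", "is the tide coming in"],
    "en-GB": ["when's high water at {Beach}", "is the tide on its way in"],
    "de-DE": ["wann ist Flut am {Beach}", "kommt die Flut gerade"],
}

RESPONSES = {
    "en-US": "High tide at {beach} is at {time}.",
    "en-GB": "High water at {beach} is at {time}.",
    "de-DE": "Flut am Strand {beach} ist um {time}.",
}

def speak(locale, beach, time):
    template = RESPONSES.get(locale, RESPONSES["en-US"])  # fall back to US English
    return template.format(beach=beach, time=time)

print(speak("de-DE", "Sylt", "16:32"))   # Flut am Strand Sylt ist um 16:32.
```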
Starting point is 00:39:07 Well. Okay, so I want to move on to a different topic. We've been talking about these applications that go to the net. And I have to admit, I don't really like it. One of the reasons I didn't get an Echo sooner was because I really don't like the fact that it goes out to the net. I have privacy issues with it and I have issues with both Siri and the Echo failing to get to the network for some stupid reason, even though everything else in the house
is fine. How long are we going to be stuck going out to the net to do things? When can we do it here? That is a good question. Um, I think the limiting factor right now is the speech recognition. It's still very processor-intensive to convert audio into text and run the machine learning algorithms and classifiers on that data, to the point where it requires offloading to the internet. But once you have that text, the natural language processing, on the other hand, is much less processor-intensive. The training itself is very processor-intensive, and that takes lots of GPUs running for hours to get right.
But once that model is trained, the actual classification is much more straightforward. So it's not... Speech recognition is one thing, and I think we're getting there. So hopefully in the next couple of years, we'll be able to do on-device speech recognition over an open dictionary. So right now you can do it with a fixed number of words, in the tens to hundreds, so less than 500.
Starting point is 00:41:09 But hopefully in the next couple of years, we'll get to the place where you can do open speech recognition, where you can just listen for anything and get the text of that, that's kind of the limiting factor because of, or at least with respect to speech input. I don't really know. I don't have a sense of what the text-to-speech requirements would be. Because all of those, oh no, you can do that on device now. I'm sorry. Yes, definitely. Yes. Yeah.
Starting point is 00:41:44 So, you know, text-to-speech you can do on device. And the natural language processing you can do on device, but there's no way, there's no services that let you do that right now. So you would have to build your own, essentially, which is a lot more complicated than it should be. So it's just going to, technically, it's not a problem. which is a lot more complicated than it should be. So technically it's not a problem,
Starting point is 00:42:10 it's just that it isn't available right now. And then I still think it's a couple years away before we can do speech recognition on device. There are some dev kits out there. There's the EasyVR, which is easy voice recognition on SparkFun. It's a $50 Arduino shield. And it's got a number of commands that are built in. There are things like robot stop, robot go, robot turn. And it's supposed to work pretty well.
Starting point is 00:42:38 And you can add things to it. It also supports five languages, including US, so that would be six languages. But it only supports, as you said, a very small dictionary. You can train it for a few more, but it's not going to be long sentences. It's not going to be, tell me a joke. It's going to be, joke now, joke cat, joke shark. Yeah, really, this is all I use her for. And then there's another one, a $20 one from Seed Studios, the Grove Speech Recognizer. It's a Cortex M0 Plus based, and it has 22 commands, which is pretty cool. But again, that's not a dictionary. And if you think about the Cortex M0 Plus and how small it is and how efficient it is,
that has a lot of goodness, but I suspect they're pushing that as hard as it can go, and they're still only getting 22 commands. If they added flash or whatever they needed to, or RAM probably, to do more, they would need to have a bigger processor, because you have to compare all of these phonemes against your... your signal processor gives you phonemes, and then the phonemes you have to compare to your dictionary of phonemes. And that process is very intensive and takes a while, and you want your device to be able to respond very quickly. It's one of the things that all the ones we've been talking about do: respond to you almost immediately, unless they've gone off to the net and fallen down. But when they
work, it's a snappy back-and-forth. Let's see what else we have. Oh, Radio Shack has been doing voice recognition or speech recognition chips since 1980. Well, I don't think they're still doing it. And how well is that working out? I don't remember. I think it worked. How many commands did it have? You know, a dozen or something.
Starting point is 00:44:42 I don't remember what it was, but it was a little masked ROM with a tiny microprocessor. It was like 10 bucks in an IC. It was for the robot kind of thing, like you were saying. But they have to be pretty separate words. They have to sound really different, you mean? On and off are terrible. Start and stop are not great because those are
both words that sort of sound like each other. Um, if you say... well, there's, yeah, you want to do things that sound different. Open and closed sound very different; that's a good thing. But for natural language processing, it's not a good thing. Right. I was going to say, I'm not as familiar with the boards that are out there, but on the software side, there's an open-source, on-device speech recognition library from Carnegie Mellon called PocketSphinx. And it is an on-device speech recognition engine
that, again, has limited dictionaries, but with much higher numbers that it will support, I think in the hundreds. But it's going to require a full Linux computer on a chip. So maybe you could run it on the new CHIP, the $8 PC, or something like that. Yes. But yeah, that's another thing, if people are interested in that,
that might be worth exploring. Their links are all broken, darn it. I will put a link in the show notes that actually works instead of... never mind. I'll send you good links on that if you want to check the show notes for PocketSphinx.
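For anyone who wants to try PocketSphinx, here is a keyword-spotting sketch that assumes the pocketsphinx Python bindings and their LiveSpeech helper; the exact arguments vary by version, so check the project's own documentation against your install.

```python
# Keyword spotting with CMU PocketSphinx: instead of open-dictionary
# recognition, listen continuously for a known phrase, which is the kind of
# small-vocabulary task that can run entirely on-device.
# Assumes the pocketsphinx Python package and its LiveSpeech helper
# (pip install pocketsphinx) plus a working microphone; arguments may differ
# between versions.
from pocketsphinx import LiveSpeech

speech = LiveSpeech(
    lm=False,                 # disable the full language model
    keyphrase="robot stop",   # the phrase to spot
    kws_threshold=1e-20,      # detection sensitivity; tune per environment
)

for phrase in speech:         # yields a result each time the phrase is heard
    print("heard:", phrase.segments(detailed=True))
```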
And there was another group, Audeme. They were on the Amp Hour, which is sort of our... brothers, I almost said sister show, but that doesn't work exactly. We'll go with brother, cousin, sworn enemy show. Second cousin show. And they talked a lot about the technology of putting it together. But that's a $70 Linux-based part, and wow. If you want to, and you know that that's retail, I mean, it's maker retail, so assume that you could actually get those parts for 10 to 20 dollars and put them in your
device, and it would all work, but that means your cost has to go way up, and it starts to make sense why we aren't doing these locally. But someday, I mean, if Moore's law works for us, then that will be cheaper soon. And in 10 years, we'll just be walking around talking to ourselves without cell phones. Yeah. I mean, I think we're definitely going to be walking around talking to ourselves very soon. It's just definitely going to rely on the internet, and a company is going to be able to record all of that. You're not helping my privacy concerns here. So I think the privacy concerns are very valid and it's a very,
Starting point is 00:48:09 probably the technology issues aside, privacy concerns are the biggest issue facing a wider adoption of really, really interesting things that we can do with voice because of just companies using that data and knowing everything about you and then governments doing the same in this country and others. And I think people are going to have a very justified negative response if we push that
Starting point is 00:48:43 too far forward. So I think in order to kind of see the full potential of what we can do with voice computing, we're going to need more privacy protections like written into law. And I don't know when that's going to ever happen. Well, let's not go further down that path today. Okay, okay. But I agree, we're going to have to sort this out because it's all there.
Starting point is 00:49:13 Different companies have different attitudes towards it. I mean, Apple, at least consistently up to now, has been very, you know, privacy is one of our product features. And so, you know, I think they would be pushing hard to get stuff processing locally, and then it's just a matter of, okay, you're doing a web search, or you have text going up to the web. Or text going to your apps.
Starting point is 00:49:37 But it's not consistently listening to everything you're saying and potentially having a privacy hole there. But it's going to be interesting to see if consumers make that something that they want, I guess. Because if they don't, then it's kind of incumbent on companies to force it on them. That's, no, we're going to do this.
Right? All right, let's move on. We were connected through the O'Reilly Design Conference. You'll be speaking there next spring. What are you going to talk about? Yeah, I'm going to be running a workshop on, uh, designing and prototyping voice-based applications. So I think we're going to go in more depth into the topics that we covered today,
Starting point is 00:50:27 like what are the underlying technologies that are making this wave of voice computing possible and what does that mean for what you can do as a voice app developer and designer? And then what are the design considerations and best practices in voice user interface design? And then we're actually going to sit down and prototype a voice app together using the prototyping tool that we've created. And then we'll do user testing with other people in the workshop.
Starting point is 00:51:03 And hopefully, I'm really excited to see what people come up with. Do you think people will bring their apps and say, how do I do this? I hope so. I think that'll make for a much more interesting workshop. I think I'm curious to see, like one of the big things that I want to talk about is, does it make sense for your use case
for there to be a voice application? Because not everything, at least today, should have a voice app, because of the limitations and just what makes sense, right? I don't think people are going to be doing a lot of clothing shopping on their Amazon Echo. And yet she tries. What apps are really good candidates? I mean, jokes aside, what should we be looking for soon? I think the question that should be asked is: given the context that the user is in, like their physical environment and what they're doing, does voice provide a more efficient means of completing that task than other possibilities? And that's kind of where the Echo has done a really good job: in the home, most people don't have their phones on them. So having a device that is just there that you can talk to is going to be a better experience than having to go find your phone
Starting point is 00:52:29 and then pick it up and complete that task. Especially for interacting with your connected home. It's a lot easier to say, Alexa, turn on the lights than it is to either get up off the couch and go hit the light switch or pull out your phone from your pocket open up your smart app. You should totally have seen Chris try to turn on the lights recently. It took him like 12 tries. It would have been so much easier to go stand over there. But yes, sorry. No, I think there's an
Starting point is 00:52:56 entire other episode that can be done on just how frustrating the smart home is as a consumer. But yeah, so that's the overarching question is, does voice provide a more efficient means of completing a task than other alternatives? And then you have to ask that for your specific use case,
but also for the device you're building for. And right now, most people who are thinking about voice, I think, are thinking about it on the Amazon Echo and the Google Home, when that opens up to developers later this month. And so apps that are related to food or cooking or kind of the family group experience are going to lend themselves much better in the short term to voice computing. So anything with recipes, with food ordering, with restaurant reservations and food search, those types of tasks I think are very well suited right now. It could change channels on the TV too. That would be nice. It can if you have a Logitech Harmony. And that's one of my next purchases because I would really like to set that up.
Starting point is 00:54:12 Which clearly we do not. Behind the time on technology. It's clearly a business expense. Okay, so what apps don't make good candidates for voice recognition? You mentioned clothing and I could see how like some of the picture only like Pinterest probably wouldn't be that useful.
Starting point is 00:54:35 Right. Any things that are very visual, at least right now are not going to make sense. You know, there are whispers that Amazon's coming out with an Echo with a touchscreen. So that might be better, but still probably not going to be better than just using it on your phone. So anything heavily visual, anything where you are kind of communicating
Starting point is 00:54:56 very sensitive information, so not necessarily secretive, but that also counts. So like credit card information or like your address, things that if you get that piece of information wrong, just the experience of trying to do that through speech is going to be very poor and not great. So I would avoid anything that has to rely on that. Yeah, there's some phone tree applications that want you to say your social security number. And I always like, you have no idea where I am. I could be in a coffee shop. This isn't the sort of thing you want me to do out loud. Right, right. I think that kind of gets to one of the other
Starting point is 00:55:40 kind of things to keep in mind with the Echo is it's a public device in your home. And unless you live by yourself, other people are going to either overhear the conversation that you're having or be able to access that same data. And so we talked about email a little earlier, but that's, I think, an example of something that people might not be comfortable doing on this public or shared device. So anything with very personal information like mental health or email or journaling or other things of that nature I don't think are good fits
Starting point is 00:56:18 for voice computing as it exists today. I think that'll change very quickly, but right now, probably not. That makes sense. There are a lot of things that I could see in that whole stuff you don't tell everybody all the time bucket. Right. Cool.
Starting point is 00:56:41 Well, it does sound like it's going to be an interesting tutorial. Let's see. You care about voice recognition a lot, clearly, and you, or speech recognition, and how it's designed. And you mentioned that you use a screen reader. What is that like? Extremely, extremely frustrating. So the reason I got into this whole space is because five years ago, I found out that I was going blind. I was diagnosed with a genetic disorder where I'm losing my central vision. And as my vision has gotten worse over that period of time, I've gotten very familiar with the assistive technology that's out there and specifically the screen reader. And I'm very disappointed with how they work and the experience of using that product. And for those of you who aren't familiar,
Starting point is 00:57:41 a screen reader is a device for the visually impaired or a software for the visually impaired where there's a cursor on the screen that you can move using keyboard commands on a laptop or swipe gestures on a smartphone. And whatever that cursor highlights, that text is read aloud or the metadata for that text is read aloud. So you're navigating this two-dimensional visual experience in a one-dimensional audio stream using keyboard commands. And it's very cumbersome. And that's only when it works well. A lot of websites and a lot of mobile applications don't take the steps necessary to make their products accessible.
Starting point is 00:58:24 And so this creates a really poor experience for the vast majority of the blind community, not to mention the fact that most people are losing their vision from aging-related disorders. So they're very unlikely to know how to do email, let alone use email with the screen reader and other assistive technology. So voice computing represents this much better way of doing things where these applications are designed for audio first, and they're going to be a much more intuitive experience for someone who's less tech savvy. And conversation is something that we've all been doing since we've been able to talk. So it's much more comfortable. So I'm really optimistic and really hopeful that in the near
future, we're going to be able to do everything we can do on our smartphones through voice instead. So you didn't, as we were talking about apps that weren't necessarily good candidates, you didn't mention things you want to do fast. And as I think about screen readers and how hard it is to navigate a website, I imagine that that is incredibly slow, and that these speech conversational things are also pretty slow. And you said you listen to podcasts really fast. How do we get these to be less slow? I've always found in meetings, I'm like, can't we just do this over IM where we can all type faster? Yeah. So I think a big part is just letting you increase the speech rate of your voice service.
And so that's a standard feature in screen readers that I would love to see on Alexa or Siri or Google Assistant. And that would help make these interactions more efficient. Alexa, talk faster. Will that work? No, it won't. It doesn't work. I was waiting to see what she would say back to you. Oh no, she's in a different room. Okay. Um, no, I wish. I wish. Um, so that's one thing that we can do, but again, that's more for power users who are used to it. Um, I think other things that we can do to make the experiences more efficient are just being
Starting point is 01:00:49 more responsive to more natural language commands, rather than having to force you to go through a sequence of very specific commands. One, and then two, allow you to make multiple requests in the same kind of interaction. I'm trying to think of a good example of that. Set two timers, one for five minutes and one for ten minutes. That's not a great one, but
Starting point is 01:01:17 an idea of that would be more efficient than saying Alexa set a timer for five minutes, Alexa set a timer for ten minutes. And there's also UI things that you can do to make the experience more efficient that I think will help kind of move us in that direction. But that's difficult. I mean, the timers is not too difficult
Starting point is 01:01:39 because you're going to the same app in the end. But if I wanted to say, put eggs on my grocery list and remind me to go to the grocery before in the end. But if I wanted to say, put eggs on my grocery list and remind me to go to the grocery before I go home, those are two separate apps. And so you have to have enough upper level natural language processing
Starting point is 01:01:58 to split it between them. Yeah, yeah. You're exactly right. And I think that's going to be a challenge for these platforms and the ones who can do it the best are going to be the ones that succeed. And hopefully we see
Starting point is 01:02:15 that sooner rather than later. But you're right. It's navigating between applications, but it's also carrying on a conversation with the same application. Right now in Alexa, your skill is only active for like 16 seconds. So if you're not carrying on that conversation continuously, you have to re-invoke that application and start over every time you talk to it.
Starting point is 01:02:43 So like with the recipe example, if you're stepping through the steps in a recipe, you'd have to say, Alexa, ask the recipe app what the next step in the recipe is every time. And that's less efficient than just saying, Alexa, what's the next step? Well, there's, I mean, that robot game lets you stay in it until you exit it or die.
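A sketch of the response-level mechanism involved: an Alexa-style skill keeps its session open by marking the response as not ending the session and supplying a reprompt, so the user can answer directly instead of re-invoking the skill; the field names follow the public response format, but treat the details as illustrative.

```python
# Keeping a voice session open between turns: the skill marks the session as
# not ended and provides a reprompt to use if the user stays silent, so the
# user can just say "next step" instead of re-invoking the whole skill.
def step_response(step_text):
    return {
        "version": "1.0",
        "sessionAttributes": {"state": "COOKING"},   # carried to the next turn
        "response": {
            "outputSpeech": {"type": "PlainText", "text": step_text},
            "reprompt": {
                "outputSpeech": {
                    "type": "PlainText",
                    "text": "Say 'next step' when you're ready to continue.",
                }
            },
            "shouldEndSession": False,   # keep listening for the next request
        },
    }
```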
Starting point is 01:03:08 So that's coming, right? You don't know what the robot game is and I don't remember the name. So that's just not a good information, is it? I hope, I would imagine that they're improving that. Like it just makes so much sense. So hopefully that's something that's happened and we'll continue to see improvements in that direction. And your recipe app is exactly what we need here. Because once you can do that,
once you can step through something that's a fairly linear process, then you can start stepping through things that aren't as linear. And even the linear process of having a recipe, at some point you're going to want to have a timer, and it builds on itself. And it does seem like someday we will get to conversation. This whole thing reminds me of those old Infocom games. Yes. From the 80s, those text adventures. Zork! And yeah, it's the same kind of problem of anticipating any kind of input. Where for years we've had, okay, there's a mouse pointer and a keyboard.
Starting point is 01:04:16 And the keyboard has a fixed set of key commands and the mouse points exactly where you want it to. And now it's all, it's a completely different way of thinking about applications, and I think that's what's hard for people. Yeah, we're definitely in a transition phase. And as we learn best practices, not only in the design, but in the architecture side of things, we're going to start seeing better and better applications. Well, that sounds like it's a good place to start wrapping things up.
Chris, do you have any last questions? Yeah, I wanted to ask, since a lot of people who listen to the show are tinkerers and electrical engineering people and software people who make projects of their own, or might be exploring little Internet of Things projects. What would you suggest if somebody said, hey, you know, I'd really like to hook voice recognition, speech recognition into my project or product.
Starting point is 01:05:18 I don't know where to start. It wouldn't be making their own voice recognition system, but it would probably be hooking into Amazon Echo's ecosystem or something. So where would somebody start to look at that? Yeah, I think the Echo is a great place to start. I think it's the most open right now, and there's a lot of support for kind of hooking it up to more project-type things with If This Then That and other services that you can use to link up voice commands in your skill
Starting point is 01:05:51 to different devices in your home or projects you may be working on. So I would definitely go to Alexa's developer pages, which I don't remember the URL off the top of my head. There's also a pretty active developer forum there, which has a lot of tinkerers and kind of more hardware maker type communities of people. And there's also a Slack channel that they have.
Starting point is 01:06:19 So there's definitely a large developer community there. And that's probably where I would start. Alexa skills kit, it's developer.amazon.com. And there. That's probably where I would start. Alexa Skills Kit. It's developer.amazon.com. And there will be a link in the show notes. All right. That sounds like fun, actually. Think of all the things you can make people do
Starting point is 01:06:35 if you can interact with them voice-wise. Make people do? You're going to take over people now? Well, think of all the things you can make them say. It would just be funny to have a game where you had to guess the word and then I made you say some quote or some stupid thing. Reverse double plus upside down social engineering people. Yeah, you know.
Got it. You're going to write the code to make the people do things, not the other way around. We're going to have voice recognition change how people speak instead of just recognizing things differently. It's going to have to change how I speak. Uh, well, Chris, thank you so much for being with us. Do you have any last thoughts you'd like to leave us with? Um, no, just thank you so much for having me. If you're interested in voice computing, especially voice design, definitely check out our prototyping tool. It's a great place to start playing around with designing voice experiences. And again, that's at TinCan.ai. Our guest has been Chris Maury, the founder of Conversant Labs, a company providing design and development tools to help create fully conversational applications for iOS and the Amazon Echo.
Starting point is 01:07:55 I'd like to send a special thank you out to O'Reilly's Nina Cavanis for hooking me up with Chris. Their design conference is in San Francisco in March, March 2017. Wow, that's coming fast. Thank you also to Christopher for producing and co-hosting. And of course, thank you for listening. Hit the contact link on Embedded FM if you'd like to say hello, sign up for the newsletter, subscribe to the YouTube channel,
Starting point is 01:08:20 enter the Tinkercad contest, and don't forget, Alexa, play Embedded FM. All right, final thought, final thought. Let's see. How about one from Helen Keller? Be of good cheer. Do not think of today's failures, but of the successes that may come tomorrow. You've set yourself a difficult task, but you will succeed if you persevere.
Starting point is 01:08:47 And you will find a joy in overcoming obstacles. Remember, no effort that we make to attain something beautiful is ever lost. Embedded is an independently produced radio show that focuses on the many aspects of engineering. It is a production of Logical Elegance, an embedded software consulting company in California. If there are advertisements in the show, we did not put them there and do not receive money from them. At this time, our sponsors are Logical Elegance and listeners like you.
