Y Combinator Startup Podcast - #25 - Baidu's AI Lab Director on Advancing Speech Recognition and Simulation

Starting point is 00:00:00 Hey, this is Craig Cannon, and you're listening to Y Combinator's podcast. This episode is with Adam Coates. Adam's the director of Baidu's Silicon Valley AI Lab, and what they focus on is developing AI technologies that will impact at least 100 million people. We spent a good chunk of this episode talking about Adam's work in speech to text and text to speech. So if you want to learn more about those projects, you can check out research.baidu.com. And as always, if you want to read the transcript or watch the video, you can check out blog.ycombinator.com. All right, here we go. Today we have Adam Coates here for an interview. Adam, you run the AI lab at Baidu in Silicon Valley. Could you just give us a quick intro and explain what Baidu is for people who don't know? Yeah, so Baidu is actually the largest

Starting point is 00:00:43 search engine in China. So it turns out the internet ecosystem in China is this incredibly dynamic environment. And so Baidu, I think, sort of turned out to be an early technology leader and really established itself in PC search. But then also has sort of remade itself in the mobile revolution. And increasingly today is becoming an AI company, recognizing the value of AI for a whole bunch of different applications, not just search. Okay.

Starting point is 00:01:11 And so, yeah, what do you do exactly? So I'm the director of the Silicon Valley AI Lab, which is one of four labs within Baidu Research. So especially as Baidu is becoming an AI company, the need for a team to sort of be on the bleeding edge and understand all of the current research, be able to do a lot of basic research ourselves, but also figure out how we can translate that

Starting point is 00:01:35 into business and product impact for the company. That's increasingly critical, and so that's what Baidu Research is here for. The AI lab in particular, we kind of founded it recognizing how extreme this problem was about to get. So I think the deep learning research and AI research right now is flying forward so rapidly that the need for teams to be able to both understand that research, but also quickly translate it

Starting point is 00:02:04 into something that businesses and products can use is more critical than ever. So we founded the AI lab to try to close that gap and help the company move faster. And so then how do you break up your time in between doing basic research around AI and actually implementing it, like bringing it forward to a product? There's no hard and fast rule to this. I think one of the things that we try to repeat to ourselves every day is that we're mission-oriented. So the mission of the AI lab is precisely to create AI technologies that can have a significant impact on at least 100 million people. We chose this to sort of keep bringing ourselves back to the final goal, that we want all the research we do to ultimately end up in the hands of users.

Starting point is 00:02:55 And so sometimes that means that we spot something that needs to happen in the world to really change technology for the better and to help Baidu. But no one knows how to solve it. And there's a basic research problem there that someone has to tackle. And so we'll sort of go back to our visionary stance and think about the long term and invest in research. And then as we have success there, we shift back to the other foot and take responsibility for carrying all of that to a real application and making sure we don't just solve the 90% that you might put in, say, your research paper, but we also solve the last mile. We get to the 99.9%. So maybe the best way to do this then is to just explain, like, something that started with research

Starting point is 00:03:43 here and how that's been brought on to, like, a full-on product that exists. So I'll give you an example. We've spent a ton of time on speech recognition. So speech recognition a few years ago was one of these technologies that always felt pretty good, but not good enough. And so traditionally, speech recognition systems have been heavily optimized for things like mobile search. So if you hold your phone up close to your mouth and you say a short query. And talk in a non-human voice. Exactly. The systems can figure it out and they're getting quite good. I think, you know, the speech engine that we've built at Baidu, called Deep Speech, it's actually superhuman for these short queries because you have no context, people can have thick accents.

Starting point is 00:04:30 So that speech engine actually started out as a basic research project. We looked at this problem, we said, gosh, what would happen if speech recognition were human level for every product you ever used? So whether you're in your home or in your car or you pick up your phone, whether you hold your phone up close or hold it away, if I'm in the kitchen and my toddler is, you know, yelling at me, can I still use a speech interface? Could it work as well as a human being understands us? And so then how did you do that, what is the basic research that moved it forward to put it in a place that it's useful? So we had the hypothesis that maybe the thing

Starting point is 00:05:07 holding back a lot of the progress in speech is actually just scale. Maybe if we took some of the same basic ideas we could see in the research literature already and scaled them way up, put in a lot more data, invested a lot of time in solving computational problems, and built a much larger neural network than anyone had been building before for this problem, we could just get better performance. And lo and behold, with a lot of effort,

Starting point is 00:05:33 we ended up with this pretty amazing speech recognition model that, like I said, in Mandarin at least, is actually superhuman. You can actually sit there and listen to a voice query that someone is trying out, and you'll have native speakers sitting around debating with each other, wondering what the heck the person is saying. Wow.

Starting point is 00:05:52 And then the speech engine will give an answer, and everybody goes, oh, that's what it was, because it's just such a thick accent from perhaps someone in rural China. How much data do you have to give it to train it, you know, to train it on a new language? Because I think on the site I saw it was English and Mandarin. Yeah. Like if I wanted German, how much would I have to give it? So one of the big challenges for these things is that they need a ton of data.

Starting point is 00:06:16 So our English system uses like 10 to 20,000 hours of audio. The Mandarin systems are using even more for top-end products. So this certainly means that the technology is at a state where to get that superhuman performance, you've got to really care about it. So for Baidu voice search, maps, things like that that are flagship products, we can put in the capital and the effort to do that. But it's also one of the exciting things going forward in the basic research that we think about: how do we get around that? How can we develop machine learning systems

Starting point is 00:06:51 that get you human performance on every product and do it with a lot less data? So what I was wondering then, like, did you see that Lyrebird thing that was floating around the internet this week? Okay. They claim that they don't need all that much time, all that much data, audio data, to emulate your voice or simulate, whatever they call it. You guys have a similar project going on, right? That's right. Yeah, we're working on text to speech. Why can they achieve that with less data? I think the technical challenge behind all of this is there's sort of two things that we can do. One is to try to share data across many applications.

Starting point is 00:07:27 So to take text to speech as one example, if I learn to mimic lots of different voices and then you give me the thousand-and-first voice, you'd hope that the first thousand taught you virtually everything you need to know about language and that what's left is really some idiosyncratic change that you could learn from very little data. So that's one possibility. The other side of it is that a lot of these systems, this is much more important for things like speech recognition

Starting point is 00:07:57 that we were talking about, is we want to move from using supervised learning, where a human being has to give you the correct answer in order for you to train your neural network, but move to unsupervised learning, where I could just give you a lot of raw audio, and have you learn the mechanics of speech before I ask you to learn a new language. And hopefully that can also bring down the amount of data that we need.

Starting point is 00:08:22 And so then on the technical side, like, could you give us somewhat of an overview of how that actually works? Like how do you process a voice? For text to speech? Let's do both, actually, because I'm super interested. All right, so speech to text. Yeah, let's start with speech recognition. Before we go and train a speech system, what we have to do is collect a whole bunch of audio clips.

Starting point is 00:08:47 So, for example, if we wanted to build a new voice search engine, I would need to get lots of examples of people speaking to me, giving me little voice queries. And then I actually need human annotators, or I need some kind of system that can give me ground truth, that can tell me for a given audio clip what was the correct transcription. And so once you've done that, you can ask a deep learning algorithm to learn the function that predicts the correct text transcript from the audio clip.
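As a rough illustration of what "ground truth" buys you: once each audio clip has a human transcription, a system's output can be scored against it. The sketch below uses word error rate, a common metric for this; the conversation does not say which metric the team used, and the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # before the current one and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical label and prediction: one wrong word out of five.
    label = "turn on the kitchen lights"
    prediction = "turn on the chicken lights"
    print(word_error_rate(label, prediction))
```

A lower score means the predicted transcript is closer to the human label; training amounts to adjusting the network so that scores like this improve across many labeled clips.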

Starting point is 00:09:19 So this is called supervised learning. It's an incredibly successful framework. We're really good with this for lots of different applications. But the big challenge there is those labels, that someone has to be able to sit there and give you, say, 10,000 hours worth of labels, which can be really expensive. Okay. So. Yeah, how does it actually recognize, like, what is the software doing to recognize the intonation of a word? Well, traditionally, what you would have to do is break these problems down into lots of different stages.

Starting point is 00:09:53 So I, as a speech recognition expert, would sit down and I would think a lot about what are the mechanics of this language. So for Chinese, you would have to think about tonality and how to break up all the different sounds into some intermediate representation. And then you would need some sophisticated piece of software we called a decoder that goes through and tries to map that sequence of sounds to possible words that it might represent. Oh, okay. And so you have all these different pieces and you'd have to engineer each one, often with its own expert knowledge.

Starting point is 00:10:30 But Deep Speech and all of the new deep learning systems we're seeing now try to solve this in one fell swoop. So really the answer to your question is kind of the vacuous one, which is that once you give me the audio clips and the characters that it needs to output, a deep learning algorithm can actually just learn to predict those characters directly. And in the past, it always looked like there was some fundamental problem that maybe we could never escape this need for these hand-engineered representations. But it turns out that once you have enough data, all of those things go away. And so where did your data come from?
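To make "predict those characters directly" a little more concrete, here is a toy sketch of one common recipe for end-to-end speech models: the network scores every character (plus a special blank symbol) for each short slice of audio, and the output is read off by taking the best symbol per slice, collapsing repeats, and dropping blanks. This is often called greedy CTC-style decoding. The conversation does not describe the decoding step, so the alphabet, the blank symbol, and the hand-written scores below are all assumptions for illustration.

```python
BLANK = "_"  # assumed "no character" symbol
ALPHABET = [BLANK, " ", "a", "c", "t"]  # toy alphabet, not a real model's


def greedy_decode(frame_scores):
    """Pick the best symbol per frame, collapse repeats, drop blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    chars = []
    previous = None
    for index in best:
        # A symbol is emitted only when it differs from the previous frame's
        # symbol, so "cc" collapses to "c"; a blank in between allows doubles.
        if index != previous and ALPHABET[index] != BLANK:
            chars.append(ALPHABET[index])
        previous = index
    return "".join(chars)


def fake_frame(symbol):
    """Stand-in for one frame of network output favouring `symbol`."""
    return [0.9 if s == symbol else 0.025 for s in ALPHABET]


if __name__ == "__main__":
    frames = [fake_frame(s) for s in ["c", "c", BLANK, "a", "t", "t"]]
    print(greedy_decode(frames))
```

In a real system the per-frame scores would come from a trained neural network rather than `fake_frame`, and production systems typically add a language model on top of this step.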

Starting point is 00:11:08 Like 10,000 hours of audio? We actually do a lot of clever tricks in English, where we don't have a large number of English-language products. So, for example, it turns out that if you go onto, say, a crowdsourcing service, you can hire people very cheaply to just read books to you. Wow. And it's not the same as the kinds of audio that we hear in real applications, but it's enough to teach a speech system all about liaisons between words and you get some speaker variation

Starting point is 00:11:41 and you hear strange vocabulary where English spelling is totally ridiculous. Oh, right. And in the past, you would hand engineer these things. You'd say, well, I've never heard that word before. So I'm going to bake the pronunciation into my speech engine. But now it's all data-driven. So if I hear enough of these unusual words, you see these neural networks actually learn to spell on their own, even considering all the weird exceptions of English. Interesting. And you have the input, right? Because I've heard of people doing it with like a YouTube video, but then you need a caption as well with the audio. So it's twice as much, if not more, work. Interesting. And so then what about the other way around? How does that work on the technical side?

Starting point is 00:12:21 Right. So that's one of the really kind of cool parts of deep learning right now, that a lot of these insights about what works in one domain keep transferring to other domains. So with text to speech, you could see a lot of the same practices. So you would see that a lot of systems were hand-engineered combinations of many different modules. And each module would have its own set of machine learning algorithms with its own little tricks. And so one of the things that our team did recently with a piece of work that we're calling Deep Voice was to just ask, what if I rewrote all of those modules using deep learning for every single one? To not put them all together just yet, but even just ask,

Starting point is 00:13:03 can deep learning solve all of these adequately to get a good speech system? It turns out the answer is yes. You can basically abandon most of this specialized knowledge in order to build all of the subsequent modules. And in more recent research that's in the deep learning community, we're seeing that, of course, everyone is now figuring out how to make these things work end to end.

Starting point is 00:13:27 They're all data-driven, and that's the same story we saw for Deep Speech. So we're really excited about that. That's wild. And so do you have a team just dedicated to parsing, like, research coming out of universities and then figuring out how to apply it? Are you testing everything that comes out? It's a bit of a mix.

Starting point is 00:13:44 It's definitely our role to not only think about AI research, but to think about AI products and how to get these things to impact. I think there is clearly so much AI research happening that it's impossible to look through everything. But one of the big challenges right now is to not just digest everything, but to identify the things that are truly important. So what's like a 90 million person product? You're like, oh, man. Well, it's the speech recognition we chose because we felt in aggregate, it had that potential. So as we have the next wave of AI products,

Starting point is 00:14:27 I think we're going to move from these sort of bolted on AI features to really immersive AI products. So if you look at how keyboards were designed a few years ago for your phone, you see that everybody just bolted on a microphone and they hooked it up

Starting point is 00:14:42 to their speech API. And then that was fine for that level of technology. But as the technology is getting better and better, we can now start putting speech up front. we can actually build a voice first keyboard. So it's actually something we've been prototyping in the AI lab.

Starting point is 00:15:00 You can actually download this for your Android phone. So it's called TalkType, in case anybody wants to try it. But it's remarkable how much it changes your habits. I use it all the time and I never thought I would do that. And so it emphasized to me why the AI lab is here, that we can sort of discover these changes in user habits. We can understand how speech recognition can impact people much more deeply than it could when it was just bolted onto a product. And that sort of spurs us on to start looking at the full range of speech problems that we have to solve to get you away from this sort of close-talking voice search scenario and to one where I can just talk to my phone or talk to a device and have it always work.

Starting point is 00:15:45 So as you've, like, you know, given this to a bunch of users, I assume, and gotten their feedback, have you been surprised with the, like, voice interface? I know lots of people talk about it. Some people say, like, it doesn't really make sense. You know, for example, you see, like, Apple transcribing voicemails now. Are there certain use cases where you've been surprised at how effective it is and others where you're like, I don't know if this will ever play out? You know, I think the really obvious ones like texting seem to be the most popular. I feel like the feedback that is maybe the most fun for me is when people with thick accents post a review. They say, oh, I have this crazy accent I grew up with and nothing works for me. But I tried this new keyboard and it works amazingly well. I have a friend who has a thick Italian

Starting point is 00:16:34 accent and he complains all the time that nothing works. And it's working. And all of this stuff now works for different accents because it's all data-driven. We don't have to think about how we're going to serve all these different users. If they're represented in the data sets and we get some transcriptions, we can actually serve them in a way that really wasn't possible when we were trying to do it all by hand. That's fantastic. And have you done it, like, through the whole system? In other words, like if I want to give myself, you know, an Italian-American accent, like can I do that yet with Baidu?

Starting point is 00:17:03 We can't do that yet with our TTS engine, but it's definitely on the way. Okay, cool. So what else is on the way? What are you researching? What products are you working on? What's coming? So speech and text-to-speech, I think these are part of a big effort to make this next generation of AI products really fly.

Starting point is 00:17:23 Once text to speech and speech are your primary interface to a new device, they have to be amazingly good and they have to work for everybody. And so I think there's actually still quite a bit of room to run on those topics, not just making it work for a narrow domain, but making it work for really the full breadth of what humans can do. Do you see a world where you can run this stuff locally, or will they always be calling an API? Yeah. I think it's definitely going to

Starting point is 00:17:48 happen. One kind of funny thing is that if you look at folks who maybe have a lot less technical knowledge and don't really have the sort of instinct to think through how a piece of technology is working on the back end, I think the response to a lot of AI technologies now, because they're reaching this sort of uncanny valley, is that we often respond to them as though they're sort of human. And that sets the bar really high. Our expectations for how delightful a product should be are now being set by our interactions with people. And one of the things we discovered as we were translating Deep Speech into a production system was that latency is a huge part of that experience.

Starting point is 00:18:33 That the difference between 50 or 100 milliseconds of latency and 200 milliseconds of latency is actually quite perceptible. And really, anything we can do to bring that down actually affects user experience quite a bit. We actually did a combination of research, production hacking, working with product teams, thinking through how to make all of that work. And that's a big part of the sort of translation process that we're here for. That's very cool. And so, yeah, what happens on the technical side to make it run faster? So when we first started, like, the basic research for Deep Speech, like all research papers, you know, we chose the model that gets the best benchmark score, which turns out to be horribly impractical for putting

Starting point is 00:19:18 online. And so after sort of the initial research results, the team sat down with just a set of what you might think of as product requirements and started thinking through what kinds of neural network models will allow us to get the same performance, but don't require so much sort of future context. They don't have to listen to the entire audio clip before they can give you a really high accuracy response. So kind of doing that like, you know, the language prediction stuff, like the OpenAI guys were doing with the Amazon reviews, like predicting what's coming next? Maybe not even predicting what's coming next, but one thing that humans do without thinking about it is if I misunderstand a word that you've said to me, and then a couple of words later, I pick up

Starting point is 00:20:06 context that disambiguates it. I actually don't skip a beat. I just understand that as one long stream. And so one of the ways that our speech systems would do this is that they would listen to the entire audio clip first, process it all in one fell swoop and then give you a final answer. And that works great for getting the highest accuracy. But it doesn't work so great for a product where you need to give a response online, give people some feedback that lets them know that you're listening. And so you need to alter the neural network so that it tries to give you a really good answer using only what it's heard so far, but can then update it very quickly as it gets more

Starting point is 00:20:46 context. So I've noticed over the past few years, people have, like, gotten quite good at structuring sentences so Siri understands them. Um, you know, they put, like, the noun in the correct position. So it, like, feeds back the data correctly. I found this when I was traveling, like I was using Google Translate. And I, uh, after like one day recognized that I couldn't give it a sentence. But if I gave it a noun, I could just show it to someone. And like, if I just show, like, you know, bread, it will translate it perfectly and give it. Um, do you find that, like, we're going to have to slightly adapt how we communicate with machines, or is your goal to communicate perfectly as we would?

Starting point is 00:21:23 I really want it to be human level, and I don't see a serious barrier to getting there, at least for really high-value applications. I think there's a lot more research to do, but I sincerely think there's a chance that over the next few years, we're going to regard speech recognition as a solved problem. That's very cool. So what are the really hard things happening right now?

Starting point is 00:21:43 Like, what are you not sure if it'll work? So I think we were talking earlier about getting all this data. So for problems where we can just get gobs of labeled data, I think we've got a little bit more room to run there, but we can certainly solve those kinds of applications. But there's a huge range of what humans are able to do, often without thinking, that current speech engines just don't handle. We can deal with cross-talk and a lot of background noise. If you talk to me from the other side of a room, even if there's a lot of reverb and things going on,

Starting point is 00:22:18 it usually doesn't bother anybody that much. And yet, current speech systems often have a really hard time with this. But for the next generation of AI products, they're going to need to handle all of this. And so a lot of the research that we're doing now is focused on trying to go after all of those other things.

Starting point is 00:22:36 How do I handle people who are talking over each other, or handle multiple speakers who are having a conversation very casually? How do I transcribe things that have very long structure to them, like a lecture, where over the course of the lecture, I might realize I misunderstood something or some piece of jargon gets spelled out for me and now I need to go and transcribe it. So this is one place where our ability to innovate on products is actually really useful. We've just launched recently a product vision called SwiftScribe to help transcriptionists be much more efficient. And that's targeted at understanding all of these scenarios where the world wants

Starting point is 00:23:18 this long-form transcription. We have all of these conversations that we're having that are just sort of lost and we wish we had written down. But it's just too expensive to transcribe all of it for everyday applications. So in terms of emulating someone's voice, do you have any concerns about faking it? Because, did you see the face simulation? I forget the researcher's name, so I'll have to link to it, but you know what I'm talking about. So essentially you can feed it both video and audio, and you can recreate, you know, Adam talking. Do you have any thoughts on, like, how we can prepare for that world? You know, I think in some sense, this is a social question, right?

Starting point is 00:23:59 I think culturally we're all going to have to exercise a lot of critical thinking. We've always had this problem in some sense that I can read an article that has someone's name on it. and notwithstanding understanding writing style, I don't know for sure where that article came from. And so I think we have habits for how to deal with that scenario. We can be healthily skeptical, and I think we're going to have to come up with ways to adapt that to this sort of brave new world.

Starting point is 00:24:31 I think those are big challenges coming up, and I do think about them. But I also think a lot about just all the positives that AI is going to have. I don't talk about it too much. My mother actually has muscular dystrophy. And so things like speech and language interfaces are just incredibly valuable for someone who cannot type on an iPad because the keys are too far apart. And so there are just all these things that you don't really think about that these technologies are going to address over the next few years.

Starting point is 00:25:07 And on balance, I know that we're going to have a lot of big challenges of, like, how do we use these? How do we as users adapt to all of the implications? But I think we've done really well with this in the past, and we're going to keep doing well with it in the future. So do you think AI will create new jobs for people, or will we all be like mechanical Turks feeding the system? I'm not sure.

Starting point is 00:25:30 I think this is something where, you know, the job turnover in the United States every quarter is incredibly high. It's actually shocking that the fraction of our workforce that quits one occupation and moves to another one is really high. I think it is clearly getting faster. We talked about this phenomenon within the AI lab here, where the deep learning research is flying ahead so quickly that we're often remaking ourselves to keep up with it and to make sure that we can keep innovating. I think that might even be a little bit of a lesson for

Starting point is 00:26:10 everyone that continual learning is going to become more and more important going forward. Yeah. So speaking of, like, what are you teaching yourself so the robots don't take your job? I don't think we're at risk of robots taking our jobs right now. Actually, it's kind of interesting. We've thought a lot about, like, how does this change careers? One thing that has been true in the past is that if you were to create a new research lab, one of the first things you do is fill it with AI experts, where they live and breathe AI technology all day long. I think that's really important. I think for basic research, you need that kind of specialization. But because the field's moving so quickly, we also need a different kind of person now. We also need people who are sort of

Starting point is 00:26:58 chameleons who are these highly flexible types that can understand and even contribute to a research project, but can also simultaneously shift to the other foot and think about how does this interact with GPU hardware and a production system and how do I think about a product team and user experience? Because often product teams today can't tell you what to change in your machine learning algorithm to make the user experience better. It's very hard to quantify where it's falling off the edge. And so you have to be able to think that through to change the algorithms. You also have to be able to look at the research community to think about what's possible

Starting point is 00:27:35 and what's coming. And so there's a sort of amazing full-stack machine learning engineer that's starting to show up. Where are they coming from? Like if I want to be that person, what do I do like now? Say I'm 18. They seem to be really hard to find right now. I believe it. So in the AI lab, we've really set ourselves to just creating them.

Starting point is 00:27:56 I think this is sort of the way unicorns are, that we have to find the first few examples and see how exciting that is, and then come up with a way for people to learn and become that sort of professional. Actually, one of the cultural characteristics of our team is that we look for people who are really self-directed and hungry to learn. Things are going so quickly that we just can't guess what we're going to have to do in six months, and having that sort of do-anything attitude of saying, well, I'm going to do research today and think about research papers, but wow, once we get some traction and the results are looking good, we're going to take responsibility for getting this all the way to 100 million people.

Starting point is 00:28:42 That's a towering request of anyone on our team, and the things that we find really help everyone sort of connect to that and do really well with that are being really self-directed and able to kind of deal with ambiguity, and also really willing to learn a lot of stuff that isn't just AI research, but is also stepping way outside of comfort zones and learning about GPUs and high-performance computing and learning about how a product manager thinks.

Starting point is 00:29:10 Okay, so this has been super helpful. If someone wanted to learn more about what you guys are working on or even just things that have been influential to you, like what would you recommend they check out on the internet? Oh, my goodness. So, I'll have to think about this one for a second here. I think the stuff that's actually been quite influential for me is actually like startup books. I think, especially with big companies, it's easy to think of ourselves in silos of having a single job.

Starting point is 00:29:46 One idea from the startup world that I think is really amazingly powerful is this idea that a huge fraction of what you're doing is learning. There's a tendency, especially amongst engineers, among whom I count myself, that we want to build something. And so one of the disciplines we all have to keep in mind is that we also have to be really clear-eyed and think about what do we not know right now and focus on learning as quickly as we can to find the most important part of AI research that's happening and find the most important pain point that people in the real world are experiencing and then be really fast at connecting those. And I think a lot of that influence on my thinking has come from the startup world. There you go.

Starting point is 00:30:34 That's a great answer. Okay. Cool. Thanks, man. Thanks so much. All right. Thanks for listening. So please remember to rate the show and subscribe wherever you listen to podcasts.

Starting point is 00:30:43 And if you'd like to read the transcript or watch the video, you can check out blog.ycombinator.com. All right. See you next time.

Adam Coates is the Director of Baidu's Silicon Valley AI Lab.