Advent of Computing - Episode 6 - Digital Voices

Episode Date: June 16, 2019

What are the origins of our modern day text-to-speech systems? In this episode we will dive into the rich history of electronic talking machines. Along the way I will tell you the story of the vocoder, the first singing computer, and a little about the father of modern synthesized speech.

Transcript
Starting point is 00:00:00 I'm sure you all can remember your very first cell phone. For me, it was in middle school. I had this junk, no-name flip phone that I pulled out of an e-waste bin. My friends all had phones by this point, and I really wanted to fit in. Now, it didn't super matter that the phone couldn't, well, call anyone. It was more about keeping up appearances. Now, I can't really remember much about that phone, besides the fact that I carried it around and pretended like I owned it. But one thing I do recall is it had
Starting point is 00:00:31 this cool hands-free mode. This is where you could just leave it open and you could tell it simple commands like open contacts, and it would give you a robotic response out of a tiny phone speaker. Now, this was 2000s mobile technology, so it only worked some of the time. But when it did, I thought it was the coolest thing ever. You had a small virtual assistant in your pocket and it could talk to you. Fast forward to 2019 and, well, let's just give this a listen. Hey Google, tell me a joke.
Starting point is 00:01:09 Where does a general keep his armies? In his sleevies. There's no accounting for taste, but that does sound pretty good. Just in my lifetime, we've gone from half-working voice assistants on small flip phones to synthesized speech that could easily be mistaken for real-life human speech. So how did computerized speech get this good? And how did that all start? Welcome back to Adrin of Computing. I'm your host, Sean Haas. Today, we're going to take a look at something that I think is pretty fun, talking machines. There's a long and rich history of humans trying to record or otherwise reproduce the sound of our own voices, which, when you stop and think about it, that's actually pretty self-absorbed. Believe it or not, attempts at talking machines may go back as far
Starting point is 00:02:06 as 1000 AD. So instead of covering all of that, I'm going to start us off with the first electric machines that were able to reproduce what could be termed as human speech. That should save us at least a few hundred years. Now, when I first started looking at this topic, I imagined it would be fairly easy and clear to break into eras. The first examples would have to be some kind of analog hardware, then there'd be a shift to digital, and finally fully software solutions. But, you see, as I got into the research, I started to see that it wasn't that clear cut. More so, speech synthesis can be roughly broken down into two large chunks with a lot of wiggle room in between, at least as far as I can see it.
Starting point is 00:02:52 Early attempts focused on finding a way to replicate how a human physically makes sound. Now that would either be done by working to somehow encode human speech and replicate it, or reproducing an entire vocal tract, some attempts even went as far as having a tongue and teeth. Now, later, that approach would start to shift to creating a more holistic model of how human speech and language interplayed together. It should go without saying that, since I like starting things at the beginning, I'm going to start us off in the first era by looking at a machine that attempted to create human speech by encoding and then electrically replicating a voice pattern. Our story begins in 1937 with a machine called the
Starting point is 00:03:39 voter. Now, just judging by the year, it's easy to tell we're in the pre-digital age. I think this is a good place to start our discussion because it will show us the issues that will have to be overcome in the following decades. The voter was, as I said earlier, an analog machine, and it was able to roughly mimic human speech. It was designed and built by a team of researchers at Bell Labs, headed by one Homer Dudley. Just prior to the start of the voter project, Dudley had been issued a patent for a very similar device called the vocoder. These two machines work off a very similar principle, so I think it bears explaining both of them here. The vocoder is, at its base, a way of encoding human speech. It does so by measuring the intensity of sound at a set series of different frequencies called bands. The result
Starting point is 00:04:33 is a set of numbers that represent that sound at any one point. To play back the sound, the process is just done in reverse, taking the set of numbers for each sample and turning that back into an audio waveform. If you have enough samples, then you can easily reconstruct a human voice. The main impetus behind this process is that transmitting audio takes a lot of bandwidth. By breaking it down into some fundamental representation, it's easier to send over wires. fundamental representation, it's easier to send over wires. But the vocoder has a side effect of creating a bizarre and artificial sound since it strips out anything unique in the voice. So why would someone want to create a strange lo-fi rendition of their own voice? Well, as I mentioned, it does take less bandwidth to transfer the encoded voice and just decode it on the other side than
Starting point is 00:05:25 to send raw audio. This means that it takes less advanced infrastructure and you can send a lot of vocoded messages in place of just one raw recording. The other useful aspect of the vocoder is it can be easily encrypted, since you just have to encrypt the data stream that you're sending. This ended up being used extensively by the Allies in World War II to send secure voice messages. You just encode with the vocoder, encrypt on one side, and then transmit, and it appears to be raw noise to anyone who doesn't have the decryption key. Then, as you're waiting for the message, once you get it, you just decrypt it
Starting point is 00:06:05 and then send it back through your own vocoder and boom. Secure voice messaging. Now, that's the gist of how the vocoder works, but how does that technology fit into Voter? Well, it turns out that the Voter is basically just the decoding side of a vocoder, with a few modifications. Instead of dealing with the encoding of a voice, then decoding it back to audio, the voter takes input from a keyboard type apparatus that is then turned into audio. Each key combination acts like the numbers that a vocoder internally uses to encode speech, and then that's decoded into sound. By inputting the correct combination of keystrokes, the voter can mimic human voice, at least roughly. In 1939, the voter was demonstrated at the New York World's Fair. Let's take a quick listen. She saw me with no expression. She saw me.
Starting point is 00:07:07 Now say it in answer to these questions. Who saw you? She saw me. Whom did she see? She saw me. Did she see you or hear you? She saw me. So, as you can hear, it does sound like a human voice, but there's two big issues with the voter. Well,
Starting point is 00:07:34 three if you want to be a little picky. First, the clip isn't really that intelligible. You can tell what it's saying, but you really have to listen to it. Second, Voter is not an easy machine to use. In fact, it's very difficult. The machine was controlled by a 10-key keyboard, plus a wrist bar and a pedal. The 10 keys were used to essentially key in the same data that the vocoder would decode, a set of intensities for 10 discrete frequencies. The wrist bar was used to select either a hiss or a buzz sound, while the foot pedal allowed the pitch to be adjusted slightly. An operator had to use all of these controls in combination to make the machine talk, and anyone that wanted to use the voter had to be trained on how to operate it extensively. In some cases, like with the operator that we just heard in that clip, the training would take more than a year.
Starting point is 00:08:23 Really, the voter is more like a musical instrument than a practical electronic reimplementation of the human voice. The third issue is that the voter is totally analog. Now, I feel this can be forgiven since in the 30s, almost no one knew any better. But the main problem with an analog-based system is flexibility. By using a digital computer, later talking machines will have a lot more flexibility on how they generate speech and how they deal with the mathematics behind it, instead of being restricted to a single workflow. Overall, the Voter was really a tech demo. It was impressive for the time, but ultimately a non-starter.
Starting point is 00:09:04 However, the device would be revisited in later years and much improved upon. The next computer voice I want to talk about is something of a personal favorite. That's the first recording of a singing computer. This should sound like a strange headline from a 60s sci-fi magazine, because, in part, it is. Now, this part of our story takes place in 1961 at, once again, Bell Labs. Just as an aside, the last few episodes have been pretty Bell Labs heavy. I should probably get either a sound cue for Bell, or try to tone it down on them maybe a little bit. for Bell or try to tone it down on them maybe a little bit. So 1961 is the year that John L. Kelly and Max Matthews created the first singing computer program. The two of them used an IBM 704 mainframe
Starting point is 00:09:55 interfaced with a vocoder to sing Daisy Bell, complete with musical accompaniment. What makes this a milestone is the fact that both the voice and the backing track were produced by a digital computer. Let's give the song a listen. I'm mad crazy all for the love of you It won't be a stylish wedding I'd pass the court of marriage But you look sweet upon the feet of a boy who built for you Like the voter, this demo is roughly intelligible, but really not that much better. As I mentioned before, the mainframe was interfaced with a vocoder to generate the audio.
Starting point is 00:10:57 So under the hood, it's basically a computer-controlled voter instead of one controlled by human hands. computer-controlled voter instead of one controlled by human hands. While this approach does give the accuracy and power of a digital computer, it still restricts the system to having to work with the set of sounds that a vocoder can generate. Okay, so I know you're all dying for the answer. How is this connected to 60s sci-fi? Well, the astute of you will no doubt have noticed that the song sounds familiar. In the climax of 2001 A Space Odyssey, the malevolent HAL 9000 computer sings Daisy Bell as it's deactivated. Now, this isn't mere coincidence. In fact, sometime in the early 60s, Arthur C. Clarke, the author of 2001, wrote about visiting Bell Labs and seeing that very demo of the singing mainframe. This is where his idea for that scene came
Starting point is 00:11:51 from. The Daisy Bell tech demo was a big deal for a few reasons. Firstly, it showed that computers, a relatively new technology, could be used for speech synthesis. At the same time, it demonstrated that computers could also be used to generate music. While this is exciting, it's still a long way off from easily recognizable and easily usable text-to-speech systems. But in the coming years, this too would be improved upon. This brings us up to the next big advancement in speech synthesis. Up to this point, each system I've touched on has two key flaws.
Starting point is 00:12:28 Firstly, it kinda sucks. At least in the audio quality and intelligibility category. And secondly, it's not very general purpose. Both examples are either difficult to use or have to be specifically programmed for a single task. So obviously, the next big step forward programmed for a single task. So obviously, the next big step forward would be a computerized speech system that's both intelligible and able to be used for general speech synthesis by a novice. It can't require extensive training like
Starting point is 00:12:57 the voter, and it can't be tied to a mainframe and careful programming like the Daisy Bell demo. The breakthrough that would solve these problems required a shift in the way we look at speech synthesis. You see, earlier examples of talking machines focused on how a human voice sounds. That is to say, they were derived from encoded human speech, or at least something akin to that. In these systems, the process of speech was something of a black box. The words went in and a voice came out. But that started to change as flexible and powerful digital computers became more available. Starting in the late 60s, the older paradigm was turned on its head thanks to dedicated research and better hardware.
Starting point is 00:13:46 That's when the idea of rule-based synthesis started to take shape. In this new system, researchers investigated how language played a role in the creation of speech. This would eventually lead to more fully realized talking machines. So, how does rule-based synthesis work, and how is it so much better than earlier attempts at speech synthesis? Explaining how these types of algorithms work turns out to not be very simple, unsurprisingly. In preparing this episode, I've spent more time than I'd like to admit going over scientific papers trying to wrap my head around this process. Near as I can tell, it works by breaking down a sentence into phonemes, the smallest unit of sound that can be used to build up words. But that's just the start. The crux of the process
Starting point is 00:14:37 is finding a way to make the speech sound natural from this list of pronunciations. If you just pronounced everything phonetically, at least in English, you wouldn't sound very natural or intelligible. To make a computerized voice that was both normal sounding somewhat and able to be understood, rule-based synthesis, as the name suggests, applies a series of rules that change how the pronunciations work to sound more natural. In practice, there's a huge set of rules that are applied in the thousands. These are things such as starting a sentence at a lower pitch, stressing the proper syllables and words, or calculating the proper duration of each syllable. Those are just a few examples, but I can't stress enough that
Starting point is 00:15:23 there's a lot of rules that go into this kind of algorithm. During the 60s and 70s, there were many researchers in the field, and the development of rule synthesis was really a collective effort. But one of the most influential scientists during this time was named Dennis Klatt. Klatt was a speech and hearing researcher at MIT from 1965 to 1988. This put him right in the center of really the leading edge of advancement in speech synthesis. But more than that, Klatt had a vision that a lot of other researchers at the time didn't. He wanted to create a speech system that could give a voice to people who were unable to talk, whether that be due to disease,
Starting point is 00:16:05 an illness, or some other reason. I think this is an important part of the story because unlike earlier researchers, Klatt didn't just want to make a talking machine to see if he could or to drive sales. He had a well-defined reason. The system Klatt would create would go on to be known as Klatt Talk. Now, his goal informed his designs greatly. He set out to make a system that would be intelligible, clear, and usable with little or no training. To make this system accessible to novices, Clatt settled on a real-time text-to-speech style of interface, something that really hadn't been used on a large scale beforehand. This means that to use Clatt's system, you would just have to type in the words or enter some type of text,
Starting point is 00:16:51 and ClackTalk would do the rest for you. Another key feature was its configurability. Instead of just targeting a single voice, the sound of ClackTalk could be tweaked by adjusting variables like its bass frequency or pronunciation speed. One of the more important rules using Clattalk was an algorithm to govern the duration of each phenome. In other words, a model used to calculate how long each syllable should be pronounced for, creating a natural cadence to the computerized tones. Once again, since this was based on a model, it could be altered to allow for different sounding voices by just adjusting one or two parameters. You could totally change the cadence of the voice.
Starting point is 00:17:34 All of these rules and more were put together to make a new, more natural-sounding digital voice. Of course, this complex model didn't come out of nowhere. The developers, including Klatt himself, all contributed rules and methods to the system based on real-world research and feedback. By focusing on a model of human speech instead of just attempting to make human-like sounds, Klatt Talk became a more holistic approach to speech synthesis. Part of this development was actually testing the perceptibility of synthesized speech. This was done by having people outside of the field and outside of the
Starting point is 00:18:11 research group rate the intelligibility and clarity of both single words and whole sentences. By using this feedback, they were able to iterate through the development process. This let them work on the system, then test it and revise it, and then do more research. By the early 80s, Clattalk was really looking impressive, but it was still a long way off from helping anyone. Dennis had an algorithm and a program running on a mainframe, but that wasn't usable to anyone except for fellow researchers in his lab at MIT. However, in 1982, that would all change. This is the year the Digital Equipment Corporation, better known as DEC, got involved. DEC saw a demo of the CLAT-TOC algorithm and was impressed, and so in 1982, they entered into
Starting point is 00:19:00 TOCS with CLAT to license out his algorithm. The DEC implementation of this algorithm, released in 1984, would end up being called DECTalk. Now, DEC wanted to bring text-to-speech to the market for a few reasons. Partly, this was to help the speech and vision impaired. An advanced speech synthesis solution hadn't been widely available yet, so DEC could really get in on the ground floor while helping to fulfill Klatt's vision of helping the disabled. But beyond that, DEC also saw the opportunity to use text-to-speech for automation. Original ad campaigns show that DECtalk was largely marketed at businesses and institutions to automate things like announcements or phone services. Really, anything that had to do with repetitive speech could be farmed out to a DECtalk. So, how did DEC go about implementing Clattalk so that it could be used outside of the lab?
Starting point is 00:20:00 Well, it turns out that shrinking down a system as complex as this takes a lot of power. Due to this, DECTOC ended up being implemented as a hardware device rather than software. The device itself was a large box, about 18 inches by 12 inches, weighing in at just over 16 pounds. Under the hood of this hefty beast was a Motorola 68000 microprocessor, 256KB of RAM, and 64KB of RAM. The machine also included circuitry for touch tone and serial communication as well as an audio output for, well, the actual speech output. So we have a device with CPU, memory, and peripherals.
Starting point is 00:20:50 Really, DECTOC's starting to sound like a computer, and in fact, in a lot of respects, it was. The 68000 CPU was the same processor that would be used for the first Macintosh. I think it's pretty interesting that speech synthesis, at least in the 80s, took as much power as an entire desktop Mac. When it came to actually using DECtalk, the system was really quite intuitive. There is a serial connection on the DECtalk device itself. That allowed outside programs to send raw text straight into the machine. The DECtalk would just sit there waiting for text commands to come in, and as soon as it got text text it would transpose it into speech, either putting it straight out the audio jack or sending the spoken word over a phone line through the touchtone interface.
Starting point is 00:21:43 By default, DECtalk spoke in a man's voice, but you could change that to one of eight presets that range from women to children to elderly men. The parameters of each voice could also be changed on the fly, allowing you to create new voices or even let the deck talk rudimentarily sing. So after all this fuss about how it works, let's listen to some deck talking. As you can hear, this does sound better than earlier attempts. Part of the reason this voice is so much more understandable is due to the breakthrough in rule-based synthesis. The testing and iteration process used to hone Clatic also helps, but as you can hear, there are still rough edges.
Starting point is 00:22:15 So where was DECTOC actually used? One thing that's neat about the preservation of this kind of computer history is devices like DECTOC can now be fully emulated in their original hardware all through the internet. That clip you just heard was generated using Archive.org's MAME emulator. I'll post a link in the show notes so you can check it out for yourself. So as DECTOC said, where was DECTOC used? During its product lifespan, DECtalk actually saw a lot of use in a lot of really far-flung fields. Many government agencies used the device to automate announcements, just as DEC had predicted. During the 1990s and into the mid-thousands even, NOAA weather alerts were voiced by a DECtalk device.
Starting point is 00:23:02 More in line with Klatt's original vision, however, his new methods of speech synthesis were used to help give back a voice to those who had lost theirs. The most prominent person to benefit from Klatt's work was the late Stephen Hawking. The physicist liked the system so much that, according to legend, he even sent Klatt a thank you note. During the 80s and beyond, many people unable to speak regained their voice by using DECTOC and similar devices based on Klatt's work. Dennis Klatt continued to work on perfecting his speech synthesis until his early death in 1988 due to complications of thyroid cancer. But that's not the end of Klatt's story. of thyroid cancer. But that's not the end of Klatt's story. Remember how DECTOC had preset voices? Well, the default voice, called Perfect Paul in the documentation, was actually constructed
Starting point is 00:23:54 from recordings of Dennis Klatt's own voice. The voice that's now most recognizable as Hawking's own, is, in fact, Perfect Paul, also known as Dennis Klatt. Even as technology progressed past the 80s and Hawking was able to update his computer systems, he would never allow his voice to be changed because it became part of who he was. Klatt's voice had become his voice, and really, I can't think of a compliment that would make Dennis Klatt happier. Alright, I think it's time to wrap this episode up. There's a lot more to say about speech synthesis, but I hope this episode has served as a primer on some of the larger breakthroughs in the field, at least during the electronic era. So next time you hear a computer speak, remember that the technology that enables that voice
Starting point is 00:24:49 comes from a rich history of research. If you found this topic interesting, then I'd recommend looking up Klatt's Last Tapes. That's a BBC Radio 4 program that covers a wider range of the history of speech synthesis. Thanks for listening to Ad Advent of Computing. I'll be back in two weeks time with a new episode on a new topic. In the meantime, if you like the show, please take a second to share it with your friends. As always,
Starting point is 00:25:14 you can rate and review on iTunes. If you have any comments or suggestions for a future show, go ahead and shoot me a tweet. I'm at Advent of Comp on Twitter. And as always, have a great rest of your day.
