The Economics of Everyday Things - 90. Closed Captions

Episode Date: April 28, 2025

It takes a highly skilled stenographer — and some specialized equipment — to transcribe TV dialogue in real time at 300 words per minute. Will A.I. rewrite the script? Zachary Crockett tries to keep up.

SOURCES:
Doug Karlovits, general manager at Verbit.
Katie Ryan, live steno captioner at Verbit.

RESOURCES:
"The Long Case for Machine Shorthand," by Sam Corbin (New York Times, 2024).
"Caption This: Why Subtitling Is Big Business Amid the Content Boom," by Kirsten Chuba (The Hollywood Reporter, 2023).
"Everyone Watches TV with Subtitles Now. How'd That Happen?" by Wilson Chapman (IndieWire, 2023).
"When is Captioning Required?" (National Association of the Deaf).

Transcript
Starting point is 00:00:00 Katie Ryan's home office in Pittsburgh, Pennsylvania is pretty run-of-the-mill. I just have a regular Ikea desk. I have a big TV up on the wall. I have a laptop stand with my laptop on it, and then I have a monitor stand that has two monitors on it. There's a blanket on the floor for my dog, you know. But the work she does at this desk is seen by millions of people every week. I've done the Super Bowl a handful of times. I've done the Olympics many times. I just did the Oscars a couple weeks ago. Any sporting event that you can think of, I've probably
Starting point is 00:00:42 done it. Any major news event that has happened, I have probably been involved in that somehow. Presidential funerals, presidential debates. I remember when the Boston Marathon bombing happened, the breaking news was just constant. I think I was on the air writing without a commercial break for something like three and a half hours. Ryan is a captioner. She writes the text transcripts that appear on your TV screen when you turn on closed
Starting point is 00:01:10 captioning. She does this in real time. Most people think their TV just does it. They don't realize that there's a person like me sitting in a room with headphones on. And people don't realize that it's happening live. Like if I'm writing a news broadcast or a sporting event, maybe I have like five seconds extra than you do when you're hearing it.
Starting point is 00:01:32 And I have to write it at the same time and try and keep up with all the speedy talkers that are out there. In some ways, it's a good business to be in. One survey found that 50% of Americans and 70% of Gen Z viewers say they watch content with captions on most of the time. But the industry is also rapidly changing. The nimble fingers of human captioners like Katie Ryan are up against the neural networks of artificial intelligence services. Technology is the key to the future of captioning,
Starting point is 00:02:16 but you know, you need people that are looking at the content. For the Freakonomics Radio Network, this is the economics of everyday things. I'm Zachary Kroett. Today, closed captions. The term captions is often used interchangeably with subtitles, but the two are different. Subtitles are used for translation. Captions are designed for people with hearing impairments, and they describe every auditory element—
Starting point is 00:02:44 dialogue, sound effects, music, and sometimes even background noises. The goal of captioning is to give the user the content of exactly what's being heard. That's Doug Karlovitz. He's a general manager at Verbit, the largest provider of captions in America. He says that if you're watching something on TV, either live or pre-recorded, you can
Starting point is 00:03:10 almost always turn on the captions in the device's settings. But that wasn't always an option. Really captions were born for television in 1970. The first pre-recorded show ever captioned was The French Chef with Julia Chow. The earliest efforts were called open captions, and they were limited to pre-recorded shows. The text was a permanent part of the video. Eventually, a new method called closed captions made it possible for viewers to turn the text on and off. And by the 1980s, thanks to the efforts of the nonprofit National Captioning Institute, captions could also be used for live television. Around this
Starting point is 00:03:57 time, Karlovitz's father Joe saw an opportunity to expand the captioning industry. My father was a court reporter, a stenographer, and he became very interested in computers and how to take his stenotype and get it translated through a computer into English. Stenographers are extremely fast typists. On stenotype machines, they can transcribe up to 300 words per minute. Joe began training fellow stenographers to do TV captioning, and in 1986, he founded a company called VITAC, which was later acquired by another company called Verbit.
Starting point is 00:04:40 We started out with a local television station in Pittsburgh and eventually grew into the largest provider in North America of captioning. Today, broadcasters, cable companies, and satellite services are required by federal laws to have captions available for nearly every televised program. This also carries over to much of the media on streaming services online, and most video content in public settings, like courtrooms, hospitals, schools, and sports bars. Captions have to be readable, accurate, and inclusive of all audio context. They have to clearly identify each speaker, and for live broadcasts, like news programs, they appear almost in real time. In the United States, everything that airs on television should have captions today.
Starting point is 00:05:33 Almost every show has captions on it. VITAC is one of three companies alongside IBM and Zoo Digital Group that control around 60% of the captioning market. Karlovitz says they caption around 500,000 hours of content a year. We work with all the major broadcasters, all the various producers of television programs. Work with all the different universities around the world, providing captions for the classroom. On the legal side, we're working with law firms and court reporting agencies. And on the government side,
Starting point is 00:06:13 we'll do anything from town halls to training on all the different things. We also work with sports venues, theaters. So everywhere where words are spoken, there's the opportunity to add captions. Much of today's captioning has shifted from human stenographers to automated tools. In some cases, the captioning service uses a technique called re-speaking.
Starting point is 00:06:39 A human employee watches a show in a recording booth and carefully recites every word into a special microphone. Voice-to-text software turns the narration into a written transcript. In other cases, particularly with pre-recorded TV shows, technology can be used to generate text from a script. But for live TV, like news broadcasts, Super Bowls, and presidential debates, a human captioner clacking away at a machine is still the most reliable option. A stenographer gets a live feed of a network's audio a few seconds before it goes to the general public.
Starting point is 00:07:17 They listen through a pair of headphones while typing out the words in shorthand on their stenotype machine. This shorthand goes through processing software on a computer that turns it into text. The text is embedded in a video signal that's transmitted to the television network through modems and IP connections. And when you press the closed captions button
Starting point is 00:07:38 on your remote, a microchip inside your TV retrieves and displays the captions on screen. It's a complex process, and networks might pay Verbit anywhere from $130 to $175 per hour for live human captioning services. So if you have a broadcast show that's in a 30-minute block, but it may be really only on the air for 24 minutes, they would pay for that on a 30-minute block, but it may be really only on the air for 24 minutes. They would pay for that on a per-minute basis. If you're doing a live show, you're paying basically for the times that are booked, because
Starting point is 00:08:13 you don't know how long those live shows can go. So who are these humans who create the captions on TV? And what's it like to be on the clock during a live broadcast? Sometimes you can't even get a drink of water. That's coming up. Katie Ryan didn't start out hoping to be a professional captioner. When I was graduating high school, I really didn't know what I wanted to do with my life. And my great aunt Sandy, her sister at the time,
Starting point is 00:08:48 was an official court reporter in Philadelphia. And Sandy said, well, you can type fast on a keyboard. Why don't you look into stenography? Ryan completed a court reporting program at a community college in Pittsburgh and joined VITAC, now Verbit, after graduating. She's been at the company as a captioner for more than two decades. In her work, Ryan uses a machine called a Stenotype. It has a small screen and around
Starting point is 00:09:16 20 unmarked keys that look kind of like popsicle sticks. She's able to type at speeds of up to 300 strokes per minute using a technique called cording. She presses down on multiple keys simultaneously to phonetically spell out whole syllables, words, and phrases with one motion. Stenography is essentially learning another language. It's combinations of keys to make It's combinations of keys to make words. And so on the machine, each key has a letter, and then there are combinations of keys that make more letters.
Starting point is 00:09:52 P, B would be N. The letter I would be E, U. The letter D would be T, K. And then there are combinations of keys that make words. So and would be A, P, B, D. Your hands are on different sides of the keyboard on the machine. Your left hand is prefixes, your right hand is suffixes.
Starting point is 00:10:16 And then you have your endings, I-N-G-S-E-D on your right side. Brian can spell out entire phrases with just a few keystrokes. A good example would be like ladies and gentlemen. That would be good for TV or court. On my machine, it would be L-A-I-R-J. So you hit all of those keys at once and ladies and gentlemen will come out in your computer
Starting point is 00:10:39 software. In one fell swoop. In one stroke, you get all of those words. Before she goes live, Ryan creates a dictionary full of customized briefs, abbreviations of specific words that she knows will reoccur throughout the broadcast. For the Academy Awards,
Starting point is 00:10:56 she'll program combinations of keystrokes for the title of each nominated movie. For a hockey game, she'll program every player's name. Instead of having to write out their name every single time that it's said, you hit that one combination of keys one time or twice and then that whole name will come out. Obviously, we have to search ahead of time to find out who like your play-by-play announcer is and who your color analyst is. But the process doesn't usually go without a hitch or two. Captioners are human, after all, and they make the occasional mistake.
Starting point is 00:11:31 While there's no federally mandated benchmark, the standard for accuracy in the industry is 99%, meaning one out of every 100 words might be misspelled or altogether butchered. Oftentimes, a captioner is aware of a typo. They just don't have the time to fix it during a high-speed live broadcast. We have the asterisk on my machine, which is the key in the very middle, that can erase a mistake.
Starting point is 00:11:58 But nine times out of ten, you are not going to catch it fast enough before it already goes out on the air. And then if you try and take it back, it's just gonna garble the captions up. So it's better to just, if you make a mistake, just ignore it and keep writing and move past it. And then the faster it moves off the screen, the faster people will forget about it.
Starting point is 00:12:16 Even after 21 years in the job, Ryan has a few recurring issues. I tend to drag my fingers. So sometimes I will catch extra letters when I'm trying to write certain words or I'll miss keys too. Like if my fingernails are too long sometimes, I can't quite hit the keys right. Sometimes you might notice the captions pause for a few moments or go blank. This is likely because the captioner fell off pace and is trying to catch up.
Starting point is 00:12:50 This happens most often with news shows where the banter can be lightning fast. Rachel Maddow, who hosts her own live show on MSNBC, has been clocked talking at up to 270 words per minute. A challenge for even the most seasoned captioner. If you need to just let a sentence go and then catch up again, that's okay. When you start paraphrasing though, then you take the risk of presenting the wrong information or turning it into something that they didn't actually say. And that's the last thing you want to do. You don't want to put words in anybody's mouth. The goal is to provide a text equivalent of as much of the audio as possible.
Starting point is 00:13:25 This can be particularly challenging when multiple people are speaking at once. A lot of times it'll just be, you know, a couple of words in a dash, and then the next person will be a couple of words in a dash. Sometimes there's nothing you can do. If they're just screaming at each other, there is nothing you can do, you know. Once they figure it out, then you can keep going again. Doug Karlovitz, the general manager at Verbit, says certain TV shows pose more problems than others.
Starting point is 00:13:52 Like The Osborns, a reality show from the early 2000s that followed the aging and often incomprehensible rock star Ozzy Osborn and his family. The debates around the office on what we thought he was saying on that show was good watercooler conversations. Well, first was, is he just putting this on? Eventually as that show got renewed, you realize, no, that's how Ozzy talks. It was really like, I think he said this. And then, you know, people would go and come over, listen to this. What do you think he said? And, you know,
Starting point is 00:14:30 you would just sit there and I don't know. I don't know what he said. I don't think he knows what he was saying. There are also elements that require interpretation, like how to caption a noise or a nonverbal vocalization. Some networks and studios are particular. Disney reportedly has specific rules about how R2-D2's mechanical noises should be captioned. Netflix is fond of using the phrase wet swelching to describe the sound of monsters in the show Stranger Things.
Starting point is 00:15:01 For background noises and live captioning, Ryan uses a list of templatized descriptions. We call them parentheticals, so like bells tolling or applause, singing, chanting, things like that. You want to try and be descriptive, but also you don't want to go overboard. All of this effort is to ensure that people who are deaf or hard of hearing have equal access to media, but captions have found a much broader audience. A 2022 survey by the language learning platform Preply found that half of all viewers now watch media with captions on most of the time. Some have speculated that's at least partly to do with modern sound mixing,
Starting point is 00:15:42 which alternates between loud sound effects and quiet dialogue. Game of Thrones, there was so much background noise occurring on that show that a lot of the people started using captions. But the most frequent users of captions are now younger people, particularly Gen Z. And that has more to do with changes in the media landscape. The younger viewers, they're watching it on their phones. They're watching it on their iPads.
Starting point is 00:16:11 They're not necessarily listening, but they're reading it as they're in class or they're at work and don't wanna call attention to themselves. Some publishers have estimated that up to 85% of the videos they post on Facebook are watched on mute. Many short-form videos on social media sites now have captions coded directly into the media file that can't be turned on or off.
Starting point is 00:16:36 That's because it's keeping that person who's looking, it's keeping their attention longer. Some platforms, like YouTube, offer their own tools to creators that use speech recognition to generate captions automatically. Karlovitz says artificial intelligence has already fundamentally changed the captioning business. Verbit offers automatic speech recognition
Starting point is 00:16:59 and generative AI tools that are trained with diverse language models to pick up on speech patterns. Karlovitz says these options cost much less than traditional transcription, but they still aren't as accurate or precise as a human captioner. And at least for now, many clients still prefer their captions to be generated by a human being, like Katie Ryan. Maybe a deaf person is in an area that there's tornadoes, and they turn on their local news.
Starting point is 00:17:32 We want those people to be able to have captioning that is as accurate and as clean as possible, so they know what to do and they can be safe. I will always advocate for a human captioner to be there to give the best service possible. When you watch TV, do you always use the captions? No. Never have captions on in my house. Really? Never, no. I sit in front of a computer and deal with that all day.
Starting point is 00:17:59 I don't need to worry about it. I'm off the clock. For the economics of everyday things, worry about it. I'm off the clock. For the economics of everyday things, I'm Zachary Krakat. This episode was produced by me and Sara Lilly and mixed by Jeremy Johnston. We had help from Daniel Moritz-Rapson and thanks to our listeners Owen Roberts and David Kennet for suggesting this topic. If you have an idea for an episode, feel free to email us at everydaythings at Freakonomics.com.
Starting point is 00:18:31 Our inbox is always open. All right, until next week. What if you're in the middle of like a live broadcast and you just really have to pee? Now from my office to my bathroom is like ten steps, so I can make it. The Freakonomics Radio Network. The hidden side of everything. Stitcher.
