Hidden Brain - Ep. 70: Who We Are At 2 A.M.

Episode Date: May 2, 2017

Have you ever Googled something that you would never dream of saying out loud to another human being? Many of us turn to Google when we have a deeply personal or embarrassing question. And we're often more honest when we type our questions into search engines than when we answer surveys or talk to friends. Seth Stephens-Davidowitz, a former data scientist at Google, says our online searches provide unprecedented insight into what we truly think, want, and do. This week on Hidden Brain, what big data knows about our deepest thoughts and secrets.

Transcript
Starting point is 00:00:00 A note before we get started: this episode includes a racial epithet and discussions about pornography. If you have small kids with you, please save this for later. This is Hidden Brain, I'm Shankar Vedantam. We start today's show with a personal question. Have you ever Googled something that you would never dream of saying out loud to another human being? When we have a question about something embarrassing or deeply personal, many of us today don't turn to a parent or to a friend, but to our computers.
Starting point is 00:00:32 Because there are just some things you just can't ask a real person in real life, and you need to ask Google. Because it's completely anonymous, and there are no judgments attached. Google knows everything. I agree with that. Every time we type into a search box,
Starting point is 00:00:47 we reveal something about ourselves. As millions of us look for answers to questions or things to buy or places to meet friends, our searches produce a map of our collective hopes, fears, and desires. You do learn a lot about people that's very, very different from what they say, and kind of the weirdness at the heart of the human psyche that doesn't really reveal itself in everyday life or at lunch tables, but does reveal itself at 2 a.m. on Pornhub.
Starting point is 00:01:17 Today on Hidden Brain, what big data knows about our deepest thoughts and secrets. My guest today is Seth Stephens-Davidowitz. He used to be a data scientist at Google and he's the author of the book Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are. Seth, welcome to Hidden Brain. Oh, thanks so much for having me, Shankar.
Starting point is 00:01:44 So Seth, we all know that Google handles billions of searches every day, but one of the insights you've had is that the reason Google knows a lot about us is not just because of the volume of search terms, but because people turn to Google as they might turn to a friend or a confidant. That's exactly right. I think there's something very comforting about that little white box, so that people feel very comfortable telling it things that they may not tell anybody else about their sexual interests, their health problems, their insecurities. And using this anonymous aggregate data, we can learn a lot more about people. One way is through these very strange correlations. You find, for example, there's a relationship between the unemployment rate and the kinds of searches people make online. Yeah, I was looking at what searches correlate most with the unemployment rate,
Starting point is 00:02:35 and I was expecting something like "new jobs" or "unemployment benefits," but during the time period I looked at, the single search that was most highly correlated with the unemployment rate was "Slutload," which is a pornography site. And you can imagine that if a lot of people are out of work, they have nothing to do during the day. They may be more likely to look at porn sites. Another search that was high on the list was "solitaire." So again, when people are out of work, they're bored. They do leisure activities, and potentially this measure of how
Starting point is 00:03:05 much leisure there is on the internet may help us know how many people are out of work on a given day. Of course, this helps us reconsider what we think of as data. When we think about the unemployment rate, as you say, our normal approach is to say how many people are seeking jobs, let's track down all the jobs. This is coming at the question entirely differently. Yeah, I think the traditional way to collect data was to send a survey out to people and have them answer questions, check boxes.
Starting point is 00:03:32 There are lots of problems with this approach. Many people don't answer surveys and many people lie to surveys. So the new era of data is kind of looking through all the clues that we leave. Many of them, not as part of questions or as part of surveys, but just clues we leave as we go through our lives. One of the important differences between mining this kind of data and the responses we get on surveys has to do with how people report their sexual orientation.
Starting point is 00:04:00 I understand that the kind of queries that you see on Google might reveal something quite different than if you ask people if they're gay. That's right. If you ask people in surveys today in the United States, about two and a half or three percent of men say that they're primarily attracted to men. And this number is far higher in certain states where tolerance of homosexuality is greater. So there are a lot more gay men, according to surveys, in California than in Mississippi. But if you look at search data for gay male pornography, it's a tiny bit higher in California, but not that much higher.
Starting point is 00:04:39 And overall about 5% of male pornography searches are for gay porn. So almost twice as high as the numbers you get in surveys. Your research has important implications for a topic that we've looked at a lot on Hidden Brain, the topic of implicit bias. People aren't always aware of the biases they hold, and so scientists have had to find clever ways to unearth these biases. You think that Google searches can reveal some forms of implicit bias? That's right. So one thing I look at is the questions that parents have about their children. If you ask many parents today, they would say that they treat their sons and daughters
Starting point is 00:05:17 equally, that they're equally excited about their intellectual potential, equally concerned about maybe their weight problems. But if you aggregate everybody's Google searches, you see large differences by gender: when parents in the United States ask questions starting "Is my son," they're much more likely to use words such as "gifted" or "a genius" than they would in a search starting "Is my daughter." When parents in the United States search "Is my daughter," they're much more likely to complete it with "Is my daughter overweight" or "Is my daughter ugly." So parents are much more excited about the intellectual potential of their sons and much more concerned about the physical appearance of their daughters.
Starting point is 00:06:00 You report that in some states after Barack Obama was elected president, there were more Google searches for a certain racist term than searches for "first black president." I think there is a disturbing element to some of this search data, where in the United States today, many people, and maybe this is a good thing, don't feel comfortable sharing that they have racist thoughts or racist feelings, but on Google, they do make these searches with strikingly high frequency. I need to use sordid language for this. The measure is the percent of Google searches that include the word nigger.
Starting point is 00:06:33 And these searches are predominantly searches looking for jokes mocking African-Americans. I should clarify, this is not searches for rap lyrics, which tend to use the version of the word ending in A. But if you look at the racist search volumes, I think if you had asked me, based on everything I had read about racism in the United States, I would have thought that racism in the United States was predominantly concentrated in the South, that really the big divide of the United States
Starting point is 00:06:59 when it comes to racism is South versus North. But the Google data reveals that's not really the case, that racism is actually very, very high in many places in the North, places like western Pennsylvania, eastern Ohio, or industrial Michigan, or rural Illinois, or upstate New York. The real divide these days when it comes to racism is not North versus South, it's East versus West. There's much higher racism east of the Mississippi than west of the Mississippi. So besides just saying, you know, we know that there are these patterns of racist searches
Starting point is 00:07:33 in different parts of the country, you're actually saying you can do more than that. You can actually predict how different parts of the country might vote in a presidential election based on the kind of Google searches you see in different parts of the country. Yeah, well, the first thing I found is that there was a large correlation between racist search volume and parts of the country where Obama did worse than other Democratic candidates had done. So Barack Obama was the first major party general election nominee who was African-American, and you see a clear relationship that Obama lost large numbers of votes in parts of the
Starting point is 00:08:11 country where there are high racist search volumes. And other researchers, such as Nate Silver at FiveThirtyEight and Nate Cohn at the New York Times, have found that there was a large correlation between racist search volumes and support for Donald Trump and the Republican Party, that parts of the country that made racist searches in high numbers were much more likely to support Donald Trump. And this relationship was much stronger than really any other variable that they tested. I'm wondering how you try and understand that kind of information. It's hard not to listen to what you're saying and draw sort of what seems to be a superficial conclusion, which is that racist people vote for Donald Trump. I'm not sure, is that what you're saying?
Starting point is 00:08:51 That's one of those things where it sounds so offensive to say it that I think everyone tiptoes around the line. I will say that the data does show a strong correlation between racist searches and support for Donald Trump that is hard to explain with any other explanation. You know, it's, yeah, I mean, yeah, that kind of is what I'm saying. I'm not saying that everybody who supported Donald Trump is racist by any stretch of the imagination.
Starting point is 00:09:17 There are plenty of people who support Donald Trump without this racist tendency, but a significant fraction of his supporters, I think, were motivated by racial animus. You spend a lot of time in the book talking about sex. It turns out to be an area where marketers and companies know that what we say about ourselves is nowhere close to the truth. Most people report not being interested in pornography, but the website Pornhub reports
Starting point is 00:09:42 that in 2015 alone, viewers watched two and a half billion hours of porn, which is apparently longer than the entire amount of time that humans have been on Earth. What does this say about us, the fact that we either have very little insight about ourselves or we're actually lying through our teeth? Yeah, I'd say we're probably lying through our teeth. Yeah, I'd say that, I do talk a lot about sex in this book.
Starting point is 00:10:06 One thing I like to say is that big data is so powerful, it turned me into a sex expert, because it wasn't a natural area of expertise for me, but I do talk a lot about sexuality. And I think you do learn a lot about people that's very, very different from what they say, and kind of the weirdness at the heart of the human psyche that doesn't really reveal itself in everyday life or at lunch tables, but does reveal itself at 2 a.m. on Pornhub. One of the things that I was wondering about as I read your book was how much search terms tell us about what people are actually thinking or actually feeling, and how much they might just tell us about things
Starting point is 00:10:45 that people are curious about. So certainly people search for a lot of things related to sex that would indicate that there is a large amount of interest in, say, sadomasochism and fetishes and so forth, but could some of it just be that people are curious? People hear a lot about this in the news or on social media and they Google something because they're just curious about it, not necessarily because they themselves
Starting point is 00:11:08 want to, you know, be part of the BDSM culture. I think it depends on the particular question you're looking at. So the reason we can trust that the racism data is meaningful is that it correlates with voting patterns. With the sex data, there's not really necessarily something to check it against. On the internet, we do see the videos that people watch, and I think that is pretty telling about some people's fantasies, even if it's not definitive, because some people may just be curious. Pornography sites aren't the only ones gathering information
Starting point is 00:11:43 about our sexual and romantic preferences. We now have apps like Tinder and sites like OkCupid that gather tons of data about us. As a result, these apps and sites know a lot about our romantic preferences. But for a long time, we've had a human version of big data for romance. Grandma. Seth has some personal experience with this big data source. A couple of years ago, he was having Thanksgiving dinner with his family. He was 33, didn't have a date with him,
Starting point is 00:12:10 and his family was trying to figure out the qualities Seth needed in a romantic partner. My family was going back and forth. My sister was saying that I need a crazy girl because I'm crazy. My brother was saying that my sister was crazy, that I need a normal girl to balance me out, and my mom was screaming at my brother and sister that I'm not crazy, and my dad was then screaming at my mom that of course Seth is crazy. So it's kind of a classic Stephens-Davidowitz family Thanksgiving
Starting point is 00:12:37 where everyone's just yelling at each other for being crazy and we're not really making any progress in learning about what I need in my love life. And then my soft-spoken 88-year-old grandma started to speak and everyone went quiet. And she explained to me that I need a nice girl, not too pretty, very smart, good with people, social so you will do things, sense of humor because you have a good sense of humor. And I describe why her advice was so much better than everybody else's. I think one of the reasons is that she's big data, right? So, grandmas and grandpas throughout history have had access to more data points than anybody else.
Starting point is 00:13:13 And they've been able to correlate larger patterns than anybody else has because they've been around longer. And that's why they've been such an important source of wisdom historically. The problem, of course, as you also point out, is that it's very hard to disentangle your personal experiences from what actually happens in the world, and in your grandmother's case, she actually had a very specific piece of relationship advice about the kind of person you should want, and some of that might not actually be backed up by the empirical evidence. Yeah, well, my grandma has told me on multiple occasions that it's important to have a common set of friends with a partner.
Starting point is 00:13:51 So she lived in a small apartment in Queens, New York, with my grandfather, and every evening they'd go outside and gossip with their neighbors. And she thought that was a big part of why their relationship worked. But actually recently, computer scientists have analyzed data from Facebook, and they can actually look at when people are in relationships and when they're out of relationships and try to predict what factors of a relationship make it more likely to last.
Starting point is 00:14:15 One of the things they tested was having a common group of friends. Some partners on Facebook share pretty much the same friend group, and some people have totally isolated friend groups. And they found, contrary to my grandmother's advice, that having a separate social circle is actually a positive predictor of a relationship lasting. And so of course, the risk of trusting the individual is that the individual's intuition about what worked for his or her life might not work for everyone else.
Starting point is 00:14:41 That's right. I think we tend to get biased by our own situation. Data scientists have a phrase called weighting data. Some data points get extra weight in our models. And our intuition gives too much weight to our own experience. And we tend to assume that what worked for us will work for others as well. And that's frequently not the case.
Starting point is 00:15:03 Many companies know that we don't really understand ourselves. When we come back, we look at how companies are using big data to predict what we're going to do before we know it ourselves. We'll also ask, if sites like Google can use data to forecast whether you're going to get a serious illness, should they give you that information? Stay with us. This is Hidden Brain, I'm Shankar Vedantam. Netflix used to ask users what kind of movies they wanted to watch.
Starting point is 00:15:36 Seth Stephens-Davidowitz says eventually, the company realized that asking this kind of question was a complete waste of time. Yeah, initially Netflix would ask people what they want to view in the future. So they could queue up the movies that they said. And if you ask people, what are you going to want to watch tomorrow or this weekend? People are very aspirational. They want to watch documentaries about World War II or avant-garde French films. But then when Saturday or Sunday comes around,
Starting point is 00:16:06 they want to watch the same lowbrow comedies that they've always watched. So Netflix realized they had to just ignore what people told them and use their algorithms to figure out what they'd actually want to watch. So one of the things that's intriguing about what you just said is, I don't think it's actually the case
Starting point is 00:16:22 that people were lying to Netflix when they said they wanted to watch the avant-garde film. They actually genuinely probably aspire to do that. It might actually be that big data understands people better than they understand themselves. Yeah, probably even more common than lying to other people is lying to ourselves. Particularly when we're trying to predict what we're going to do in two or three days, we tend to assume that we're going to go to the gym more than we go to the gym or eat better than we actually will eat or watch more intellectual stuff than we actually will watch. So the algorithms can correct for this over-optimism that we all tend to share.
Starting point is 00:17:01 When you look at a company like Facebook, which has access to these huge amounts of data about us and what we like and whom we like in our relationships, you have to wonder how the company is using this data in all kinds of different ways. I remember Facebook got into some hot water a couple of years ago because they ran an experiment that seemed to be manipulating how people feel. Of course, there was a huge outcry about the experiment at the time. And since then, there hasn't been very much reported about what Facebook is doing. But I suspect that it might just be because Facebook is no longer telling us what it's
Starting point is 00:17:34 doing, but it's still doing it anyway. Every major tech company now runs lots and lots of what are called A/B tests, which are little experiments where you put people into two different groups, a treatment and control group, and you show one group one version of your site, and the other group another version of the site, and you see which version gets the most clicks or the most views. This has really exploded in the tech industry. It's not just the tech industry that uses A/B testing.
Starting point is 00:18:07 Newspapers do too. Newspapers like the Boston Globe. A few years ago, the Globe tried out two different headlines for the same story, and then measured which headline got the most clicks. The newspaper then used the more effective headline for the rest of the day. I've been a journalist for about 25 years and spent most of that time working at newspapers. Seth wanted to test my headline writing expertise. He read out two versions of a headline for a Boston Globe story and he asked me to guess which
Starting point is 00:18:35 one worked better. So let's see, Shankar, if you can guess some of these winners. The first headline test, I'll give you headline A first and then headline B second. Headline A, "When the first subway opened in Boston." Headline B, "Cartoons from when the first subway opened in Boston." All right, that's gonna be easy. "Cartoons from when the first subway opened in Boston." No, it's headline A, got 33% more clicks
Starting point is 00:19:05 for "When the first subway opened in Boston." Oh, no. You want another one? Yeah, let's try it. I know where this is going, but let's try it. OK, headline A is "Woman makes bank off rare baseball card." And headline B is "Woman makes $179,000 off rare baseball card."
Starting point is 00:19:28 I'm gonna go with the specific dollar amount, so B. No, it's headline A, 38% more clicks for headline A. You're 0 for 2. Is there a third one? Can I redeem myself? Yeah, okay, let's do another one. All right, headline A, "Hookup contest at heart of St. Paul's rape trial." Headline B, "No charges in prep school sex scandal."
Starting point is 00:19:54 All right, so I'm going to follow a completely different strategy than I did the last two times, which is I'm just going to pick a number, and I picked the number before you even read the headlines out, to prevent myself from being biased. And I'm going to go with B again, just on the off chance that you couldn't tell me three answers where all the answers were A. That's right. I didn't even realize I was doing that, but headline B is correct.
Starting point is 00:20:14 A hundred eight percent more clicks for headline B. So good job. You got one for three, not so bad. And the interesting thing, of course, is I used sort of an algorithmic solution. Yeah, you have to, I guess, right? Yeah, so I think what this shows is that the reason that A/B testing is so important is because our intuition can trick us. You've been around journalism for many, many
Starting point is 00:20:37 years, and you have your own ideas of what makes a successful headline, but even someone like you is frequently wrong. And we can use A/B testing to correct our faulty intuition, find what actually works, not what we think works. It's one thing when companies use big data to serve us better. You could argue that a newspaper that delivers a catchier headline is serving its audience better. But there are many, many instances where companies are now using big data against us. Banks and other financial institutions are using clues from big data to decide who should get a loan. I think it's an area of big concern. So I talk about a study in the book where they studied a peer-to-peer lending site,
Starting point is 00:21:21 and they studied the text that people used in their requests for loans, and you can figure out just from what people say in their loan requests how likely they are to pay back. And there are some strange correlations. For example, if you mention the word God, you're 2.2 times less likely to pay back, 2.2 times more likely to default. And this does get eerie.
Starting point is 00:21:40 Are you really supposed to be penalized if you mention God in a loan application? That would seem to be really wrong, even evil, to penalize somebody for a religious preference. Basically, everything's correlated with everything. Just about anything anybody does is going to have some predictive power for other things they do. The legal system is really not set up for a world in which companies potentially can mine correlations
Starting point is 00:22:09 over just about everything anybody does in their life. I was thinking about an ethical issue. I'm not sure if necessarily this is a legal issue, but you mentioned in the book that, if someone is Googling, I've been diagnosed with pancreatic cancer, what should I do? It's reasonable to assume that this person has been diagnosed with pancreatic cancer. What should I do? It's reasonable to assume that this person has been diagnosed with pancreatic cancer. But if you collect all the people who are Googling what to do about that diagnosis with pancreatic
Starting point is 00:22:32 cancer and then work backwards to see what they've been searching for in the weeks and months prior to their diagnosis, you can discover some pretty amazing things. Yeah, this is a study in which researchers used Microsoft Bing data. They looked at people who searched for things like "just diagnosed with pancreatic cancer," and then similar people who never made such a search. And then they looked at all the health symptom searches they had made in the lead-up to either a diagnosis or no diagnosis. And they found that there were very, very clear patterns of symptoms that were far more likely
Starting point is 00:23:03 to suggest a future diagnosis of pancreatic cancer. For example, they found that searching for indigestion and then abdominal pain was evidence of pancreatic cancer, while searching for just indigestion without abdominal pain meant a person was much less likely to have pancreatic cancer. And that's a really, really subtle pattern in symptoms, right? Like a time series of one symptom followed by another symptom is evidence of a potential disease. It really
Starting point is 00:23:34 shows, I think, the power of this data, where you can really tease out very subtle patterns in symptoms and figure out which ones are potentially threatening and which ones are benign. So here's the ethical question. Once you establish that there is this correlation, that you sort of say, I have a universe of search terms that seem to be correlated with people who go on to have the diagnosis, versus these search terms that do not go on to predict a diagnosis, does a company like Microsoft now have an obligation to tell people who are Googling for these combinations of search terms: look, you might actually need to get checked out, you might actually need to go see a doctor? Because of course, if you can be diagnosed with pancreatic cancer, you know,
Starting point is 00:24:28 four weeks earlier, you have a much better chance of survival than if you have to wait for a month. I lean in the direction of yes. Some people would not lean that direction. It could be a little creepy if Google, right below the "I'm Feeling Lucky" button, you know, had "You may have pancreatic cancer." It's not exactly the most friendly thing to see on a website. But personally, if I had some sort of symptom pattern that suggested I may have a disease and there was a chance of curing it if I was told, I'd want to know that. It's just another example that really the ethical and legal framework that we've set up is not necessarily prepared for big data.
Starting point is 00:25:09 Seth Stephens-Davidowitz is a former data scientist at Google and the author of the book Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are. Seth, thank you for joining me today on Hidden Brain. Thanks so much for having me, Shankar. This week's episode was produced by Rhaina Cohen and edited by Tara Boyle. Our staff includes Jenny Schmidt, Maggie Penman, and Renee Klahr. Our unsung hero this week is Hugo Rojo. Hugo works on NPR's Media Relations team and he's one of those people who's always willing to be helpful. Hugo helps us with social media for the show.
Starting point is 00:25:45 He's also our in-house professional photographer. When we need a producer to record a line of narration in Spanish, Hugo puts up his hand. He's had some terrific ideas on how to reach new listeners, and he's always willing to share those ideas with us. Thanks, Hugo. Speaking of reaching listeners, we're hoping to get a better sense of how you found our
Starting point is 00:26:05 show. We've put together a quick survey. Your feedback can help us find more listeners for Hidden Brain. You can find the survey at n.pr-hiddenbrainsurvey. That's n.pr-hiddenbrainsurvey. And thanks. I'm Shankar Vedantam, and this is NPR.
