TED Talks Daily - Why AI needs a "nutrition label" | Kasia Chmielinski

Episode Date: June 13, 2024

What do sandwiches have to do with AI? Data reformist Kasia Chmielinski helps us think about artificial intelligence with a useful food metaphor — and breaks down why AI systems should have... "nutrition labels" to ensure the development of fairer, more transparent algorithms.

Transcript
Starting point is 00:00:00 TED Audio Collective. Data reformist Kasia Chmielinski helps us think about artificial intelligence with this useful food metaphor and breaks down AI's ingredients to show how they can make us sick sometimes and how we can make these algorithms healthier, after a break. Support for this show comes from Airbnb. If you know me, you know I love staying in Airbnbs when I travel. They make my family feel most at home when we're away from home. As we settled down at our Airbnb during a recent vacation to Palm Springs, I pictured my own home sitting empty. Wouldn't it be smart and better put to use welcoming a family like mine by hosting it on Airbnb? It feels like the practical thing to do, and with the extra income, I could save up for renovations to make the space even more inviting for ourselves and for future guests. Your home
Starting point is 00:01:10 might be worth more than you think. Find out how much at Airbnb.ca slash host. AI keeping you up at night? Wondering what it means for your business? Don't miss the latest season of Disruptors, the podcast that takes a closer look at the innovations reshaping our economy. Join RBC's John Stackhouse and Sonia Sennik from Creative Destruction Lab as they ask bold questions like, why is Canada lagging in AI adoption, and how can it catch up? Don't get left behind. Listen to Disruptors: The Innovation Era, and stay ahead of the game in this fast-changing world. Follow Disruptors on Apple Podcasts, Spotify, or your favorite podcast platform.
Starting point is 00:01:57 I want to tell you about a podcast I love called Search Engine, hosted by PJ Vogt. Each week, he and his team answer these perfect questions, the kind of questions that when you ask them at a dinner party, completely derail conversation. Questions about business, tech, and society, like is everyone pretending to understand inflation? Why don't we have flying cars yet? And what does it feel like to believe in God? If you find this world bewildering, but also sometimes enjoy being bewildered by it, check out Search Engine with PJ Vogt. Available now wherever you get your podcasts.
Starting point is 00:02:31 And now, our TED Talk of the day. Now, I haven't met most of you, or really any of you, but I feel a really good vibe in the room. And so I think I'd like to treat you all to a meal. What do you think? Yes. Yes? Great. So many new friends. So we're going to go to this cafe. They serve sandwiches. And the sandwiches are really delicious,
Starting point is 00:02:52 but I have to tell you that sometimes they make people really, really sick. And we don't know why, because the cafe won't tell us how to make the sandwich. They won't tell us about the ingredients. And then the authorities have no way to fix the problem. But the offer still stands, so who wants to get a sandwich? Some brave souls. We can talk after. But for the rest of you, I understand you don't have enough information to make good choices about your safety or even fix the issue. Now, before I further the anxiety here, I'm not actually trying to make you sick,
Starting point is 00:03:29 but this is an analogy to how we're currently making algorithmic systems, also known as artificial intelligence, or AI. Now, for those who haven't thought about the relationship between AI and sandwiches, don't worry about it. I'm here for you. I'm going to explain. You see, AI systems, they provide benefit to society. They feed us. But they're also inconsistently making us sick. We don't have access to the ingredients that go into the AI,
Starting point is 00:03:55 and so we can't actually address the issues. We also can't stop eating AI like we can just stop eating a shady sandwich, because it's everywhere. We often don't even know that we're encountering a system that's algorithmically based. So today, I'm going to tell you about some of the AI trends that I see. I'm going to draw on my experience building these systems
Starting point is 00:04:15 over the last two decades to tell you about the tools that I and others have built to look into these AI ingredients. And finally, I'm going to leave you with three principles that I think will give us a healthier relationship to the companies that build artificial intelligence. I'm going to start with the question, how did we get here? AI is not new. We have been living alongside AI for two decades.
Starting point is 00:04:40 Every time that you apply for something online, you open a bank account, or you go through passport control, you're encountering an algorithmic system. We've also been living with the negative repercussions of AI for 20 years. And this is how it makes us sick. These systems get deployed on broad populations, and then certain subsets end up getting negatively disparately impacted,
Starting point is 00:05:03 usually on the basis of race or gender or other characteristics. We need to be able to understand the ingredients to these systems so that we can address the issues. So what are the ingredients to an AI system? Well, data fuels the AI. The AI is going to look like the data that you gave it. So, for example, if I want to make a risk assessment system for diabetes, my training data set might be adults in a certain region.
Starting point is 00:05:33 And so I'll build that system, it'll work really well for those adults in that region. But it does not work for adults in other regions or maybe at all for children. So you can imagine if we deploy this for all those populations, there are going to be a lot of people who are harmed. We need to be able to understand the quality of the data before we use it. But I'm sorry to tell you that we currently live in what I call the wild west of data.
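(To make that concrete: below is a minimal sketch, in Python, of the kind of pre-use check the talk is calling for, applied to the diabetes example above. The dataset, the field names, and the 10 percent threshold are all hypothetical, invented for illustration; this is not the speaker's tooling.)

```python
# A toy pre-use audit, not the speaker's tooling: flag demographic groups
# that a training set under-represents, echoing the diabetes example above.
from collections import Counter

def coverage_report(records, field, min_share=0.10):
    """Return each group's share of records, flagging shares below min_share."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {
        group: (n / total, "OK" if n / total >= min_share else "UNDER-REPRESENTED")
        for group, n in counts.items()
    }

# Hypothetical training records: mostly adults from one region.
training_data = (
    [{"region": "north", "age_band": "adult"}] * 880
    + [{"region": "south", "age_band": "adult"}] * 100
    + [{"region": "north", "age_band": "child"}] * 20
)

for f in ("region", "age_band"):
    print(f, coverage_report(training_data, f))

# Caveat: a group that is entirely absent from the data never shows up
# in the report at all -- absence is a gap this simple check cannot see.
```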
Starting point is 00:05:57 It's really hard to assess the quality of data before you use it. There are no global standards for data quality assessment. There are very few data regulations around how you can use data and what types of data you can use. This is kind of like if, in the food safety realm, we couldn't understand where the ingredients were sourced and had no idea whether they were safe for us to consume. We also tend to stitch data together.
Starting point is 00:06:23 And every time we stitch this data together, whether we find it on the internet, scrape it, generate it, or source it elsewhere, we lose information about the quality of the data. And the folks who are building the models are not the ones that found the data, so there's further information that's lost. Now, I've been asking myself a lot of questions about
Starting point is 00:06:43 how can we understand the data quality before we use it? And this emerges from two decades of building these kinds of systems. And the way I was trained to build systems is similar to how people do it today. You build for the middle of the distribution. That's your normal user. So for me, a lot of my training data sets would include information about people from the Western world who speak English,
Starting point is 00:07:06 of certain normative characteristics. And it took me an embarrassingly long amount of time to realize that I was not my own user. So I identify as nonbinary, as mixed race, I wear a hearing aid, and I just wasn't represented in the datasets that I was using. And so I was building systems that literally didn't work for me. For example, I once built a system that repeatedly told me that I was a white, Eastern European lady.
Starting point is 00:07:33 This did a real number on my identity. But perhaps even more worrying, this was a system to be deployed in health care, where your background can determine things like risk scores for diseases. And so I started to wonder, can I build tools and work with others to do this so that I can look inside of a dataset before I use it? In 2018, I was part of a fellowship at Harvard and MIT,
Starting point is 00:07:59 and I, with some colleagues, decided to try to address this problem. And so we launched the Data Nutrition Project, which is a research group and also a nonprofit that builds nutrition labels for data sets. So similar to food nutrition labels, the idea here is that you can look inside of a data set before you use it. You can understand the ingredients, see whether it's healthy for the things that you want to do.
Starting point is 00:08:24 And we launched this with two audiences in mind. The first audience are folks who are building AI, so they're choosing data sets. We want to help them make a better choice. The second audience are folks who are building data sets. And it turns out that when you tell someone they have to put a label on something, they think about the ingredients beforehand.
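(As a concrete sketch of the idea: a dataset nutrition label is, at bottom, structured metadata that travels with a data set. The Python below shows one hypothetical shape for such a label; the field names and example values are invented for illustration and are not the Data Nutrition Project's actual schema.)

```python
from dataclasses import dataclass

@dataclass
class DatasetNutritionLabel:
    """Illustrative fields only; the real label format is richer than this."""
    name: str
    source: str                     # where the data came from
    collection_method: str          # how it was gathered
    time_range: str                 # when it was collected
    populations_covered: list[str]  # who is represented
    known_gaps: list[str]           # who or what is missing
    intended_uses: list[str]        # what the data is "healthy" for
    license: str

# A hypothetical label for the diabetes example from earlier in the talk.
label = DatasetNutritionLabel(
    name="regional-diabetes-screening-v1",
    source="clinic intake forms from a single region",
    collection_method="opt-in patient survey",
    time_range="2015-2018",
    populations_covered=["adults in region X"],
    known_gaps=["children", "adults outside region X"],
    intended_uses=["diabetes-risk research for adults in region X"],
    license="research-only",
)

# A prospective user can check fit before training on the data:
print(label.known_gaps)     # ['children', 'adults outside region X']
print(label.intended_uses)  # ['diabetes-risk research for adults in region X']
```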
Starting point is 00:08:44 The analogy here might be, if I want to make a sandwich and say that it's gluten-free, I have to think about all the components as I make the sandwich, the bread and the ingredients, the sauces. I can't just put it on a sandwich and put it in front of you and tell you it's gluten-free. Now, we're really proud of the work that we've done. We launched this as a design and then a prototype
Starting point is 00:09:03 and ultimately a tool for others to make their own labels. And we've worked with experts at places like Microsoft Research, the United Nations, and professors globally to integrate the label and the methodology into their workflows and into their curricula. But we know it only goes so far. And that's because it's actually really hard to get a label on every single data set. And this comes down to the question of why would you put a label on a data set to begin with? Well, the first reason is not rocket science. It's that you have to. And this is, quite frankly, why food nutrition labels exist. It's because if they didn't put them on the boxes, it would be
Starting point is 00:09:41 illegal. However, we don't really have AI regulation. We don't have much regulation around the use of data. Now, there is some on the horizon. For example, the EU AI Act just passed this week. And although there are no requirements around making the training data available, they do have provisions for creating transparency labeling, like the data set nutrition label, data sheets, data statements. There are many in the space. We think this is a really good first step. The second reason that you might have a label on a data set is because it is a best practice or a cultural norm. The example here might be how we're starting to see more and more food packaging and menus at restaurants
Starting point is 00:10:25 include information about whether there's gluten. This is not required by law, although if you do say it, it better be true. And the reason that people are adding this to their menus and their food packaging is because there's an increased awareness of the sensitivity and kind of the seriousness of that kind of an allergy or condition. So we're also seeing some movement in this area.
Starting point is 00:10:48 Folks who are building data sets are starting to put nutrition labels, data sheets on their data sets, and people who are using data are starting to request the information. This is really heartening. So you might say, Kasia, why are you up here? Everything seems to be going well. It seems to be getting better. In some ways, it is.
Starting point is 00:11:08 But I'm also here to tell you that our relationship to data is getting worse. Now, the last few years have seen a supercharged interest in gathering data sets. Companies are scraping the web. They're transcribing millions of hours of YouTube videos into text. By some estimates, they'll run out of information on the internet by 2026. They're even considering buying publishing houses so they can get access to printed text in books. So why are they gathering this information?
Starting point is 00:11:36 Well, they need more and more information to train a new technique called generative AI. And I want to tell you about the size of these datasets. If you look at GPT-3, which is a model that launched in 2020, the training data set included 300 billion words or parts of words. Now, for context, the English language contains less than a million words. And just three years later, DBRX was launched, which was trained on 8 trillion words. So 300 billion to 8 trillion in three years.
Starting point is 00:12:09 And the data sets are getting bigger. And with each successive model launch, the data sets are actually less and less transparent. And even when we have access to the information, it's so big, it's so hard to look inside without any kind of transparency tooling. And the generative AI itself is also causing some worries. You've probably encountered this technique through ChatGPT. I don't need to know what you do on the internet. That's between you and the internet. But you probably know, just like I do, how easy it is to create information using ChatGPT and other generative
Starting point is 00:12:41 AI technologies and to put that out onto the web. And so we're looking at a situation in which we're going to encounter lots of information that's algorithmically generated, but we won't know it, and we won't know whether it's true. And this increases the scale of the potential risks and harms from AI. Not only that, I'm sorry, but the models themselves are getting controlled by a smaller and smaller number of private actors in US tech firms.
Starting point is 00:13:06 So if we go back to our cafe analogy, this is like you have a small number of private actors who own all the ingredients, they make all the sandwiches, globally, and there's not a lot of regulation. And so at this point, you're probably scared and maybe feeling a little uncomfortable, which is ironic, because a few minutes ago,
Starting point is 00:13:24 I was going to get you all sandwiches, and you said yes. This is why you should not accept food from strangers. But I wouldn't be up here if I weren't also optimistic. That's because I think we have momentum behind the regulation and the culture changes, especially if we align ourselves with three basic principles about how corporations should engage with data. The first principle is that companies that gather data should tell us what they're gathering. This would allow us to ask questions like, is it copyrighted material? Is that information private?
Starting point is 00:13:55 Could you please stop? It also opens up the data to scientific inquiry. The second principle is that companies that are gathering our data should tell us what they're going to do with it before they do anything with it. And by requiring that companies tell us their plan, this means that they have to have a plan, which would be a great first step.
Starting point is 00:14:17 It also probably would lead to the minimization of data capture, because they wouldn't be able to capture data if they didn't know what they were already going to do with it. And finally, principle three, companies that build AI should tell us about the data that they use to train the AI. And this is where data set nutrition labels and other transparency labeling comes into play.
Starting point is 00:14:38 In the case where the data itself won't be made available, which is most of the time, probably, the labeling is critical for us to be able to investigate the ingredients and start to find solutions. So I want to leave you with the good news, and that is that the Data Nutrition Project and other projects are just a small part of a global movement towards AI accountability. The data set nutrition label and other projects
Starting point is 00:15:05 are just a first step. Regulation is on the horizon, the cultural norms are shifting, especially if we align with these three basic principles, that companies should tell us what they're gathering, tell us what they're going to do with it before they do anything with it, and that companies that are building AI should explain the data that they're using to build the system.
Starting point is 00:15:25 We need to hold these organizations accountable for the AI that they're building by asking them, just like we do with the food industry, what's inside and how'd you make it? Only then can we mitigate the issues before they occur, as opposed to after they occur, and in doing so, create an integrated, algorithmic internet that is healthier for everyone. Thank you.
That was Kasia Chmielinski at the TED Salon Big Bets event in 2024,
Starting point is 00:16:44 supported by the Rockefeller Foundation. If you're curious about TED's curation, find out more at ted.com slash curation guidelines. And that's it for today. TED Talks Daily is part of the TED Audio Collective. This episode was produced and edited by our team, Martha Estefanos, Oliver Friedman, Brian Green, Autumn Thompson, and Alejandra Salazar. It was mixed by Christopher Faisy-Bogan. Additional support from Emma Taubner,
Starting point is 00:17:04 Daniela Balarezo, and Will Hennessey. I'm Elise Hu. I'll be back tomorrow with a fresh idea for your feed. Thanks for listening. Looking for a fun challenge to share with your friends and family? TED now has games designed to keep your mind sharp while having fun. Visit TED.com slash games to explore the joy and wonder of TED Games.
