Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 06: Ethics and Bias in AI with @DataChick

Episode Date: September 29, 2020

Stephen Foskett is joined by Karen Lopez, an expert and speaker on data management, data quality, and data analysis. Karen focuses on the quality of the data underlying AI systems and the ethics of using this data. She discusses concerns about data reuse, consent for use, and how changes to data can impact the outcome of models. We also consider the impact of pervasive data collection, and how this flood of data can impact the outcome of AI models. We finish with a discussion of outliers and missing data, and how these can affect the integrity of artificial intelligence applications.

This episode features:

Stephen Foskett, publisher of Gestalt IT and organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Karen Lopez, Senior Project Manager and Architect at InfoAdvisors. Find Karen's writing at DataModel.com and on Twitter at @Datachick.

Date: 09/29/2020
Tags: @SFoskett, @Datachick

Transcript
Starting point is 00:00:00 Welcome to Utilizing AI, the podcast about enterprise applications for machine learning, deep learning, and other artificial intelligence topics. Each episode brings experts in enterprise infrastructure together to discuss applications of AI in today's data center. I'm your host, Stephen Foskett, organizer of Gestalt IT and Tech Field
Starting point is 00:00:25 Day. You can find me online at gestaltit.com, and you can find me on Twitter at sfoskett. Now let's meet our guest. Hi, I'm Karen Lopez. I'm Datachick on Twitter. I tweet a lot, and I apologize in advance for that. I also blog at datamodel.com. I'm a data architect or data evangelist, whatever you want to call me. So Karen and I go way back and we've been focused on sort of the intersection of data and infrastructure for a long time. And that's why I wanted to talk to you about artificial intelligence and machine learning, because one of the most important aspects of building, for example, a machine learning infrastructure is the data. So you've got the model, you've got the data, and then you've got the applications that come out of that. And I know that this is something that you've been focused on for a long time, specifically with regard to the ethics of data and understanding, um, you know, permissions and so on. So I guess, um, when you first thought about how companies are going to be using
Starting point is 00:01:26 machine learning in production applications, what red flags went up for you as a data expert? Wow. So you're right. I mean, I've been thinking about this for a long time. I'm not an expert in ethics. I try to do the right things. I think most people do. But I started thinking about, you know, how could a data architect, someone who doesn't usually work with AI, but they design the underlying systems, how can they work better with data scientists and AI engineers, users, application developers, to understand as the data is coming into AI ML systems, how it impacts the outcome of those uses. So one of the big ones that I think is the largest one we've learned about is we've collected due to legislation, you know, a lot of consent to use
Starting point is 00:02:22 our data and for it to be collected. But often that data was collected before we used AI technologies. And now is it fair for us to take that data and put it potentially to a new use? And it all depends on how the privacy notice was stated. And most privacy notices are things like, you know, we're going to use your data to do our business. And that's good. So maybe the AI stuff is there. But what if we've now purchased external data about you, your demographic data from a third party? This is especially happening in the US because the legislation is less strict there. Did you agree to give us information about the last time you ran a marathon and now match it up with some other data about you so that we could maybe predict how long you're actually going to be alive to be our customer? Right? Yeah, I think that that's one of those interesting aspects here. And that's one of the things I know that's come up over the years with you and me, when we've been talking about this, this whole question of sort of the transformative value of applications and the fact that, you know, you've got information, you know, you've got data points, you know, just facts and inputs and so on. But applications then transform that into something wholly new and more valuable. And so, you know, you can say, oh, we'll sign our consent form or whatever.
Starting point is 00:03:54 But, you know, we don't really know what's going to come out of that. So, like, in your example, right, I mean, you know, maybe it's one of those apps that tracks your daily run. Can you imagine, certainly you could imagine, oh, I'm going to share my daily run information so that the app can tell me how I did today versus how I did yesterday. But it would be wholly transformative if, say, Facebook bought that data and then used it to predict all sorts of things about you. Or if a health insurer bought that app and then used that to predict, like, your future, you know, healthfulness and so on.
Starting point is 00:04:31 And, and, you know, oh boy, you know, she's slowing down, you know, she must be getting old or she must be, you know, I mean, there's, there's a wholly different aspect of the data depending on how it's used and depending on how it's combined with other things. That's spot on, sort of the question about this particular issue. And I wanted to point out, like in a lot of jurisdictions, such as where I live in Canada, that kind of use of an app selling the data to Facebook or to your insurance company may be illegal. But ethics isn't the same as legal, right? They are related to each other. So in most cases, it's unethical to do something illegal. But an organization could choose not to do that because they want to,
Starting point is 00:05:21 they want their customers to trust them to keep supplying the data. And that's where the sort of ethics of AI and ML come in. So, like in one of my other presentations and discussions I've done with you, I assert that people lie about the data they give you, which we all know happens because mostly we do that as well. You know, what's your email address? What's your phone number when you're registering for something? Not everyone's truthful about those things. And then criminals aren't always truthful about the data they give as they're being arrested or something like that. So we've collected that data, we were going to use it for one use. But now what if a company is going to run that data through a bunch of risk assessments,
Starting point is 00:06:13 which is a typical sort of AI use case, but now you actually were less than ethical in supplying that data? And what if your insurance company now says, you know, we found a correlation between people who use TikTok and people who submit false claims on their insurance, which would be a valid use of data, because we all know that's how insurance companies and actuaries assess risk. Did we know our data was going to, that single piece of data was going to be matched up with other data in order to, maybe they ran a Twitter contest and you gave them your Twitter ID and they found your TikToks there. Like, those are the types of ethical questions. It might be legal for them to do that, but maybe it's not an ethical use. And maybe, you know, as we move on to another topic, maybe the
Starting point is 00:07:14 insights we get from data just aren't valid. So for instance, in the data world, there's these stories of, you know, someone finds that children who do well in school also have their own books at home. So that came out through traditional analysis. But does that mean we should buy books for children to have at home? Will that correlation really cause that type of, you know, an increase in a child's abilities at school? Or was it something else that also resulted in children having books at home? Yeah. And actually I was going to get to that as well, because I think that that's a really interesting aspect of machine learning specifically in that machine learning and AI, you know, it finds all sorts of weird correlations. And sometimes they are valid.
Starting point is 00:08:06 And sometimes they are just totally off the wall. I mean, you know, and it can only, you know, it doesn't know anything. It certainly doesn't know ethics and morality. And so, like, if it found a correlation in a data set, there's no way the people using that machine learning system can even consent themselves to allowing the system to decide something that's just totally off the wall. But yet that's sort of what might happen, right? We might find some strange correlation and it might start acting on a correlation that nobody is ready for. Or knows about, right? So one of the things about most machine learning and a lot of algorithms that are learned is that you can't go review those algorithms, right? Easily or at all, depending on how you're doing it. If you write
Starting point is 00:09:20 an algorithm that does this, then you have that insight into how it's all working. But typically with AI, machine learning, deep learning, you're not in there. You're feeding data, images, sounds into it, setting some parameters, choosing what type of model you want to use, then generating the models for reuse on bigger sets of data. So that brings us to another issue is that, one, how does a company give that notice for that use? Because I said the consent was the big problem, but also a big problem is companies have a hard time communicating what all they're doing with the data in a way that a customer is going to feel confident that their data isn't being abused and that the models
Starting point is 00:10:10 are properly assessed. This has come up a lot during the recent pandemic as people say, you know, oh, the models have changed, the models have changed. Well, it could be the models have changed, or it could be that the data going into them changed, and that's what caused it. It doesn't mean someone changed the model, but the outcomes have changed. So they've either been adjusted, or we're just getting bigger and bigger data sets, such as how the recent addition of more COVID data on children and young adults has changed what comes out at the other end of those models. Yeah, and I guess that's another aspect of it. So you can have strange correlations, but then you can also just input new data and find it acting in some way that you didn't expect. I mean, is there some way of getting our hands around this? Is there some way of, you know, as an industry, computing societies, research groups, universities,
Starting point is 00:11:26 the big companies using it, working with everyone as a community, as a profession, on an ethics framework? So there are lots of uses of AI and ML that don't really have a huge risk for horrible outcomes. And then there are those that are going to deny people a mortgage, deny people access to healthcare, because maybe they're seen as either too low of a risk or too high of a risk for certain treatments. You know, I'm always really excited about these uses of modern insight-related tools. But I'm always back here going, how much can I trust that? How much can I trust that analysis? And that particular question isn't really new to IT, because we had that just when people were coding analysis, like hand-writing queries that matched up your
Starting point is 00:12:19 purchases this time versus last week versus last year. So we've always had that. I think that the thing that makes this more of a challenge is that we don't have that insight always into what the models are doing, like we did when we just looked at the code, and that could be audited. So the other issue that we have is all systems have bias, and people misunderstand that word bias as being like you're bigoted. But bias really just means the context of the data that was fed in, the context of the parameters that were used, the context of the models, and the context of the data that comes out. So I think it's important that we in AI document the biases as we understand them and realize that that constantly has to be refined.
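To make that idea of documenting bias and context concrete, here is a minimal sketch of what recording that information alongside a model might look like, loosely in the spirit of datasheets for datasets and model cards. Everything in it, including the field names and the example notes, is an illustrative assumption rather than anything described in the episode.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json

@dataclass
class DatasetContext:
    """Lightweight record of where a training set came from and its known biases."""
    name: str
    collected_for: str               # the purpose the data was originally gathered for
    consent_covers_ml_use: bool      # does the privacy notice cover this new use?
    known_gaps: list = field(default_factory=list)    # e.g. missing regions or age groups
    known_biases: list = field(default_factory=list)  # defaults, proxies, collection context
    last_reviewed: str = ""

# Hypothetical example: a fitness-app data set being reused for risk modeling.
ctx = DatasetContext(
    name="run_tracker_exports_2020",
    collected_for="Showing users their own daily run history",
    consent_covers_ml_use=False,
    known_gaps=["No users without smartphones", "Sparse data for users over 70"],
    known_biases=["Salutation field defaulted to 'Mr.' on the signup form"],
    last_reviewed=str(date.today()),
)

# Store the context next to the model artifacts so it gets reviewed and refined
# every time the model is retrained, which is the ongoing refinement Karen describes.
with open("run_tracker_exports_2020.context.json", "w") as f:
    json.dump(asdict(ctx), f, indent=2)
```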
Starting point is 00:13:17 Yeah, I think that that's one of those, yeah, we talked about this recently on the On-Premise IT Roundtable podcast, which, you know, we also did, and we talked about, you know, basically trying to apply these things to the world and trying to understand, you know, biases. But, you know, one of the things I think that comes to me now that we're speaking about this again: another aspect that I think we need to consider is that, you know, in a way, AI and machine learning opens up sort of a Pandora's box of using data that we never used because there was simply too much of it. You know, like you said, you know, at the beginning here, you know, maybe,
Starting point is 00:14:09 you know, we did correlate, you know, what you purchased or what activity you did. But, you know, by having an electronic brain processing that data instead of an expensive, you know, fleshy, bloody brain processing that, it basically allows us to use more data, and not just a little bit more, just tremendously more. You know, so for example, it would be absolutely realistic for your marathon-tracking running app to track not just every step but literally every heartbeat and every breath, and correlate those and understand those. Whereas, I mean, it would be ludicrous to suggest a non-AI system would be able to track literally every heartbeat or every breath. This opens up a whole world of possibilities
Starting point is 00:15:04 in terms of just sort of pervasive data collection. And the implications of that are just mind-bending. They are. And for example, there's the concept of over-collecting your data and then over-retaining it. So both of those just substantially add to the risk for the security of the data because, you know, having more data to protect costs more and there's a greater risk if it's compromised. So those two things kind of go together. You know, there are watches now that analyze heartbeats and, like, do a little non-medical EKG, and people got notified by their app that they need to go see their doctor. And I'm like, that's so cool, but, you know, it comes with: what is your watch vendor doing with that data, and what might they be tempted to do with it? And if their database was stuck up in the cloud in an unsecured manner
Starting point is 00:16:06 in a bucket, what might some bad actor do with that data, right? So there's all those things about overcollection. The other problem with overcollection is that, like, I'll just take something really generic that we've collected over the years. Like, what if we've asked someone what your gender is, right? And the whole reason we wanted that is so when you talk to a CSR, they know, they have more confidence on how to refer to you. But now we know that that's not quite a one-to-one match. And we've introduced the problem of now people are going to have to tell, I'll just make this up, their app system,
Starting point is 00:16:52 that their gender has changed. And what if they're doing AI and have sold that data to someone? And now information that was just collected mostly for how to refer to you on a phone call is being used for something else. Same thing happens with your salutation, whether you're Mr. or Ms. I mean, most people don't care whether you're married or single or divorced or whatever that might indicate. And it's a lousy indicator anyway, but now it's being used for something much bigger than what we supplied it for. Or what if, you know, the drop-down box had Mr. first and someone's signing up for a newsletter, so they didn't bother to go through and change it to Dr. or Ms. or Mrs.? And now all of a sudden, we've again incorrectly provided data, but only because back then it didn't matter. And now it might matter.
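A rough sketch of how a team might catch that kind of problem before the data reaches a model: check whether a field is dominated by the form's default value, which suggests many of those values were never consciously chosen. The column names, the threshold, and the use of pandas here are illustrative assumptions, not anything from the episode.

```python
import pandas as pd

# Hypothetical signup data where "Mr." happened to be the drop-down default.
df = pd.DataFrame({
    "salutation": ["Mr."] * 9000 + ["Ms."] * 600 + ["Dr."] * 400,
    "newsletter_signup": [True] * 10000,
})

def flag_suspicious_defaults(frame: pd.DataFrame, default_values: dict, threshold: float = 0.8):
    """Warn when a column is dominated by its form default, which suggests the
    value was never consciously chosen and shouldn't feed a model as-is."""
    warnings = {}
    for column, default in default_values.items():
        share = (frame[column] == default).mean()
        if share >= threshold:
            warnings[column] = round(share, 3)
    return warnings

print(flag_suspicious_defaults(df, {"salutation": "Mr."}))
# {'salutation': 0.9}  -> 90% of rows carry the default, so treat this field with suspicion
```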
Starting point is 00:17:46 So what is, you know, this, this all just comes back to: if you don't understand this meaning in the metadata and the context of the data, your models are going to be wrong because of that thing. And there are techniques for finding anomalies in the data and then doing something with it to fix it before it goes into a model, which for me as a transaction data person, that just makes it crazy for me. Yeah. And that situation actually is not, you know, just, you know, it may sound trivial that, oh, I always select Mrs., but now I'm going to select Mr. in the box. But when it comes to a machine learning system that does health correlations, understand that whether you are a Mr. or a Mrs. might dramatically affect how it interprets your
Starting point is 00:18:53 blood pressure or your heartbeat or your sleepfulness and resting and all sorts of other things. And by checking that box, suddenly you may have, you know, really tripped a switch inside the system. And that's a whole interesting one as well, because some things are, you know, sort of chemically, you know, biologically driven. Some things are, you know, genetically driven and, and, and, you know, the nurturing of, you know, your, your, yourself. And by, you know, moving across those boundaries, you may, it may come up with a whole world of incorrect assumptions about you, you know, because, oh, well, I prefer to be addressed as Mrs. But that, you know, that may not be a full indicator of the rest of my being. And I think that that's something that really is hard, hard to program in the best case, but certainly hard to program into a neural
Starting point is 00:20:07 network that is just combing through massive amounts of data. And I guess it all comes down to this question of outliers and sort of outliers in data. We've talked about this before. You and I have talked about the famous case of the self-driving car that never assumed that anyone would ride a bicycle perpendicularly across a limited-access street. And so it just drove right on through. You know, the whole idea of outliers, I think, is fundamental to so many areas of data analysis and sort of understanding, and yet machine learning is
Starting point is 00:20:47 almost pathologically incapable of handling outliers. There's that, and one of the specific ones, it's, I mean, calling it an outlier, it is, but it's a special case in AI, is that AI doesn't want missing data. So I'm not just talking about, you know, whether or not you really have a middle name, because that's a transactional thing. But with AI systems, it's not a transactional system. They have to work with data that, you know, doesn't have everyone's date of birth, which is a common important thing that feeds into a model depending on what business you're in. So this really blew my mind the first time I went to a presentation about this. So there are many techniques for filling in missing data
Starting point is 00:21:40 when we don't know what that missing data is. And that blows my mind because as a transactional person, you know, someone who works with transactional systems, we don't make up data. Now the AI people, you know, we say we're not making up data, but to a transactional person, oh my gosh, that's making up data. It's not, it's somewhere in between that. And so what they do is, one of the techniques is looking at the rest of the data, your salutation, your gender, your name, some, you know, maybe what year you graduated university, whatever it is that we're doing. And it compares you, that missing data, I'll call it a row, to all the other data and makes certain assertions about what your date of birth might be. Just totally, you know, or what your age might be, let's say
Starting point is 00:22:34 that, what your year of birth might be. And we would never do that in a banking system, or even in a health system normally. But those are transactional systems. We would do that in AI. And so if the data, the confidence or the understanding of the underlying data is wrong, then those assertions that are made for missing data, they're going to be less, you'll have less confidence in them. And that to me is something that I don't have to worry about in my day job that I would have to worry about in an AI job. So I guess in summary, you know, now that we've spent a little time talking about data and the sort of the ethics and implications of data, I guess, what have you
Starting point is 00:23:19 learned as a data, I don't know, pundit, what have you learned that you would wish to express to people who are trying to integrate, you know, vast amounts of data into an artificial intelligence system? What would you like to warn them about? What would you like to tell them? Yeah, so the first thing when I work with or mentor data scientists is I tell them the data is likely not at all what you think it is. So even a data scientist, which I know there's a difference between a traditional data scientist and AI stuff, is, you know, there's that metric that data scientists spend four days of their five-day week sourcing, prepping, and cleansing data. And a lot of that is because they're trying to figure out why this data is missing, why there's no one in this data lake that has a last name that starts higher than M in the alphabet. And they might not even notice that. And then they run their models and they're like,
Starting point is 00:24:25 there's no one from the Midwest in this data set. Why is that? I didn't know that that was supposed to be a list of all of our customers. Data architects like myself understand that sort of thing about data. But most people who are on this analysis side have not had to feel all those pains. And so they can be overconfident in that data. The data is not as clean as you think. It's not as self-explanatory as you think. And there's a whole bunch of reasons for that. But as long as we know the context of the data, then we can deal with it in the AI way, which is different than in the transactional way.
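As a rough illustration of the two things Karen describes, profiling a data set for gaps you did not know you had and filling in missing values by comparing a row to similar rows, here is a small sketch using pandas and scikit-learn. The columns and data are made up, and KNN imputation is just one of the many techniques she alludes to.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical customer extract with a suspicious gap and some missing birth years.
df = pd.DataFrame({
    "last_name":     ["Adams", "Baker", "Chen", "Diaz", "Evans", "Garcia"],
    "region":        ["South", "West", "West", "South", "West", "South"],
    "marathons_run": [1, 0, 4, 2, 0, 3],
    "year_of_birth": [1975, None, 1990, None, 1958, 1982],
})

# 1. Profile for gaps: the kinds of surprises Karen mentions (no Midwest customers,
#    no last names after "M") only show up if you go looking for them.
print("Regions present:", sorted(df["region"].unique()))
print("Last initials:", sorted(df["last_name"].str[0].unique()))
print("Missing year_of_birth:", df["year_of_birth"].isna().sum())

# 2. Fill in missing values from similar rows. This is "making up data" from a
#    transactional point of view, so keep the original column for auditability.
imputer = KNNImputer(n_neighbors=2)
numeric = df[["marathons_run", "year_of_birth"]]
df["year_of_birth_imputed"] = imputer.fit_transform(numeric)[:, 1]

print(df[["last_name", "year_of_birth", "year_of_birth_imputed"]])
```

If the underlying data is misunderstood, the imputed values inherit that misunderstanding, which is the loss of confidence Karen warns about.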
Starting point is 00:25:10 Interesting. And I think that that's so true. I mean, overall, what I've learned myself in years of doing enterprise tech is that, you know, it's that old saying, we don't know what we don't know. And we all have to be honest about what we do and don't know, and actually try to bring people in who have a better understanding. So somebody, you know, like yourself, who has a better understanding of the issues that accompany, you know, data sets, that would be a really valuable voice, you know, when you're trying to figure out, you know, how to use these data sets in an artificial intelligence context. So thank you very much for joining us today. Where can people connect with you and follow your thoughts on enterprise AI and other topics? Data Chick on
Starting point is 00:25:55 Twitter, datamodel.com is where I blog. And I also have a YouTube channel that you can find on my blog as well. Great. Well, thanks for listening to Utilizing AI. If you enjoyed this discussion, please remember to rate, subscribe, and review the show on iTunes, since that really helps our visibility with the AI and machine learning driven engines of iTunes. And please do share this show with your friends. This podcast was brought to you by gestaltit.com, your home for IT coverage across the enterprise. For show notes and more episodes, go to utilizing-ai.com or find us on Twitter at utilizing underscore AI. Thanks a lot, and we'll see you next time.
