Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 06: Ethics and Bias in AI with @DataChick
Episode Date: September 29, 2020. Stephen Foskett is joined by Karen Lopez, an expert and speaker on data management, data quality, and data analysis. Karen focuses on the quality of the data underlying AI systems and the ethics of using this data. She discusses concerns about data reuse, consent for use, and how changes of data can impact the outcome of models. We also consider the impact of pervasive data collection, and how this flood of data can impact the outcome of AI models. We finish with a discussion of outliers and missing data, and how this can affect the integrity of artificial intelligence applications. This episode features: Stephen Foskett, publisher of Gestalt IT and organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett. Karen Lopez, Senior Project Manager and Architect at InfoAdvisors. Find Karen's writing at DataModel.com and on Twitter at @Datachick. Date: 09/29/2020 Tags: @SFoskett, @Datachick
Transcript
Welcome to Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
Each episode brings experts in
enterprise infrastructure together to discuss
applications of AI in today's data center.
I'm your host, Stephen Foskett,
organizer of Gestalt IT and Tech Field
Day. You can find me online at gestaltit.com, and you can find me on Twitter at sfoskett.
Now let's meet our guest. Hi, I'm Karen Lopez. I'm Datachick on Twitter. I tweet a lot,
and I apologize in advance for that. I also blog at datamodel.com. I'm a data architect or data evangelist, whatever you want to call me.
So Karen and I go way back and we've been focused on sort of the intersection of data and infrastructure for a long time.
And that's why I wanted to talk to you about artificial intelligence and machine learning, because one of the most important aspects of building, for example, a machine learning infrastructure is the data. So you've got the model, you've got the data, and then you've got the
applications that come out of that. And I know that this is something that you've been focused on for
a long time, specifically with regard to the ethics of data and understanding, you know,
permissions and so on. So I guess, when you first thought about how companies are going to be using
machine learning in production applications, what red flags went up for you as a data expert?
Wow. So you're right. I mean, I've been thinking about this for a long time.
I'm not an expert in ethics. I try to do the right things. I think most people do.
But I started thinking about, you know, how could a data architect, someone who doesn't
usually work with AI, but they design the underlying systems, how can they work better
with data scientists and AI engineers, users, application developers, to understand as the data is coming into AI ML systems,
how it impacts the outcome of those uses. So one of the big ones, I think the largest one
we've learned about, is that, due to legislation, we've collected, you know, a lot of consent to use
our data and for it to be collected. But often that data was collected
before we used AI technologies. And now is it fair for us to take that data and put it
potentially to a new use? And it all depends on how the privacy notice was stated. And most privacy
notices are things like, you know, we're going to use your data to do our business. And that's good. So maybe the AI stuff is there. But what if we've now purchased external data about you, your demographic data from a third party? This is especially happening in the US because the legislation is less strict there. Did you agree to give us information about the last
time you ran a marathon and now match it up with some other data about you so that we could maybe
predict how long you're actually going to be alive to be our customer? Right?
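To make that concrete: one way teams handle this in practice is a purpose-limitation check before any new use of old data. The sketch below is a minimal, hypothetical illustration of that idea; the ConsentRecord fields and the purpose names are invented for this example, not anything discussed in the episode.

```python
from dataclasses import dataclass, field

@dataclass
class ConsentRecord:
    """Hypothetical record of what a customer agreed to when their data was collected."""
    customer_id: str
    consented_purposes: set = field(default_factory=set)  # e.g. {"billing", "support"}

def can_use_for(record: ConsentRecord, purpose: str) -> bool:
    """Return True only if the customer explicitly consented to this purpose.

    Data collected under a broad "we'll use your data to do our business" notice
    was probably never consented for model training or third-party enrichment,
    so those purposes should fail closed here.
    """
    return purpose in record.consented_purposes

# Example: consent gathered before any AI use existed.
record = ConsentRecord("cust-42", {"billing", "support"})
print(can_use_for(record, "billing"))                 # True
print(can_use_for(record, "ml_risk_scoring"))         # False -> needs fresh consent
print(can_use_for(record, "third_party_enrichment"))  # False
```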
Yeah, I think that's one of the interesting aspects here. And that's one of the things that's come up over the years with you and me when we've been talking about this: this whole question of the transformative value of applications. You've got information, you've got data points, just facts and inputs and so on, but applications then transform that into something wholly new and
more valuable. And so, you know, you can say, oh, we'll sign our consent form or whatever,
but we don't really know what's going to come out of that. So, like, in your example,
maybe it's one of those apps that tracks your daily run.
Can you imagine, certainly you could imagine,
oh, I'm going to share my daily run information so that the app can tell me how I did today versus how I did yesterday.
But it would be wholly transformative if, say, Facebook bought that data
and then used it to predict all sorts of things about you.
Or if a health insurance company
bought that app and then used that to predict, like, your future, you know, healthfulness and so on.
And, you know, oh boy, she's slowing down, she must be getting old, or she must be, you know...
I mean, there's a wholly different aspect of the data depending
on how it's used and depending on how it's combined with other things.
That's spot on, sort of the question about this particular issue. And I wanted to point out,
like in a lot of jurisdictions, such as where I live in Canada, that kind of use of an app,
selling the data to Facebook or to your insurance company, may be illegal. But ethics isn't the same as legal,
right? They are related to each other. In most cases, it's unethical to do something illegal.
But an organization could also choose not to do something, even where it's legal, because they want
their customers to trust them to keep supplying the data. And that's where the
sort of ethics of AI and ML come in. So, like in one of the other presentations and
discussions I've done with you, I assert that people lie about the data they give you,
which we all know happens because mostly we do that as well. You know, what's your
email address? What's your phone number when you're registering for something? Not everyone's
truthful about those things. And then criminals aren't always truthful about the data they give
as they're being arrested or something like that. So we've collected that data, we were going to use it for one use.
But now what if a company is going to run that data through a bunch of risk assessments,
which is a typical sort of AI use case, but now you actually were less than ethical in supplying that data? And what if your insurance company now says,
you know, we found a correlation between people who use TikTok and people who submit
false claims on their insurance, which would be a valid use of data, because we all know that's
how insurance companies and actuaries assess risk. But did we know that
single piece of data was going to be matched up with other data in order to do that? Maybe they ran a
Twitter contest and you gave them your Twitter ID, and they found your TikToks from there. Those are
the types of ethical questions. It might be legal for them to do that,
but maybe it's not an ethical use. And maybe, as we move on to another topic, maybe the
insights we get from data just aren't valid. So for instance, in the data world, there are these
stories of, you know, someone finds that children who do well in school also have
their own books at home. That came out through traditional analysis. But does that mean we should
buy books for children to have at home? Will that correlation really cause that kind of
increase in a child's abilities at school? Or was it something else that
also resulted in children having books at home?
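To illustrate the books-at-home point with numbers: the following is a small synthetic simulation, not data from the episode, where a hidden confounder drives both variables, so a strong correlation appears even though intervening on one variable would not change the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hidden confounder: household resources (entirely synthetic).
resources = rng.normal(0, 1, n)

# Both variables depend on the confounder, not on each other.
books_at_home = resources + rng.normal(0, 1, n)
school_score  = resources + rng.normal(0, 1, n)

# Strong observed correlation...
print(np.corrcoef(books_at_home, school_score)[0, 1])  # roughly 0.5

# ...but "buying books" (raising books_at_home without touching resources)
# does nothing to the mechanism that actually produces school_score.
```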
Yeah, and actually I was going to get to that as well, because I think that's a really interesting aspect of machine learning specifically,
in that machine learning and AI find all sorts of weird correlations. And sometimes they are valid.
And sometimes they are just totally off the wall.
I mean, it doesn't know anything.
It certainly doesn't know ethics and morality. And so, like, if it found a correlation in a data set, the people using that machine learning system can't really even consent
to allowing the system to decide something that's just totally off the wall.
But yet that's sort of what might happen, right? We might find some strange correlation and it
might start acting on a correlation that nobody is ready for. Or knows about, right? So one of the things about
most machine learning and a lot of algorithms that are learned is that you can't go review
those algorithms, right? Easily or at all, depending on how you're doing it. If you write
an algorithm that does this, then you have that insight into how it's all working. But
typically with AI, machine learning, deep learning, you're not in there. You're feeding data, images,
sounds into it, setting some parameters, choosing what type of model you want to use,
then generating the models for reuse on bigger sets of data. So that brings us to
another issue, which is: how does a company give notice for that use? Because I said
the consent was the big problem, but also a big problem is companies have a hard time communicating
what all they're doing with the data in a way
that a customer is going to feel confident that their data isn't being abused and that the models
are properly assessed. This has come up a lot during the recent pandemic as people say, you
know, oh, the models have changed, the models have changed. Well, it could be the models have changed,
or it could be that the data going into them changed, and that's what caused it. It doesn't mean someone changed the model, but the outcomes
have changed. So either they've been adjusted, or we're just getting bigger and bigger data sets,
such as the recent addition of more COVID data on children and young adults, which has changed
what comes out at the other end of those models.
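Here is a toy sketch of "the model didn't change, the data did," using synthetic ages rather than any real COVID data: the same model class, refit once new records arrive, gives a different answer to the same question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Initial training data: mostly older subjects (synthetic).
age_v1 = rng.uniform(40, 80, (500, 1))
outcome_v1 = (rng.random(500) < (age_v1[:, 0] - 30) / 60).astype(int)

model = LogisticRegression().fit(age_v1, outcome_v1)
print(model.predict_proba([[25]])[0, 1])  # extrapolating: almost no data near 25

# New data arrives for younger subjects; same model class, same settings.
age_v2 = np.vstack([age_v1, rng.uniform(10, 40, (500, 1))])
outcome_v2 = np.concatenate([outcome_v1,
                             (rng.random(500) < 0.05).astype(int)])

model = LogisticRegression().fit(age_v2, outcome_v2)
print(model.predict_proba([[25]])[0, 1])  # same question, different answer
```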
Yeah, and I guess that's another aspect of it. So you can have strange correlations, but then you can also just input new data and find it acting in some way that you didn't expect.
I mean, is there some way of getting our hands around this? Is there some way of, you know, as an industry, with computing societies, research groups, universities,
and the big companies using it, working with everyone as a community, as a profession, on an ethics
framework? So there are lots of uses of AI and ML that don't really have a huge risk for horrible
outcomes. And then there are those that are going to deny people a mortgage,
deny people access to healthcare, because maybe they're seen as either too low of a risk or too
high of a risk for certain treatments. You know, I'm always really excited about these uses of
modern insight-related tools. But I'm always back here going, how much can I trust that? How much
can I trust that analysis? And that particular question isn't really new to IT, because we had
that back when people were coding the analysis, like hand-writing queries that matched up your
purchases this time versus last week versus last year. So we've always had that. I think the thing
that makes this more of a challenge is that we don't have that insight always into what the
models are doing, like we did when we just looked at the code, and that could be audited.
So the other issue that we have is that all systems have bias, and people misunderstand that word, bias, as meaning,
like, you're bigoted. But bias really just means the context of the data that was fed in, the context of
the parameters that were used, the context of the models, and the context of the data that comes out.
So I think it's important that we in AI document the biases as we understand them
and realize that that constantly has to be refined.
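One concrete way to "document the biases as we understand them" is to keep a small, reviewable record of the data's context next to the model, in the spirit of datasheets for datasets. This is only a sketch; the fields and the example values are invented.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetContext:
    """A lightweight 'datasheet' recording the context (the biases) of a data set."""
    name: str
    collected_for: str                  # original purpose of collection
    collection_period: tuple            # (start, end)
    known_gaps: list = field(default_factory=list)   # segments known to be missing
    known_skews: list = field(default_factory=list)  # over/under-represented groups
    last_reviewed: date = date.today()  # bias documentation has to be refined over time

claims_data = DatasetContext(
    name="claims_2018_2020",
    collected_for="claims processing, not risk modeling",
    collection_period=("2018-01", "2020-06"),
    known_gaps=["no customers onboarded via mobile app"],
    known_skews=["urban policyholders over-represented"],
)
print(claims_data)
```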
Yeah, I think that that's one of those things,
yeah, we talked about this recently on the On-Premise IT Roundtable podcast,
which we also did, and we talked about, you know, basically trying to apply these things to the
world and trying to understand biases. But one of the things that
comes to me, now that we're speaking about this again, another aspect that I think we need to consider, is that, in a way, AI and machine
learning opens up sort of a Pandora's box of using data that we never used because there
was simply too much of it. Like you said at the beginning here, maybe
we did correlate what you purchased or what activity you did.
But by having an electronic brain processing that data instead of an expensive,
you know, fleshy, bloody brain processing it, it basically allows us to use
more data, and not just a little bit more, but tremendously more. So, for example,
it would be absolutely realistic for your marathon-tracking running app to track not just every step but literally every heartbeat and every
breath, and to correlate those and understand those. Whereas, I mean, it would
be ludicrous to suggest a non-AI system would be able to track literally
every heartbeat or every breath. This opens up a whole world of possibilities
in terms of just sort of pervasive
data collection. And the implications of that are just mind-bending.
They are. And for example, there's the concept of over-collecting your data and then over-retaining
it. Both of those just substantially add to the risk for the security of the data, because having more data to protect costs more, and there's a greater risk if it's compromised.
So those two things kind of go together. Like, people have these watches with an app that analyzes their heartbeats and does a little non-medical EKG, and people got
notified by their app that they need to go see their doctor, and I'm like, that's so cool.
But, you know, it comes with: what is your watch vendor doing with that data, and what might they
be tempted to do with it? And if their database was stuck up in the cloud in an unsecured manner,
in a bucket, what might some bad actor do with that data, right? So there's all those things
about overcollection. The other problem with overcollection is that, like, I'll just take
something really generic that we've collected over the years. Like, what if we've asked someone what your gender is, right? And the whole reason we wanted
that is so when you talk to a CSR, they know, they have more confidence on how to refer to you.
But now we know that that's not quite a one-to-one match. And we've introduced the problem of,
now people are going to have to tell,
I'll just make this up,
their app system
that their gender has changed.
And what if that company is doing AI
and has sold that data to someone?
Now information that was collected
mostly for how to refer to you on a phone call is being used for something else. The same thing happens with your salutation, whether you're Mr. or Ms. I mean, most people don't care whether you're married or single or divorced or whatever that might indicate, and it's a lousy indicator anyway. But what if it now feeds something much bigger than what we supplied it for?
Or what if, you know, the drop-down box had Mr. listed first, and someone signing up for a newsletter
didn't bother to go through and change it to Dr. or Ms. or Mrs.? Now all of a sudden,
we've again incorrectly provided data, but only because back then it didn't matter. And now it might matter.
So, you know, this all just comes back to: if you don't understand the meaning in
the metadata and the context of the data, your models are going to be wrong by that much. And
there are techniques for finding anomalies in the data and then doing something to fix them before they
go into a model, which, for me as a transactional data person, just drives me crazy.
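As an example of the kind of anomaly-finding technique mentioned here, the sketch below uses scikit-learn's IsolationForest to flag suspicious values in a synthetic heart-rate column for human review before the data reaches a model. It's illustrative only; the data and the contamination setting are made up.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)

# Synthetic resting heart rates, with a few impossible values mixed in
# (e.g. unit errors or a mis-ticked form field upstream).
heart_rate = np.concatenate([rng.normal(70, 10, 995), [0, 0, 300, 280, 5]])
X = heart_rate.reshape(-1, 1)

# IsolationForest labels the most isolated points as -1 (anomalies).
flags = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)

print("rows flagged for review:", np.where(flags == -1)[0])
print("values:", heart_rate[flags == -1])
# A transactional system would simply reject these; for a model, the point is
# to review them *before* they quietly shape the training data.
```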
Yeah. And that situation actually is not trivial. It may sound trivial that,
oh, I always select Mrs., but now I'm going to select Mr. in the box.
But when it comes to a machine, a system that does health correlations,
whether you are a Mr. or a Mrs. might dramatically affect how it interprets your
blood pressure or your heartbeat or your sleep and rest and all sorts of other
things. And by checking that box, suddenly you may have really tripped a switch inside the system that would give it a whole different picture of you. And that's an interesting one as well, because some things are,
you know, sort of chemically, biologically driven. Some things are
genetically driven, and some come from, you know, the nurturing of yourself.
And by moving across those boundaries, it may come up with a whole world of incorrect assumptions
about you, because, oh, well, I prefer to be addressed as Mrs., but that may
not be a full indicator of the rest of my being. And I think that's something that really is
hard to program in the best case, but certainly hard to program into a neural
network that is just combing through massive amounts of data.
And I guess it all comes down to this question of outliers and sort of outliers in data.
We've talked about this before.
You and I have talked about the famous case of the self-driving car that never assumed
that anyone would ride a bicycle perpendicularly across a limited access street.
And so it just drove right on through.
You know, the whole idea of outliers, I think, is fundamental to so many areas of data analysis
and sort of understanding, and yet machine learning is
almost pathologically incapable of handling outliers. There's that, and one of the specific
ones, I mean, calling it an outlier, it is, but it's really a special case in AI, is that AI doesn't
want missing data. So I'm not just talking about, you know, whether or not
you really have a middle name or not, because that's a transactional thing. But an AI system
isn't a transactional system. It has to work with data that, you know, doesn't have everyone's
date of birth, which is a common, important thing that
feeds into a model, depending on what business you're in. So this really blew my mind the first
time I went to a presentation about this. So there are many techniques for filling in missing data
when we don't know what that missing data is. And that blows my mind because as a transactional
person, you know, someone who works with transactional systems, we don't make up data.
Now the AI people, you know, say they're not making up data, but to a transactional person,
oh my gosh, that's making up data. It's not, exactly; it's somewhere in between. And so what they do, one of the techniques, is to look at the rest of the data: your salutation,
your gender, your name, maybe what year you graduated university, whatever
it is that we have.
And it compares you, that missing data, I'll call it a row, to all the other data and makes certain assertions about
what your date of birth might be, or what your age might be, let's say,
what your year of birth might be. And we would never do that in a banking system,
or even in a health system normally. But those are transactional systems. We would do that in AI.
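A minimal sketch of the kind of fill-in described here, using scikit-learn's KNNImputer: the missing year of birth is inferred from the rows that look most similar on the other columns. The numbers are invented; the point is that the filled value is an assertion, not a recorded fact.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: graduation_year, years_as_customer, year_of_birth (NaN = missing).
rows = np.array([
    [1998, 12, 1976],
    [2002, 10, 1980],
    [2001,  9, 1979],
    [2015,  3, 1993],
    [2016,  2, np.nan],   # year_of_birth missing for this "row"
])

# Fill the gap from the 2 most similar rows on the known columns.
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(rows)

print(filled[-1])  # year_of_birth filled in as an *assertion*, not a fact
# If the surrounding data is skewed or misunderstood, this assertion
# inherits that bias, which is the point about confidence above.
```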
And so if the confidence in, or the understanding of, the underlying data is wrong,
then those assertions that are made for missing data, you'll have less confidence in them.
And that, to me, is something
that I don't have to worry about in my day job that I would have to worry
about in an AI job. So I guess in summary, you know, now that we've spent a little time talking
about data and the sort of the ethics and implications of data, I guess, what have you
learned as a data, I don't know, pundit, what have you learned that you would wish to express to people
who are trying to integrate, you know, vast amounts of data into an artificial intelligence
system? What would you like to warn them about? What would you like to tell them?
Yeah, so the first thing when I work with or mentor data scientists is I tell them the data is likely not at all what you think it is.
So even for a data scientist, and I know there's a difference between a traditional data scientist and the AI stuff, you know, there's that metric that data scientists spend four days of their five-day week sourcing, prepping, and cleansing
data. And a lot of that is because they're trying to figure out why this data is missing, why there's
no one in this data lake that has a last name that starts higher than M in the alphabet. And they
might not even notice that. And then they run their models and they're like,
there's no one from the Midwest in this data set. Why is that? I didn't know that that was
supposed to be a list of all of our customers. That's the sort of thing data architects like myself understand
about data. But most people who are on the analysis side have not had to feel all those pains.
And so they can be overconfident in that data.
The data is not as clean as you think.
It's not as self-explanatory as you think.
And there's a whole bunch of reasons for that.
But as long as we know the context of the data, then we can deal with it in the AI way, which is different than in the transactional way.
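The "no last names past M, no one from the Midwest" surprises described above can often be caught with a few cheap profiling checks before any modeling. Below is a minimal pandas sketch with invented column names and data.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> None:
    """Cheap sanity checks to run before the data ever reaches a model."""
    # How much is missing, column by column?
    print("missing fraction:\n", df.isna().mean(), sep="")

    # Does a supposedly full customer list actually cover the alphabet?
    initials = df["last_name"].str[0].str.upper()
    print("last-name initials present:", sorted(initials.dropna().unique()))

    # Are whole regions silently absent?
    print("rows per region:\n", df["region"].value_counts(dropna=False), sep="")

# Invented sample: note nothing after 'M' and no 'Midwest' rows at all.
df = pd.DataFrame({
    "last_name": ["Adams", "Baker", "Lopez", "Martin", None],
    "region": ["Northeast", "South", "West", "West", "South"],
})
profile(df)
```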
Interesting. And I think that that's so true.
I mean, overall, what I've learned myself in years of doing enterprise tech is that, you know, it's that old saying, we don't know what we don't know. And we all have to be
honest about what we do and don't know, and actually try to bring in people who have a
better understanding. So somebody like yourself, who has a better understanding of the
issues that accompany data sets, would be a really valuable voice
when you're trying to figure out how to use these data sets in an
artificial intelligence context. So thank you very much for joining us today. Where can people
connect with you and follow your thoughts on enterprise AI and other topics? Data Chick on
Twitter, datamodel.com is where I blog. And I also have a YouTube channel that you can find on my
blog as well. Great. Well, thanks for listening
to Utilizing AI. If you enjoyed this discussion, please remember to rate, subscribe, and review the
show on iTunes, since that really helps our visibility with the AI and machine learning
driven engines of iTunes. And please do share this show with your friends. This podcast was
brought to you by gestaltit.com, your home for IT coverage
across the enterprise. For show notes and more episodes, go to utilizing-ai.com or find us on
Twitter at utilizing underscore AI. Thanks a lot, and we'll see you next time.