Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x26: DataOps - Putting the Data in Data Science
Episode Date: March 29, 2022
The quality of an AI application depends on the quality of the data that feeds it. Sunil Samel joins Frederic Van Haren and Stephen Foskett to discuss DataOps and the importance of data quality. When we consider data-centric AI, we must consider all aspects of the data pipeline, from storing, transporting, and understanding to controlling access and cost. We must look at the data needed to train our models, think about the desired outcomes, and consider the sources and pipeline needed to get that result. We must also decide how to define quality: Do we need a variety of data sources? Should we reject some data? How does the modality of the data type change this definition? Is there bias in what is included and excluded? Data pipelines are usually simple, ingesting and storing data from the source, slicing and preparing it, and presenting it for processing. But DataOps recognizes that the data pipeline can get very complicated and requires understanding of all these steps as well as adaptation from development to production.
Three Questions:
Frederic: Do you think we should expect another AI winter?
Stephen: When will we see a full self-driving car that can drive anywhere, any time?
Mike O'Malley, Seneca Global: Can you give an example where an AI algorithm went terribly wrong and gave a result that clearly wasn't correct?
Guests and Hosts:
Sunil Samel, VP of Products at Akridata. Connect with Sunil on LinkedIn or email him at sunil.samel@akridata.com.
Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.
Date: 3/29/2022 Tags: @SFoskett, @FredericVHaren
Transcript
I'm Stephen Foskett.
I'm Frederic Van Haren.
And this is the Utilizing AI podcast.
Welcome to another episode of Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, data science, and other artificial intelligence topics.
Over the years, we've spent a lot of time talking about unintended outcomes of AI.
We've talked about ethics and bias.
And one of the reasons that a lot of this stuff happens
is data quality.
And I think that that's something
that we really need to zoom in on as an industry.
Right, Frederic?
Yeah, it's all about the data quality,
not about the data quantity.
Collecting the data is one thing, but you have to send it through a pipeline to clean it, process it in order to create models.
Exactly. And that's leading to a new field to go along with MLOps, DataOps.
And I think that that's all about the idea of quality, of having a quality pipeline.
And it doesn't just mean the quality of the underlying data.
It means the quality of everything that leads into the model.
And that's why we've decided to invite Sunil Samel here to join us
to talk about DataOps and data quality.
Nice to have you, Sunil.
Thank you, Steve.
And thanks to you, Frederic, for introducing me to Steve. It's my pleasure to be here.
My name is Sunil Samel.
I am with a company called Akridata.
But I started my journey many, many years back.
I've been a bit of a startup nomad.
My first startup was out of a research institute in Belgium called IMEC.
And that was in designing tools for designing systems on chip,
after which I did a fabless semiconductor company,
after which it was a bit of high-performance storage,
then a stint in application security. And here we are in what we think
is a rather neglected side of AI and machine learning, which is
how do you manage your data so that you can train your models better?
And when we talk about data, I think it's important to draw a distinction.
Now, I've mentioned before, including last week, that I'm a big storage nerd, and I'm
really interested in how data is stored and organized and presented.
That's not what we mean here in terms of data ops, right?
We're talking data, which is the next level up from storage, right?
Absolutely.
Actually, you made a very, very good point.
A lot of times when it comes to data, people confound it with storage and maybe networking. It's about how do I collect data and
transfer it to someplace where I can do something with it. But that's not really what we mean by
data. I mean, if anything, with machine learning, with AI, the data is what informs the machine
learning model. And so selecting the right data is an important thing.
So whether it is quality or an even more kind of obscure notion of relevance, is this data more
meaningful for us? When people talk about long tails or rare scenarios, what is that? And how
do I find that out of the big data that I'm collecting?
And to both your points, it's not about the big data, it's about the right data.
And so there has been a recognition of that in recent times. And folks like Professor Andrew
Ng have come up with the term data-centric AI. So it's about storing the data, it's about transporting the data, it's about understanding it, finding the right bits of data.
It's about in the days of privacy, we have to be careful about who accesses it and are they authorized to access it?
The data is being collected in different locations, all of these aspects.
And then, of course, there's this overriding thing about cost.
So just like an elephant's memory, data keeps on getting collected,
but somebody's got to pay for it.
So yeah, when we are talking about data ops,
that's what we are talking about.
It's not just about storage.
It's not just about networking.
It's interesting that you're talking about data-centric AI
because in my book, AI means it's all about data. So it's almost like
saying it twice. The reason why Professor
Andrew Ng brought it up, is that because he wanted to accentuate
the DataOps part of it?
I think I would say that he came to it
based upon his own experiences, which are vast.
And I think he's been recognizing
that it's not about indiscriminately training models
on a lot of data.
And that's one way to get bias in there.
I mean, we have these famous examples,
unfortunate examples of bots going rogue, where they kind of learn from everything on the internet
and then start becoming racist. So what is important then is to curate that data properly
so that you can find, you know, to give that even handedness to the training.
So when you look at that,
it's about selecting the data to train the models.
So that's where the data-centricness comes in.
Look at the data that you need to train your models,
feed them the right kinds of problem scenarios,
long tail scenarios, the scenarios from which you can learn,
and then you get the right kind of results.
So do you think that DataOps can be self-healing in the sense that it can recognize when there
is bad data, when there is ethical complications involved or something like that?
Yeah, this is an interesting point.
Actually, in some ways, you might say, can we apply AI to AI training?
And there are movements in that.
So where we are today is really allowing the people who are responsible for training models
or retraining models, improving models, giving them a nice framework
and infrastructure with which to collect data, understand it, select the right bits, manage it,
understand the cost implications, et cetera, et cetera. So there is a bit of a user-guided aspect
to it. But then along with this, there are aspects like active learning.
So basically the model participates in the process of recognizing which areas need improvement.
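[Editor's note: to make the active learning idea concrete, here is a minimal uncertainty-sampling sketch, where the current model scores unlabeled frames and the least confident ones are queued for labeling and retraining. The model interface and data layout are illustrative assumptions, not a description of Akridata's product.]

```python
# Hypothetical uncertainty-sampling sketch: the model itself helps pick
# which frames need labeling by flagging the ones it is least sure about.
import numpy as np

def select_for_labeling(model, unlabeled_frames, budget=100):
    """Return the frames the current model is least confident about."""
    # Assumes an sklearn-style model: predict_proba -> (n_samples, n_classes)
    probs = model.predict_proba(unlabeled_frames)
    confidence = probs.max(axis=1)                    # top-class probability per frame
    uncertain_idx = np.argsort(confidence)[:budget]   # lowest confidence first
    return [unlabeled_frames[i] for i in uncertain_idx]
```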
And so what we find, for example, let me give you an example of AI being used for AI or computer vision being used for AI training.
Self-driving cars, that's one of the areas where machine learning is being used a lot.
And you want to deal with a specific scenario, and usually it's an area where different roads or traffic
conditions come together, and you are now trying to figure out, you know, how
can I find more samples of data where there are scenes with crosswalks,
and there is a pedestrian on the crosswalk, but there's also a pedestrian
out of the crosswalk. And
how do you find them? Well, people are actually applying object detection models and saying,
show me areas which have these kinds of scenarios. So now you have this little model
running through, spidering across your data sets. And you might say, find it across, you know, these streets in Munich,
between these dates on these models of the car with this set of sensors.
So now this is how you can apply, you know,
these kinds of machine learning models to select the right kind of data.
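[Editor's note: as a rough illustration of this "model spidering across your data sets" pattern, the sketch below filters a capture catalog by metadata such as city, date range, and sensor, and only then runs a detector to keep frames containing both a crosswalk and a pedestrian. The catalog schema and the detect() helper are assumptions made up for the example.]

```python
# Hypothetical sketch: cheap metadata filters first, expensive object
# detection second, to find "pedestrian near a crosswalk" scenes.
from datetime import date

def find_candidate_frames(catalog, detect, city="Munich",
                          start=date(2022, 1, 1), end=date(2022, 3, 1),
                          sensor="front_center_camera"):
    hits = []
    for record in catalog:
        # Metadata filter: right city, right sensor, right time window.
        if record["city"] != city or record["sensor"] != sensor:
            continue
        if not (start <= record["captured_on"] <= end):
            continue
        # Run the detector only on frames that survived the metadata filter.
        labels = {d["label"] for d in detect(record["frame_path"])}
        if {"crosswalk", "pedestrian"} <= labels:
            hits.append(record["frame_path"])
    return hits
```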
So to answer your question, Frederic,
we are currently starting with a more
user-guided notion, and that's where it's important that the data scientists can state what is
relevant to me today. It's also creating an interesting situation where nobody's willing
to throw away data because you never know that what is not relevant today might suddenly become
relevant tomorrow. That's besides the fact that there might be
regulatory aspects where you cannot just delete it, you know,
once you've collected that data,
then you need to kind of keep it and have it be auditable.
But interestingly, we kind of thought
that we'll call it filtering
because filtering has built into it the notion of
you can then throw away the data,
but filtering was used more as prioritizing or sorting and
saying this is the data I need now and this rest of the data can follow later.
As in can come in slowly or it can go into cold tiers.
So Steve, going back to your storage background, this is a lot of tiering but
with a different kind of policy other than
the hotness of the data or the frequency of access, the aging of data, et cetera. Here
you're looking at relevance of the data.
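[Editor's note: a toy sketch of "filtering as prioritizing": nothing is deleted, each frame just gets a relevance score and a tier, so the relevant slice moves first and the rest trails into colder storage. The scoring function and tier names are invented for illustration.]

```python
# Hypothetical relevance-based tiering: order by relevance, keep everything.
def assign_tier(relevance):
    """Map a relevance score in [0, 1] to a storage/transfer tier."""
    if relevance >= 0.8:
        return "hot"    # transfer now, keep on fast storage
    if relevance >= 0.3:
        return "warm"   # transfer later, standard storage
    return "cold"       # keep for audit/compliance in archive storage

def prioritize(frames, score_fn):
    """Return (frame, tier) pairs sorted so the most relevant come first."""
    scored = [(score_fn(f), f) for f in frames]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(frame, assign_tier(score)) for score, frame in scored]
```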
So I'm interested in a question that gets right at this whole idea of data quality,
and that is quite simply, how do you define quality?
If you're trying to have quality data,
what even is quality?
Do we have a definition?
So again, it's in the eye of the beholder,
partly because, and mind you,
some of the bias from my experience will come through: we tend to find the
DataOps challenges to be exacerbated in the automotive ADAS/AV space,
autonomous vehicles and, you know, highly assisted driving kinds of
scenarios, and there what we have is a lot of different kinds of data.
So you have different kinds of cameras,
you have multiple cameras, you may have a LiDAR,
you have different kinds of radar,
you have your CAN bus data, you have GPS,
your positional data, et cetera, et cetera.
And all of that data needs to be collected.
And with each modality of each sensor comes a different notion of quality. So here I'm going to talk more about physical quality. Maybe if I use the example of a camera, you might say that an image with a lot of glare in it is bad quality. An image which is obscured because
some bug flew into the camera is bad quality. So that's one aspect of quality. LIDARs will have
their own version of it, radars, and so on and so on. So there's that aspect of quality. And then there is the other
aspect of, I guess, relevance is what we would say that you see a lot of empty roads, or you
have situations where the road is straight and nothing much is happening. Not really interesting
data. So how can I find these interesting bits of data? You're driving in California on I-5 at 80 miles an hour, nothing is happening, there are only cows alongside the road. So that's more the relevance side than the physical quality, but I got the feeling that, Steve,
you were referring to that also when you said quality.
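[Editor's note: to make the "physical quality" side concrete, here is a hedged example of one camera-specific check: flagging a frame as glare-damaged when too many pixels are blown out. The thresholds are invented for illustration; each modality, LiDAR, radar, and so on, would need its own checks.]

```python
# Hypothetical per-frame camera quality check: measure the fraction of
# near-saturated pixels and flag heavy glare. Thresholds are illustrative.
import numpy as np
from PIL import Image

def glare_score(image_path):
    """Fraction of pixels that are close to fully saturated."""
    gray = np.asarray(Image.open(image_path).convert("L"), dtype=np.uint8)
    return float((gray >= 250).mean())

def is_glare_damaged(image_path, threshold=0.25):
    # Treat a frame as glare-damaged if a quarter of it is blown out.
    return glare_score(image_path) > threshold
```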
Yeah. And this actually leads me to sort of the corollary of that. And this is also a concern,
especially with machine learning applications. And that is that by rejecting
and scrubbing data, we could actually be rejecting the most valuable data. So as you mentioned, imagine if we decided that, oh, a bug flew in front of the camera, so we throw that frame away, but maybe it wasn't a bug at all, and now the model is not going to know how to deal with coyotes. And if you reject
pedestrians that aren't in the crosswalk, or if you reject cars driving the wrong way on the highway,
you know, these may seem like poor quality, but those are actually extremely valuable data sources.
Absolutely. And again, if we are just talking about that one camera, that one sensor, which could be a camera or just a log, and then, you know, something gets garbled, you may lose that sample. But in the automotive use case, you have potentially the benefit
of having multiple overlapping modalities.
So what got missed by the front camera, center camera might have been captured by the front
right camera.
And so you can still get that with a slightly different perspective.
It might have
been captured by the LiDAR, et cetera. So that thing can compensate. But yes, if you just take
that one sensor and you miss that, then you might have kind of lost that one sample of that one
class of objects or use cases that you are looking for. And we also have to, of course, consider the quality of the data
in terms of what we didn't capture.
If everything you've got is industrial automation systems that are working
and you never capture the ones that are failing,
then that's not going to be a very good model either.
So yeah, there's all sorts of things that go into this question of data quality.
And this is something that is kind of very much experienced by practitioners because
they realize that they cannot be driving around waiting for that coyote to cross I-5.
So now there is an increase in people who are providing synthetic data generation, people who are providing ways of
driving on a boring stretch of I-5 and then making a coyote run across.
And you can choose. You can have a coyote run across, you can have chickens walk across,
you can have a duck waddle across. Or you can actually have a stroller roll across the highway.
So that's kind of getting to really interesting things.
But it's an interesting problem
because it's not just a matter of what we as humans perceive,
but for machine learning to work,
it has to be what the camera has seen.
So you have to go beyond just kind of saying,
hey, this is a Unity kind of a generated scene
where, like in a gaming system, I make this coyote cross the road, but now it has to be
faithful to what the camera would have seen. Otherwise, my training won't, you know, actually
give the right kind of results. And so there are people, I would say our partners, who can then create these kinds of artificial scenes to extend the scenes that you've collected with your real-life driving. Because people quote, you know, numbers like you have to be driving 11 billion miles or something like that.
There's no way you're going to be doing that.
So you need these kinds of scenarios.
And you can actually then, you know, conjure up a scenario and then have your model, you
know, subject your model to it.
Yeah, this topic brings a lot of flashbacks.
I mean, kind of a funny note.
I mean, when you talk about data quality, the one thing I learned is somebody's trash
is somebody else's treasure, right?
And for that particular reason,
that brings me to another topic you brought up earlier,
is people don't throw away data
because they never know if the trash is going to become treasure
or if the treasure is going to become trash.
So they just keep the storage around forever.
So going back to DataOps, DataOps is a methodology.
Can you talk a little bit about the tools that people are using today in DataOps?
Is there a trend?
Are there to-dos and things not to do regarding DataOps as far as tools are concerned?
Yeah. So usually, to use your term, Frederic, you call it a data pipeline. So people set up
the data pipeline and usually it starts off simply as, let me ingest the data, let me transfer it to
my data center or my cloud, and then I will
process it. And what that processing is depends upon what you want to do. And some of that can
be standardized, like my images from this camera have to meet this kind of attributes to be useful.
And then they have to get sliced into tiles and so on. So then that becomes your data pipeline.
But now as you start rolling it out in different parts of the world,
and you've got some tests going on in Germany and in China,
now you have to deal with very different sets of data.
Now you also have to deal with data transfers
because you might be actually based in the
US and you will have issues with transferring data across from China to the US, et cetera,
from GDPR perspectives from these different parts.
So slowly this thing kind of starts, you know, bloating up and then it's typically a scaling
problem.
But essentially at the, let's say at the lowest level, it's a matter of ingesting data, storing data, analyzing data.
And then, as with any case, we've seen that really what helps us understand data is the metadata.
And the metadata is usually a tiny percentage of the bulk of the data, especially when it comes to, you know, complex data like images, LiDAR, etc.
And it's some sequence of actions that gives you that metadata.
So you will use some compute, you will use some kind of a compute stack.
You might be using Spark. People used to use Hadoop.
You might use computer vision algorithms. Then it's a matter of stitching together some kind of orchestration: you have to do action one, followed by action two, followed by action three. If there's an error, do this. Then you have tools that come in, like Airflow. And so on, the stack keeps on growing, the data keeps on growing. Then
you have to kind of manage your storage costs. You have to see what part you have to egress.
Now you can start running into cloud economics: oh, my hardware
simulators are outside and I'm pulling out maybe 10 terabytes of data.
So all of these things start getting more complex and growing.
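[Editor's note: a minimal sketch of the "action one, then action two, then action three" orchestration described above, written as an Airflow DAG since Airflow was mentioned. The task bodies are stubs and the DAG id and schedule are assumptions; this only shows the shape of such a pipeline, not any particular vendor's implementation.]

```python
# Hedged sketch of a DataOps pipeline skeleton as an Airflow DAG:
# ingest -> extract metadata -> filter/prioritize. Task bodies are stubs.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(**_):
    pass  # pull raw sensor data in from the edge / test fleet

def extract_metadata(**_):
    pass  # compute the small metadata that describes the bulky data

def filter_relevant(**_):
    pass  # keep or prioritize only frames matching current interests

with DAG(
    dag_id="dataops_pipeline_sketch",
    start_date=datetime(2022, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_meta = PythonOperator(task_id="extract_metadata", python_callable=extract_metadata)
    t_filter = PythonOperator(task_id="filter_relevant", python_callable=filter_relevant)
    t_ingest >> t_meta >> t_filter  # action one, then two, then three
```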
So how do we manage that pipeline? How can we make sure that it can scale from my three cars that I have running around
in Detroit, to maybe 300 cars across 10 different locations while I'm testing, to 3 million cars,
which are going to be launched in 2022, 2023, or have already been launched.
And obviously here, needless to say, I think Tesla has been, you know, leading the way.
But of course, they are not sharing all their secrets.
But I think they do talk about how they are doing things, saying like now others can go ahead and do it.
And if you look at it, that kind of data pipeline in Tesla's case has gone inside the car because your data
scientists can say, here is a scenario I'm interested in, please tag these
things if any of my million cars sees that. Well, it could be a smaller set,
I mean, not every car is going to be part of that, but they have a fleet of a
million cars, so that gives them a nice set of agents on the road, Agent Smiths, if you will.
And so a stack that combines storage, data processing.
Data processing could even mean something as simple as converting a proprietary format into a standard format, say a vendor-specific image format into MP4 or
H.264, and understanding that. So maybe it's an object detector, maybe it's a classifier,
doing some fusion, saying images from this camera at these locations with the sun at 12 o'clock,
so checking the timestamp maybe. All of those things need to be put together.
So one of the things we are seeing is that people fail to appreciate the difficulties of scaling.
So scaling is in two ways. One is scaling your pipeline vertically, so making it more complex.
Scaling your pipeline horizontally, letting it stretch across from test centers into the cars, into your data centers, into the cloud, being able to move things, being able to evolve it.
Hey, I mean, if you're writing any code, you better be ready for DevSecOps.
You need to be able to integrate with your enterprise IAM systems. You have to be able to manage it with how you're doing DevOps, et cetera. And so that's the infrastructure.
But then you have this other more meaty part, which is if I'm generating a lot of data, which of that data is relevant and that
matches, say, my set of interests? And so to your point about one man's trash is another man's
treasure, if you happen to be doing one kind of scene recognition, whereas your colleague is doing
lanes, you're both looking at the same image in a different way
and saying that is useless.
I don't have any lanes in this case.
But then for you, that's the one where there is,
I don't know, a moose crossing a fire road.
So there you have it.
So did that answer your question
or did I cover the topics there?
No, you did. I was just trying to make sure that people understood the different scaling methodologies. Do you see any particular methodology that stands out?
The approach we've tried to take at Akridata, at least, is that
there will be an existing infrastructure that needs to get absorbed into something else. So
we've tried to keep it like a framework, a fabric. And I think that is a good way to do this,
because even within companies, especially large companies that have different entities across the world, we see that they have different set of tools, different sets of infrastructure choices.
And then you need to blend them all together.
So having this notion of flexibility is an important thing so that your Japan entity can work and exchange data with your
US entity and both of them can do that with their European entity. What we haven't seen addressed
is this: okay, let's say we managed to get control of the petabyte a month
per project that I'm generating and I've somehow stored it.
What next?
Which thousand, 10,000 images do I choose
out of this petabyte a month?
That's like a 0.01%.
Which 0.01% do I actually go find within that?
So that is another area where we've been focusing on.
So this goes, in some ways, from helping the data science and the IT teams take control
to helping the data scientists move closer to training.
I mean, remember, we are still at the data level.
The next level is you select your data set that you'll use for training.
Then you chop up that data set into a training set and a testing set and a validation set,
et cetera.
So what is going to be your training set?
What do I need to, what will give me the best coverage of possible classes or use cases?
And just doing training is also an expensive effort.
So can I optimize it to find the smallest test set that covers all the possibilities and gives good coverage?
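[Editor's note: one way to picture "the smallest set that still gives good coverage" is a greedy, set-cover style selection: keep adding the clip that contributes the most not-yet-covered classes until everything is covered. This is a toy illustration under that assumption, not how Akridata or anyone else necessarily does it.]

```python
# Toy greedy selection: pick few clips that together cover every class label.
def greedy_coverage(clips):
    """clips: dict mapping clip_id -> set of class labels seen in that clip."""
    needed = set().union(*clips.values())
    chosen, covered = [], set()
    while covered != needed:
        # Pick the clip that adds the most classes we have not covered yet.
        best = max(clips, key=lambda c: len(clips[c] - covered))
        if not clips[best] - covered:
            break  # safety: nothing new left to add
        chosen.append(best)
        covered |= clips[best]
    return chosen

# Usage: each clip contributes something unique, so all three are selected.
clips = {
    "clip_a": {"pedestrian", "crosswalk"},
    "clip_b": {"cyclist"},
    "clip_c": {"coyote", "pedestrian"},
}
print(greedy_coverage(clips))  # ['clip_a', 'clip_b', 'clip_c']
```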
So we're seeing that people really haven't found a way to deal with that. That notion of how do I explore my
data efficiently from a gross level of filtering to a very fine level of
filtering, saying, here are the 10,000 clips that I need, out of these
many, many hours from, you know, different cars. And that's a really good point, because especially now we're seeing an increasing size of data
sets and data models.
And we're realizing that you can't just take all the data.
You have to have a training set, you have to validate it, and you have to make sure
that it works in the real world. So I guess to sort of sum up here, Sunil,
how would you define DataOps overall
as distinct from MLOps?
Yeah, so MLOps, where there's been a lot of focus,
is more focused on somehow finding the data,
but then finding the right set of algorithms to apply to that, to go through the iterations of training, to then insert it into an application, to deploy it,
and then keep it running and monitoring it. So that to me kind of forms at the highest level the elements of
MLOps. But starting from there, collecting the data, which might be spread across the world or
across different kinds of, let's call it edge locations, making a determination of which data
is most useful and most important that I can then transfer
in a timely way. Being able to transfer it in a timely way, and this is another funny thing we see
because people just think that, oh, if I have so much data, I need to have these 10 gigabit links
or 40 gigabit links, which of course makes the networking providers happy, but it's not something
that is feasible for every project. So if you knew that out of
the 10 petabytes that you've collected, it is the 10 gigabytes, which is the most important,
that you could send on maybe a Comcast link, but then send the rest of it over FedEx.
Just ship out those disks. The ability to do that; as we talked about, data quality, just raw data quality; being able to extract metadata; being able to manage access control of that data. You might also want to explore it at a higher level of abstraction, like saying, here is a scenario that I want to find.
And it might not be.
You're looking for scenes with ambulances, and you haven't defined an ambulance.
So how do you now find those things?
Being able to version, because remember, now your data is becoming code.
So now you're going to have to version the data that you've used for training a certain, you know,
a certain version of the model, managing the storage.
So which data stays in your hot tier, which data stays in your archive tier, et cetera.
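[Editor's note: since "your data is becoming code", here is a minimal, hypothetical sketch of versioning a training set the way source code is versioned: record a content hash for every file along with the model version it fed, so a given model can be traced back to its exact data. The file layout and manifest format are assumptions for illustration.]

```python
# Hypothetical dataset-versioning sketch: hash every file in a training set
# and store the manifest alongside the model version it was used to train.
import hashlib
import json
from pathlib import Path

def hash_file(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(data_dir, model_version, out_path="dataset_manifest.json"):
    files = {
        str(p): hash_file(p)
        for p in sorted(Path(data_dir).rglob("*")) if p.is_file()
    }
    manifest = {"model_version": model_version, "files": files}
    Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```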
And being able to do this in a DevOps fashion,
which means that your data pipelines have to be
evolvable, that they will keep on changing, that your metadata catalog may keep on evolving.
How do I keep this all in sync? While also don't forget the compliance folks, they're going to be
saying, make sure that any data that goes from here to the US needs to be scrubbed off all faces and number plates
and all PII, something like that.
So all of that aspect, which serves, in a way,
the MLOps model training aspects,
is what we call DataOps.
And I know that we might have been maybe co-opting
another enterprise term, but this is what we mean
on the data-centric AI, DataOps side of things. Well, thanks a lot. I think that's a great
summary of an introduction to DataOps and the challenges related to it. So we've reached the
time in our podcast where we sort of shift gears here, and we're going to surprise our guest with
three questions that he has not been prepared for, to have a little bit of fun and get a
little bit of futures involved in the show. So let's go ahead and start off. Frederic, you want
to ask yours first?
Yep. So my question for you is, do you think we should expect another AI winter?
AI winter?
Yes.
Quite the contrary.
If anything, we are seeing what I would think of as an AI summer.
That would be my view based upon...
Now, of course, it depends upon what you mean by AI.
If you're meaning AI as in Skynet,
maybe that's a different thing,
but look at where things are.
By the way, talking about Skynet,
given, I don't know if this is okay to say that,
but given the current geopolitical situation,
we might see the emergence of Skynet.
But that aside, there are so many other ways in
which I think AI is here to stay. And I think it's quite the opposite. I don't see this as an AI winter.
Well, given your background in vehicle AI, I just have to ask, when will we see a full self-driving car that can
drive anywhere at any time on its own, level five?
Level five, the only pragmatic way it will come out is within very controlled environments. So I do expect that we will define
certain, you know, corralled situations where you can drive. So people have tried doing things with
deliveries. Nuro, for example, we are based here in Los Altos, California, and then I see the Nuro cars mapping the streets.
So something like that, where there is a controlled setting, will come first. But to just turn your driver's seat around and take it anywhere?
I don't know.
As in, I am tempted to make a guess, but really, there are so many variables. It is a testament to evolution that we have come to this point, that we can drive almost anywhere.
And it's taking cars that should have been, I guess, self-driving in 2020, you know, more time to get there.
Well, now, as promised, we're using a question from a previous podcast guest.
So this one comes from Mike O'Malley, the SVP of Marketing and Sales at Seneca Global.
Mike, take it away.
This is Mike O'Malley, SVP of Marketing and Sales for Seneca Global.
And my question is, can you give an example where an AI algorithm went terribly wrong and gave a result that clearly wasn't correct. I'd love to hear that.
Would a rogue bot, as in the one from Microsoft, qualify? I mean, when the bot goes
amok and kind of starts telling you to, you know, go commit suicide or do something really
nasty, or starts using really bad language, I would say that qualifies. And that comes
from the use of indiscriminate data for training. So I will
tie that back to, you know, bad curation of data for training.
That's a data quality issue.
Well, thanks so much for joining the discussion.
We look forward to hearing what your question might be for a future guest.
And if our listeners want to join in, just send an email to host at utilizing-ai.com and we'll record your question.
So Sunil, it's been great having you here. Thank you so much for taking the time to join us on the podcast.
Where can people connect with you and follow your thoughts?
And is there anything interesting coming up in your life?
Yeah, you can connect with me on LinkedIn, Sunil Samel.
Or if you would like to drop me a line, that's sunil.samel@akridata.com.
That's A-K-R-I-data.com.
In terms of, you know, getting more information about us and how we look at or handle DataOps for very large
scale machine learning, we will have a session with our partners
Microsoft Azure at NVIDIA GTC. So hopefully you will have seen it, but we can always share the
recordings of that or point you to that material. So I look forward to people reaching out. Frederic, go ahead.
Yeah.
So I'm actively designing and deploying large scale GPU clusters for
customers. And I'm also working on a startup around data management.
You can find me on LinkedIn and Twitter as @FredericVHaren.
And as for me,
you can find me here at the podcast, or on social media at @SFoskett.
And also, of course, I'm working on our AI Field Day event, which is coming up May 18th through 20th.
We would love to have you join us as a presenter, as a delegate, or just in the audience.
So please do check out techfieldday.com to learn more about that.
So thanks for listening to the podcast. If you enjoyed this discussion, please do subscribe in your favorite podcast platform and
give us a rating or review wherever that is. This podcast is brought to you by gestaltit.com,
your home for IT coverage from across the enterprise. For show notes and more episodes,
go to utilizing-ai.com or follow us on Twitter at utilizing underscore AI.
Thanks for listening and we'll see you next time.