Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x26: DataOps - Putting the Data in Data Science
Episode Date: March 29, 2022
The quality of an AI application depends on the quality of the data that feeds it. Sunil Samel joins Frederic Van Haren and Stephen Foskett to discuss DataOps and the importance of data quality. When we consider data-centric AI, we must consider all aspects of the data pipeline, from storing, transporting, and understanding to controlling access and cost. We must look at the data needed to train our models, think about the desired outcomes, and consider the sources and pipeline needed to get that result. We must also decide how to define quality: Do we need a variety of data sources? Should we reject some data? How does the modality of the data type change this definition? Is there bias in what is included and excluded? Data pipelines are usually simple, ingesting and storing data from the source, slicing and preparing it, and presenting it for processing. But DataOps recognizes that the data pipeline can get very complicated and requires understanding of all these steps as well as adaptation from development to production.
Three Questions:
Frederic: Do you think we should expect another AI winter?
Stephen: When will we see a full self-driving car that can drive anywhere, any time?
Mike O'Malley, Seneca Global: Can you give an example where an AI algorithm went terribly wrong and gave a result that clearly wasn't correct?
Guests and Hosts:
Sunil Samel, VP of Products at Akridata. Connect with Sunil on LinkedIn or email him at sunil.samel@akridata.com.
Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.
Date: 3/29/2022 Tags: @SFoskett, @FredericVHaren
Transcript
I'm Stephen Foskett.
I'm Frederic Van Haren.
And this is the Utilizing AI podcast.
Welcome to another episode of Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, data science, and other artificial intelligence topics.
Over the years, we've spent a lot of time talking about unintended outcomes of AI.
We've talked about ethics and bias.
And one of the reasons that a lot of this stuff happens
is data quality.
And I think that that's something
that we really need to zoom in on as an industry.
Right, Frederic?
Yeah, it's all about the data quality,
not about the data quantity.
Collecting the data is one thing, but you have to send it through a pipeline to clean it, process it in order to create models.
Exactly. And that's leading to a new field to go along with MLOps, DataOps.
And I think that that's all about the idea of quality, of having a quality pipeline.
And it doesn't just mean the quality of the underlying data.
It means the quality of everything that leads into the model.
And that's why we've decided to invite Sunil Samel here to join us
to talk about DataOps and data quality.
Nice to have you, Sunil.
Thank you, Steve.
And thanks to you, Frederic, for introducing me to Steve. It's my pleasure to be here.
My name is Sunil Samel.
I am with a company called Akridata.
But I started my journey many, many years back.
I've been a bit of a startup nomad.
My first startup was out of a research institute in Belgium called IMEC.
And that was in designing tools for designing systems on chip,
after which I did a fabless semiconductor company,
after which it was a bit of high-performance storage,
then a stint in application security. And here we are in what we think
is a rather neglected side of AI and machine learning, which is
how do you manage your data so that you can train your models better?
And when we talk about data, I think it's important to draw a distinction.
Now, I've mentioned before, including last week, that I'm a big storage nerd, and I'm
really interested in how data is stored and organized and presented.
That's not what we mean here in terms of data ops, right?
We're talking data, which is the next level up from storage, right?
Absolutely.
Actually, you made a very, very good point.
A lot of times when it comes to data, people confound it with storage and maybe networking. It's about how do I collect data and
transfer it to someplace where I can do something with it. But that's not really what we mean by
data. I mean, if anything, with machine learning, with AI, the data is what informs the machine
learning model. And so selecting the right data is an important thing.
So whether it is quality or an even more kind of obscure notion of relevance, is this data more
meaningful for us? When people talk about long tails or rare scenarios, what is that? And how
do I find that out of the big data that I'm collecting?
And to both your points, it's not about the big data, it's about the right data.
And so there has been a recognition of that in recent times. And folks like Professor Andrew
Ng have come up with the term data-centric AI. So it's about storing the data, it's about transporting the data, it's about understanding it, finding the right bits of data.
It's about in the days of privacy, we have to be careful about who accesses it and are they authorized to access it?
The data is being collected in different locations, all of these aspects.
And then, of course, there's this overriding thing about cost.
So just like an elephant's memory, data keeps on getting collected,
but somebody's got to pay for it.
So yeah, when we are talking about data ops,
that's what we are talking about.
It's not just about storage.
It's not just about networking.
It's interesting that you're talking about data-centric AI
because in my book, AI means it's all about data. So it's almost like
saying it twice. The reason why Professor
Andrew Ng brought it up, is that because he wanted to accentuate
the DataOps part of it?
I think I would say that he came to it
based upon his own experiences, which are vast.
And I think he's been recognizing
that it's not about indiscriminately training models
on a lot of data.
And that's one way to get bias in there.
I mean, we have these famous examples,
unfortunate examples of bots going rogue, where they kind of learn from everything on the internet
and then start becoming racist. So what is important then is to curate that data properly
so that you can find, you know, to give that even handedness to the training.
So when you look at that,
it's about selecting the data to train the models.
So that's where the data-centricness comes in.
Look at the data that you need to train your models,
feed them the right kinds of problem scenarios,
long tail scenarios, the scenarios from which you can learn,
and then you get the right kind of results.
So do you think that DataOps can be self-healing in the sense that it can recognize when there
is bad data, when there is ethical complications involved or something like that?
Yeah, this is an interesting point.
Actually, in some ways, you might say, can we apply AI to AI training?
And there are movements in that.
So where we are today is really allowing the people who are responsible for training models
or retraining models, improving models, giving them a nice framework
and infrastructure with which to collect data, understand it, select the right bits, manage it,
understand the cost implications, et cetera, et cetera. So there is a bit of a user-guided aspect
to it. But then along with this, there are aspects like active learning.
So basically the model participates in the process of recognizing which areas need improvement.
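[Editor's note: to make the active learning idea concrete, here is a minimal uncertainty-sampling sketch, where the current model scores unlabeled frames and the least confident ones are queued for labeling and retraining. The model interface and data layout are illustrative assumptions, not a description of Akridata's product.]

```python
# Hypothetical uncertainty-sampling sketch: the model itself helps pick
# which frames need labeling by flagging the ones it is least sure about.
import numpy as np

def select_for_labeling(model, unlabeled_frames, budget=100):
    """Return the frames the current model is least confident about."""
    # Assumes an sklearn-style model: predict_proba -> (n_samples, n_classes)
    probs = model.predict_proba(unlabeled_frames)
    confidence = probs.max(axis=1)                    # top-class probability per frame
    uncertain_idx = np.argsort(confidence)[:budget]   # lowest confidence first
    return [unlabeled_frames[i] for i in uncertain_idx]
```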
And so what we find, for example, let me give you an example of AI being used for AI or computer vision being used for AI training.
Self-driving cars, that's one of the areas where machine learning is being used a lot.
And you want to deal with a specific scenario, and usually it's an area where different roads or traffic
conditions come together, and you are now trying to figure out, you know, how
can I find more samples of data where there are scenes with crosswalks,
and there is a pedestrian on the crosswalk, but there's also a pedestrian
out of the crosswalk. And
how do you find them? Well, people are actually applying object detection models and saying,
show me areas which have these kinds of scenarios. So now you have this little model
running through, spidering across your data sets. And you might say, find it across, you know, these streets in Munich,
between these dates on these models of the car with this set of sensors.
So now this is how you can apply, you know,
these kinds of machine learning models to select the right kind of data.
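[Editor's note: as a rough illustration of this "model spidering across your data sets" pattern, the sketch below filters a capture catalog by metadata such as city, date range, and sensor, and only then runs a detector to keep frames containing both a crosswalk and a pedestrian. The catalog schema and the detect() helper are assumptions made up for the example.]

```python
# Hypothetical sketch: cheap metadata filters first, expensive object
# detection second, to find "pedestrian near a crosswalk" scenes.
from datetime import date

def find_candidate_frames(catalog, detect, city="Munich",
                          start=date(2022, 1, 1), end=date(2022, 3, 1),
                          sensor="front_center_camera"):
    hits = []
    for record in catalog:
        # Metadata filter: right city, right sensor, right time window.
        if record["city"] != city or record["sensor"] != sensor:
            continue
        if not (start <= record["captured_on"] <= end):
            continue
        # Run the detector only on frames that survived the metadata filter.
        labels = {d["label"] for d in detect(record["frame_path"])}
        if {"crosswalk", "pedestrian"} <= labels:
            hits.append(record["frame_path"])
    return hits
```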
So to answer your question, Frederic,
we are currently starting with a more
user-guided notion, and that's where it's important that the data scientists can state what is
relevant to me today. It's also creating an interesting situation where nobody's willing
to throw away data because you never know that what is not relevant today might suddenly become
relevant tomorrow. That's besides the fact that there might be
regulatory aspects where you cannot just delete it, you know,
once you've collected that data,
then you need to kind of keep it and have it be auditable.
But interestingly, we kind of thought
that we'll call it filtering
because filtering has built into it the notion of
you can then throw away the data,
but filtering was used more as prioritizing or sorting and
saying this is the data I need now and this rest of the data can follow later.
As in can come in slowly or it can go into cold tiers.
So Steve, going back to your storage background, this is a lot of tiering but
with a different kind of policy other than
the hotness of the data or the frequency of access, the aging of data, et cetera. Here
you're looking at relevance of the data.
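[Editor's note: a toy sketch of "filtering as prioritizing": nothing is deleted, each frame just gets a relevance score and a tier, so the relevant slice moves first and the rest trails into colder storage. The scoring function and tier names are invented for illustration.]

```python
# Hypothetical relevance-based tiering: order by relevance, keep everything.
def assign_tier(relevance):
    """Map a relevance score in [0, 1] to a storage/transfer tier."""
    if relevance >= 0.8:
        return "hot"    # transfer now, keep on fast storage
    if relevance >= 0.3:
        return "warm"   # transfer later, standard storage
    return "cold"       # keep for audit/compliance in archive storage

def prioritize(frames, score_fn):
    """Return (frame, tier) pairs sorted so the most relevant come first."""
    scored = [(score_fn(f), f) for f in frames]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(frame, assign_tier(score)) for score, frame in scored]
```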
So I'm interested in a question that gets right at this whole idea of data quality,
and that is quite simply, how do you define quality?
If you're trying to have quality data,
what even is quality?
Do we have a definition?
So again, it's in the eye of the beholder,
partly because, and mind you,
some of the bias from my experience will come through: we tend to find the
DataOps challenges to be exacerbated in the automotive ADAS/AV space,
autonomous vehicles and, you know, highly assisted driving kinds of
scenarios, and there what we have is a lot of different kinds of data.
So you have different kinds of cameras,
you have multiple cameras, you may have a LiDAR,
you have different kinds of radar,
you have your CAN bus data, you have GPS,
your positional data, et cetera, et cetera.
And all of that data needs to be collected.
And with each modality of each sensor comes a different notion of quality. So here I'm going to talk more about physical quality. Maybe if I use the example of a camera, you might say that an image with a lot of glare in it is bad quality. An image which is obscured because
some bug flew into the camera is bad quality. So that's one aspect of quality. LIDARs will have
their own version of it, radars, and so on and so on. So there's that aspect of quality. And then there is the other
aspect of, I guess, relevance is what we would say that you see a lot of empty roads, or you
have situations where the road is straight and nothing much is happening. Not really interesting
data. So how can I find these interesting bits of data? You're driving in California on I-5 at 80 miles an hour, nothing is happening, there are only cows alongside the road. So that's more the relevance side than the physical quality, but I got the feeling that, Steve,
you were referring to that also when you said quality.
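[Editor's note: to make the "physical quality" side concrete, here is a hedged example of one camera-specific check: flagging a frame as glare-damaged when too many pixels are blown out. The thresholds are invented for illustration; each modality, LiDAR, radar, and so on, would need its own checks.]

```python
# Hypothetical per-frame camera quality check: measure the fraction of
# near-saturated pixels and flag heavy glare. Thresholds are illustrative.
import numpy as np
from PIL import Image

def glare_score(image_path):
    """Fraction of pixels that are close to fully saturated."""
    gray = np.asarray(Image.open(image_path).convert("L"), dtype=np.uint8)
    return float((gray >= 250).mean())

def is_glare_damaged(image_path, threshold=0.25):
    # Treat a frame as glare-damaged if a quarter of it is blown out.
    return glare_score(image_path) > threshold
```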
Yeah. And this actually leads me to sort of the corollary of that. And this is also a concern,
especially with machine learning applications. And that is that by rejecting
and scrubbing data, we could actually be rejecting the most valuable data. So as you mentioned, imagine if we decided that, oh, a bug flew in front of the camera, so we throw that frame away, but maybe it wasn't a bug at all, and now the model is not going to know how to deal with coyotes. And if you reject
pedestrians that aren't in the crosswalk, or if you reject cars driving the wrong way on the highway,
you know, these may seem like poor quality, but those are actually extremely valuable data sources.
Absolutely. And again, if we are just talking about that one camera, that one sensor, which could be a camera or just a log, and then, you know, something gets garbled, you may lose that sample. But in the automotive use case, you have potentially the benefit
of having multiple overlapping modalities.
So what got missed by the front camera, center camera might have been captured by the front
right camera.
And so you can still get that with a slightly different perspective.
It might have
been captured by the LiDAR, et cetera. So that thing can compensate. But yes, if you just take
that one sensor and you miss that, then you might have kind of lost that one sample of that one
class of objects or use cases that you are looking for. And we also have to, of course, consider the quality of the data
in terms of what we didn't capture.
If everything you've got is industrial automation systems that are working
and you never capture the ones that are failing,
then that's not going to be a very good model either.
So yeah, there's all sorts of things that go into this question of data quality.
And this is something that is kind of very much experienced by practitioners because
they realize that they cannot be driving around waiting for that coyote to cross I-5.
So now there is an increase in people who are providing synthetic data generation, people who are providing ways of
driving on a boring stretch of I-5 and then making a coyote run across.
And you can choose. You can have a coyote run across, you can have chickens walk across,
you can have a duck waddle across. Or you can actually have a stroller roll across the highway.
So that's kind of getting to really interesting things.
But it's an interesting problem
because it's not just a matter of what we as humans perceive,
but for machine learning to work,
it has to be what the camera has seen.
So you have to go beyond just kind of saying,
hey, this is a Unity kind of a generated scene
where, like in a gaming system, I make this coyote cross the road, but now it has to be
faithful to what the camera would have seen. Otherwise, my training won't, you know, actually
give the right kind of results. And so there are people, I would say our partners, who can then create these kinds of artificial scenes to extend the scenes that you've collected with your real-life driving. Because people quote, you know, numbers like you have to be driving 11 billion miles or something like that.
There's no way you're going to be doing that.
So you need these kinds of scenarios.
And you can actually then, you know, conjure up a scenario and then have your model, you
know, subject your model to it.
Yeah, this topic brings a lot of flashbacks.
I mean, kind of a funny note.
I mean, when you talk about data quality, the one thing I learned is somebody's trash
is somebody else's treasure, right?
And for that particular reason,
that brings me to another topic you brought up earlier,
is people don't throw away data
because they never know if the trash is going to become treasure
or if the treasure is going to become trash.
So they just keep the storage around forever.
So going back to DataOps, DataOps is a methodology.
Can you talk a little bit about the tools that people are using today in DataOps?
Is there a trend?
Are there to-dos and things not to do regarding DataOps as far as tools are concerned?
Yeah. So usually, to use your term, Frederic, you call it a data pipeline. So people set up
the data pipeline and usually it starts off simply as, let me ingest the data, let me transfer it to
my data center or my cloud, and then I will
process it. And what that processing is depends upon what you want to do. And some of that can
be standardized, like my images from this camera have to meet this kind of attributes to be useful.
And then they have to get sliced into tiles and so on. So then that becomes your data pipeline.
But now as you start rolling it out in different parts of the world,
and you've got some tests going on in Germany and in China,
now you have to deal with very different sets of data.
Now you also have to deal with data transfers
because you might be actually based in the
US and you will have issues with transferring data across from China to the US, et cetera,
from GDPR perspectives from these different parts.
So slowly this thing kind of starts, you know, bloating up and then it's typically a scaling
problem.
But essentially at the, let's say at the lowest level, it's a matter of ingesting data, storing data, analyzing data.
And then, as with any case, we've seen that really what helps us understand data is the metadata.
And the metadata is usually a tiny percentage of the bulk of the data, especially when it comes to, you know, complex data like images, LiDAR, etc.
And it's some sequence of actions that gives you that metadata.
So you will use some compute, you will use some kind of a compute stack.
You might be using Spark. People used to use Hadoop.
You might use computer vision algorithms. Then it's a matter of stitching together some kind of orchestration: you have to do action one, followed by action two, followed by action three. If there's an error, do this. Then you have tools that come in, like Airflow. And so on, the stack keeps on growing, the data keeps on growing. Then
you have to kind of manage your storage costs. You have to see what part you have to egress.
Now you can start running into cloud economics: oh, my hardware
simulators are outside and I'm pulling out maybe 10 terabytes of data.
So all of these things start getting more complex and growing.
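[Editor's note: a minimal sketch of the "action one, then action two, then action three" orchestration described above, written as an Airflow DAG since Airflow was mentioned. The task bodies are stubs and the DAG id and schedule are assumptions; this only shows the shape of such a pipeline, not any particular vendor's implementation.]

```python
# Hedged sketch of a DataOps pipeline skeleton as an Airflow DAG:
# ingest -> extract metadata -> filter/prioritize. Task bodies are stubs.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(**_):
    pass  # pull raw sensor data in from the edge / test fleet

def extract_metadata(**_):
    pass  # compute the small metadata that describes the bulky data

def filter_relevant(**_):
    pass  # keep or prioritize only frames matching current interests

with DAG(
    dag_id="dataops_pipeline_sketch",
    start_date=datetime(2022, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_meta = PythonOperator(task_id="extract_metadata", python_callable=extract_metadata)
    t_filter = PythonOperator(task_id="filter_relevant", python_callable=filter_relevant)
    t_ingest >> t_meta >> t_filter  # action one, then two, then three
```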
So how do we manage that pipeline? How can we make sure that it can scale from my three cars that I have running around
in Detroit, to maybe 300 cars across 10 different locations while I'm testing, to 3 million cars,
which are going to be launched in 2022, 2023, or have already been launched.
And obviously here, needless to say, I think Tesla has been, you know, leading the way.
But of course, they are not sharing all their secrets.
But I think they do talk about how they are doing things, saying like now others can go ahead and do it.
And if you look at it, that kind of data pipeline in Tesla's case has gone inside the car because your data
scientists can say, here is a scenario I'm interested in, please tag these
things if any of my million cars sees that. Well, it could be a smaller set,
I mean, not every car is going to be part of that, but they have a fleet of a
million cars, so that gives them a nice set of agents on the road, Agent Smiths, if you will.
And so a stack that combines storage, data processing.
Data processing could even mean something as simple as converting a proprietary format into a standard format, say a vendor-specific image format into MP4 or
H.264, and understanding that. So maybe it's an object detector, maybe it's a classifier,
doing some fusion, saying images from this camera at these locations with the sun at 12 o'clock,
so checking the timestamp maybe. All of those things need to be put together.
So one of the things we are seeing is that people fail to appreciate the difficulties of scaling.
So scaling is in two ways. One is scaling your pipeline vertically, so making it more complex.
Scaling your pipeline horizontally, letting it stretch across from test centers into the cars, into your data centers, into the cloud, being able to move things, being able to evolve it.
Hey, I mean, if you're writing any code, you better be ready for DevSecOps.
You need to be able to integrate with your enterprise IAM systems. You have to be able to manage it with how you're doing DevOps, et cetera. And so that's the infrastructure.
But then you have this other more meaty part, which is if I'm generating a lot of data, which of that data is relevant and that
matches, say, my set of interests? And so to your point about one man's trash is another man's
treasure, if you happen to be doing one kind of scene recognition, whereas your colleague is doing
lanes, you're both looking at the same image in a different way
and saying that is useless.
I don't have any lanes in this case.
But then for you, that's the one where there is,
I don't know, a moose crossing a fire road.
So there you have it.
So did that answer your question
or did I cover the topics there?
No, you did. I was just trying to make sure that people understood the different scaling methodologies. Do you see any particular methodology that stands out?
The approach we've tried to take at Akridata, at least, is that
there will be an existing infrastructure that needs to get absorbed into something else. So
we've tried to keep it like a framework, a fabric. And I think that is a good way to do this,
because even within companies, especially large companies that have different entities across the world, we see that they have different set of tools, different sets of infrastructure choices.
And then you need to blend them all together.
So having this notion of flexibility is an important thing so that your Japan entity can work and exchange data with your
US entity and both of them can do that with their European entity. What we haven't seen addressed
is this: okay, let's say we managed to get control of the petabyte a month
per project that I'm generating and I've somehow stored it.
What next?
Which thousand, 10,000 images do I choose
out of this petabyte a month?
That's like a 0.01%.
Which 0.01% do I actually go find within that?
So that is another area where we've been focusing on.
So this goes, in some ways, from helping the data science and the IT teams take control
to helping the data scientists move closer to training.
I mean, remember, we are still at the data level.
The next level is you select your data set that you'll use for training.
Then you chop up that data set into a training set and a testing set and a validation set,
et cetera.
So what is going to be your training set?
What do I need to, what will give me the best coverage of possible classes or use cases?
And just doing training is also an expensive effort.
So can I optimize it to find the smallest test set that covers all the possibilities and gives good coverage?
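[Editor's note: one way to picture "the smallest set that still gives good coverage" is a greedy, set-cover style selection: keep adding the clip that contributes the most not-yet-covered classes until everything is covered. This is a toy illustration under that assumption, not how Akridata or anyone else necessarily does it.]

```python
# Toy greedy selection: pick few clips that together cover every class label.
def greedy_coverage(clips):
    """clips: dict mapping clip_id -> set of class labels seen in that clip."""
    needed = set().union(*clips.values())
    chosen, covered = [], set()
    while covered != needed:
        # Pick the clip that adds the most classes we have not covered yet.
        best = max(clips, key=lambda c: len(clips[c] - covered))
        if not clips[best] - covered:
            break  # safety: nothing new left to add
        chosen.append(best)
        covered |= clips[best]
    return chosen

# Usage: each clip contributes something unique, so all three are selected.
clips = {
    "clip_a": {"pedestrian", "crosswalk"},
    "clip_b": {"cyclist"},
    "clip_c": {"coyote", "pedestrian"},
}
print(greedy_coverage(clips))  # ['clip_a', 'clip_b', 'clip_c']
```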
So we're seeing that people really haven't found a way to deal with that. That notion of how do I explore my
data efficiently from a gross level of filtering to a very fine level of
filtering, saying, here are the 10,000 clips that I need, out of these
many, many hours from, you know, different cars. And that's a really good point, because especially now we're seeing an increasing size of data
sets and data models.
And we're realizing that you can't just take all the data.
You have to have a training set, you have to validate it, and you have to make sure
that it works in the real world. So I guess to sort of sum up here, Sunil,
how would you define DataOps overall
as distinct from MLOps?
Yeah, so MLOps, where there's been a lot of focus,
is more focused on somehow finding the data,
but then finding the right set of algorithms to apply to that, to go through the iterations of training, to then insert it into an application, to deploy it,
and then keep it running and monitoring it. So that to me kind of forms at the highest level the elements of
MLOps. But starting from there, collecting the data, which might be spread across the world or
across different kinds of, let's call it edge locations, making a determination of which data
is most useful and most important that I can then transfer
in a timely way. Being able to transfer it in a timely way, and this is another funny thing we see
because people just think that, oh, if I have so much data, I need to have these 10 gigabit links
or 40 gigabit links, which of course makes the networking providers happy, but it's not something
that is feasible for every project. So if you knew that out of
the 10 petabytes that you've collected, it is the 10 gigabytes, which is the most important,
that you could send on maybe a Comcast link, but then send the rest of it over FedEx.
Just ship out those disks. The ability to do that; as we talked about, data quality, just raw data quality; being able to extract metadata; being able to manage access control of that data. You might also want to explore it at a higher level of abstraction, like saying, here is a scenario that I want to find.
And it might not be.
You're looking for scenes with ambulances, and you haven't defined an ambulance.
So how do you now find those things?
Being able to version, because remember, now your data is becoming code.
So now you're going to have to version the data that you've used for training a certain, you know,
a certain version of the model, managing the storage.
So which data stays in your hot tier, which data stays in your archive tier, et cetera.
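[Editor's note: since "your data is becoming code", here is a minimal, hypothetical sketch of versioning a training set the way source code is versioned: record a content hash for every file along with the model version it fed, so a given model can be traced back to its exact data. The file layout and manifest format are assumptions for illustration.]

```python
# Hypothetical dataset-versioning sketch: hash every file in a training set
# and store the manifest alongside the model version it was used to train.
import hashlib
import json
from pathlib import Path

def hash_file(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(data_dir, model_version, out_path="dataset_manifest.json"):
    files = {
        str(p): hash_file(p)
        for p in sorted(Path(data_dir).rglob("*")) if p.is_file()
    }
    manifest = {"model_version": model_version, "files": files}
    Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```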
And being able to do this in a DevOps fashion,
which means that your data pipelines have to be
evolvable, that they will keep on changing, that your metadata catalog may keep on evolving.
How do I keep this all in sync? While also don't forget the compliance folks, they're going to be
saying, make sure that any data that goes from here to the US needs to be scrubbed off all faces and number plates
and all PII, something like that.
So all of that aspect, which serves, in a way,
the MLOps model training aspects,
is what we call DataOps.
And I know that we might have been maybe co-opting
another enterprise term, but this is what we mean
on the data-centric AI, DataOps side of things. Well, thanks a lot. I think that's a great
summary of an introduction to DataOps and the challenges related to it. So we've reached the
time in our podcast where we sort of shift gears here, and we're going to surprise our guest with
three questions that he has not been prepared for, to have a little bit of fun and get a
little bit of futures involved in the show. So let's go ahead and start off. Frederic, you want
to ask yours first?
Yep. So my question for you is, do you think we should expect another AI winter?
AI winter?
Yes.
Quite the contrary.
If anything, we are seeing what I would think of as an AI summer.
That would be my view based upon...
Now, of course, it depends upon what you mean by AI.
If you're meaning AI as in Skynet,
maybe that's a different thing,
but look at where things are.
By the way, talking about Skynet,
given, I don't know if this is okay to say that,
but given the current geopolitical situation,
we might see the emergence of Skynet.
But that aside, there are so many other ways in
which I think AI is here to stay. And I think it's quite the opposite. I don't see this as an AI winter.
Well, given your background in vehicle AI, I just have to ask, when will we see a full self-driving car that can
drive anywhere at any time on its own, level five?
Level five, the only pragmatic way it will come out is within very controlled environments. So I do expect that we will define
certain, you know, corralled situations where you can drive. So people have tried doing things with
deliveries. Nuro, for example, we are based here in Los Altos, California, and then I see the Nuro cars mapping the streets.
So something like that, where there is a controlled setting, will come first. But to just turn your driver's seat around and take it anywhere?
I don't know.
As in, I am tempted to make a guess, but really, there are so many variables. It is a testament to evolution that we have come to this point, that we can drive almost anywhere.
And it's taking cars that should have been, I guess, self-driving in 2020, you know, more time to get there.
Well, now, as promised, we're using a question from a previous podcast guest.
So this one comes from Mike O'Malley, the SVP of Marketing and Sales at Seneca Global.
Mike, take it away.
This is Mike O'Malley, SVP of Marketing and Sales for Seneca Global.
And my question is, can you give an example where an AI algorithm went terribly wrong and gave a result that clearly wasn't correct. I'd love to hear that.
Would a rogue bot, as in the one from Microsoft, qualify? I mean, when the bot goes
amok and kind of starts telling you to, you know, go commit suicide or do something really
nasty, or starts using really bad language, I would say that qualifies. And that comes
from the use of indiscriminate data for training. So I will
tie that back to, you know, bad curation of data for training.
That's a data quality issue.
Well, thanks so much for joining the discussion.
We look forward to hearing what your question might be for a future guest.
And if our listeners want to join in, just send an email to host at utilizing-ai.com and we'll record your question.
So Sunil, it's been great having you here. Thank you so much for taking the time to join us on the podcast.
Where can people connect with you and follow your thoughts?
And is there anything interesting coming up in your life?
Yeah, you can connect with me on LinkedIn, Sunil Samel.
Or if you would like to drop me a line, that's sunil.samel@akridata.com.
That's A-K-R-I-data.com.
In terms of, you know, getting more information about us and how we look at or handle DataOps for very large
scale machine learning, we will have a session with our partners
Microsoft Azure at NVIDIA GTC. So hopefully you will have seen it, but we can always share the
recordings of that or point you to that material. So I look forward to people reaching out. Frederic, go ahead.
Yeah.
So I'm actively designing and deploying large scale GPU clusters for
customers. And I'm also working on a startup around data management.
You can find me on LinkedIn and Twitter as @FredericVHaren.
And as for me,
you can find me here at the podcast, or on social media at @SFoskett.
And also, of course, I'm working on our AI Field Day event, which is coming up May 18th through 20th.
We would love to have you join us as a presenter, as a delegate, or just in the audience.
So please do check out techfieldday.com to learn more about that.
So thanks for listening to the podcast. If you enjoyed this discussion, please do subscribe in your favorite podcast platform and
give us a rating or review wherever that is. This podcast is brought to you by gestaltit.com,
your home for IT coverage from across the enterprise. For show notes and more episodes,
go to utilizing-ai.com or follow us on Twitter at utilizing underscore AI.
Thanks for listening and we'll see you next time.