Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x12: Democratizing Unstructured Data at Scale with Edward Cui of Graviti
Episode Date: November 30, 2021. Machine learning applications require massive datasets, but it can be challenging to build and store large amounts of unstructured data. In this episode of Utilizing AI, Edward Cui of Graviti discusses his creation of an open repository for unstructured data with Frederic Van Haren and Stephen Foskett. Coming from Uber's self-driving organization, Cui realized the value of data and the challenge of storing massive amounts of unstructured data, so he created the Graviti platform and made it available for free to open datasets. These datasets enable development of a variety of applications, from agriculture and environmental science to gaming and robotics. To address the challenge of data sharing and quality, Graviti is working with the Linux Foundation on the OpenBytes project. Three Questions Frederic: Are there any jobs that will be completely eliminated by AI in the next five years? Stephen: When will we see a full self-driving car that can drive anywhere, any time? Tom Hollingsworth, Gestalt IT: Can AI ever recognize that it's biased and learn how to overcome it? Guests and Hosts Edward Cui, Founder of Graviti. You can follow Graviti on LinkedIn or Twitter at @graviti_ai. You can also email Graviti at contact@graviti.com. Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren. Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett. Tags: @graviti_ai, @FredericVHaren, @SFoskett
Transcript
I'm Stephen Foskett.
I'm Frederic Van Haren.
And this is the Utilizing AI podcast.
Welcome to another episode of Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
Frederic, in the past, we've talked quite a lot about the growing sizes of AI models.
We've also talked about the growing sizes of the data sets that are underpinning those models and the challenges that people face in storing large volumes of data.
But one of the things we didn't really talk about is the challenges faced by researchers and academics and sort of non-corporate entities who are trying to create massive
unstructured data sets as well for themselves. Right. I mean, the biggest problem is that the
world is filled with unstructured data, which is obviously more difficult than dealing with
structured data. The question really is not just about collecting unstructured data, but where do you go from there? And then
additionally, AI is all about sharing. So how do you share data such that everybody benefits from
this? And I think this is a good topic to talk about and to see how you can share data, but also
how you can scale and do this efficiently. Exactly. And that's the reason that we decided to invite on our podcast,
the guest that we have today.
So I'd like to introduce the founder and machine learning expert, Edward Cui.
Say hello and tell us a little bit who you are.
Hey, everyone.
This is Edward.
Well, I did my undergrad study as a mechanical engineer for three years. And in the last year, you know, I took a machine learning class, and it was totally mind-blowing. I decided that's what I really wanted to do for the rest of my study, and I got my master's degree there. Later, I joined
Uber ATC or Uber ATG, the advanced technology group, which is the self-driving division of Uber.
I started at the very beginning, stayed there for three years. And after that, I worked in another AI startup for about 10 months.
And then I discovered that, you know, building infrastructure for AI, and especially
having a really good one for, you know, managing and using data, especially unstructured data
at scale, is super, super hard.
And that's why we later decided to start our own company, which is Graviti.
We are building the platform to manage unstructured data at scale.
And also we want to kind of give back to the community by, you know, open up the platform to the entire AI community where, you know, people can use our platform to host data for free forever if they want their data to be open to the rest of the world.
And anyone in the AI community
or anyone in the overall community
can, you know, freely access those data
and use those data to, you know,
to, you know, research on more machine learning topics
and work on more machine learning applications
that will in turn benefit the entire human race.
And that's the goal.
Yeah, and I'm looking at the website
and you've got a whole bunch of open data sets
in here already.
You know, I've heard people describe Graviti
as kind of like a GitHub,
but to me, it reminds me a lot of Thingiverse and the
3D printing community where you've got people uploading, you know, their own sets of data,
and then you can kind of download, remix, use, you know, explore these data sets. I imagine that
these could be used for productive purposes, but also for researchers. And it's really exciting to
see things in agriculture and autonomous driving and,
you know, design and all sorts of areas. You know, it's a pretty cool combination of data.
Yep. Yeah. So the reason we compare ourselves to GitHub is basically, well, we are a bunch of engineers,
right? Like every software engineer knows GitHub really well, right? Like we all use open
source software at some point, either in school or at
work. I think if we look at, you know, the past 30 years, right? Like we have a huge advantage.
We have a lot of innovation in the tech sector, right? The reason we can be there, the reason
there are a lot of startups, they can start their own companies is because they have free software available to them, right? And they can build, you know, their new products using those free software.
And open source software basically being the factor which kind of drive the innovation
forward in the last 30 years, right?
And if we think about what can happen in the next 30 years, like AI is definitely one of the biggest technology
that will change the entire landscape,
how people interact with computers
and how people interact with our physical world.
But what will be the similar factor
as open source to software?
And if we're looking at all the innovation in AI,
they all kind of link to open data.
For example, like Dr. Fei-Fei Li,
who is still a professor at Stanford University,
she released ImageNet.
And that became the really famous vision benchmark
for all the machine learning researchers.
They can use that vision benchmark to do studies in computer vision, image recognition.
And because of the existence of such a data set, people have the materials they need to work on new algorithms.
And because of the data set, they can use that data set as a benchmark to compare their
work.
So we know we make progress in those fields.
And to be honest, like right now, we don't have a lot of open data sets yet.
There could be more.
A lot of areas could be really interesting if someone, or some organizations or companies, can share a small portion of their data.
So people, the innovative people, the talented people,
they can use that materials to work on solutions
to solve that specific problems
that will actually benefit that sector a lot.
And we believe if we have this platform and if we have this effort to help people to prepare
open data sets and we have this platform freely available for anyone or any organization who
can host their data set for free and make open data sets more accessible, that will just benefit everyone.
Yeah, I think it's a great concept.
I mean, sharing data is part of the whole concept of AI.
So if I'm an entrepreneur and I want to start, then how do I find data?
I mean, what kind of criteria do I need to use in order to find data?
And then a follow-up question would be, you know, my concern obviously is open source data sets. It's great. I don't have to collect the data, but I also don't have control
of the quality of the data that was collected, right? So do you have metrics or something to kind of understand the quality of the data, how good or how bad the data
is? Yeah, exactly. So to answer the first question,
that's actually the question faced by a lot of the people, right?
Like when they're trying to solve a specific problem,
when they're trying to design the algorithms,
it used to be really hard for them to find such data set, right?
Like they can do Google search,
but oftentimes they really couldn't find the data set that they want to use.
So it ended up being like one person asking another person, who is kind of familiar with these matters, for help, asking what data sets they have used in their previous work.
And that's not efficient at all.
There could be some really great data out there, but people just couldn't find it.
So that's why we came up with the idea
where we have a single platform
where everyone can share the data for free,
and then it will be much easier for the user of the data
to find those data sets right away.
And that's gonna be super helpful, especially for someone who has a specific topic in mind
they want to do research on, right?
So the quality is a really important piece there.
I remember I read a paper about the quality of the open data.
Even the really famous ImageNet has about a 2% error rate in the data set, and some of the other data sets could have bigger error rates.
That's indeed a problem.
That's why we just recently launched a project with the Linux Foundation called OpenBytes.
In that effort, we kind of want to work with people in the community on, you know, bringing standards about the quality of data sets to the entire community,
helping people make procedures for how to produce really high-quality data.
To be honest, at the moment, there's really no metric.
There's no single metric yet to, you know, say, hey,
what's the quality of this data set or that data set? We all know there's
errors in there. And the problem of those errors is when we use those data as the ground truth,
but it's not actually the ground truth, then we couldn't really trust the result produced by those
data. And that creates huge problems. And that's exactly what we're trying to solve there.
And also, you know, in recent years, especially this year, Professor Andrew Ng, they gave a speech
on data centric AI, basically talking about techniques, how you can actually improve the
quality of the data. And improving a model in recent years
is harder and harder because no matter how hard you work,
you can only improve less than 1%.
But improving the quality of the data
will in turn have much better improvement
in the end-to-end settings of the entire model,
even though you don't have to change the model at all,
that the result they show is actually quite amazing.
But just get back to the question,
we are working with the Linux Foundation
on this new project called OpenBytes,
and we are working really hard with the community
to build that quality standards and
build procedures to produce really high quality data sets.
Right.
So to kind of continue a little bit on the topic.
So if I'm using one of those data sets, and if I find data that doesn't fit
the quality criteria, can I go in and maybe mark it as, you know, don't use this anymore?
Or is it really like GitHub, you know, where you have layers of versions and so
data never gets deleted?
Because another concern would be if I use this, a data set for building models, you
know, I don't want that data set to disappear, right?
That would be, that would be problematic.
Yeah. So one thing really nice about this platform
is, you know, when you use the data, right, like you train a model and oftentimes, let me just tell
you a story. Like when I worked at Uber, sometimes I had some of the data, you know, collected
from the self-driving cars. I have people who annotate that data and we train the model
and we evaluate that model
and see whether that model
really produced good result or not.
And then when it says the model
actually gets something wrong,
I look at the images,
I look at the predictions.
It turns out our original annotation was wrong.
And that's why I said sometimes you couldn't really trust the ground truth.
So the really nice thing about this platform is when you use that data to train a model,
and when you use that model to kind of, you can use that model to predict on the training
data, and you will find the data which didn't agree with your model.
And sometimes when you take a look of the data,
it could be the training data was wrong.
And then you can actually contribute
that result back to the data.
And if more and more people using the data set
to train a model,
they all train different models, right?
And those different models will basically find
what's wrong with the original training set.
And everyone can contribute back to the training set, and we can constantly improve the training set and make it
better and better. It's not like you manually mark the data as wrong; rather, the model can automatically mark
there could be some problem with the data and then if we have a lot of models, they basically do a vote
on the data.
And then we can automatically
correct the data.
And then the data can be better and better
with more and more people using it.
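The voting loop Edward describes can be sketched roughly as follows. This is only an illustrative sketch, not Graviti's actual implementation; the `predict` interface and the agreement threshold are assumptions made for the example:

```python
from collections import Counter

def flag_suspect_labels(samples, models, min_agreement=0.8):
    """Flag training samples whose stored label disagrees with a
    consensus of independently trained models.

    samples: list of (features, label) pairs from the training set
    models:  objects exposing a .predict(features) -> label method
    Returns a list of (index, stored_label, consensus_label) tuples.
    """
    suspects = []
    for i, (x, y) in enumerate(samples):
        # Each model votes on what it thinks the label should be.
        votes = Counter(m.predict(x) for m in models)
        consensus, count = votes.most_common(1)[0]
        # Only flag when a strong majority agrees on a *different* label,
        # so a single model's mistake cannot "correct" good data.
        if consensus != y and count / len(models) >= min_agreement:
            suspects.append((i, y, consensus))
    return suspects
```

Flagged samples could then be queued for human review or an automatic fix, which matches the "vote of different models" idea discussed later in the episode.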
I mean, the concept of sharing
is caring, I guess.
Do you see enterprises
reaching out to you
and saying, hey, we have large data sets.
Are you interested?
Is that something that you see happening?
We do.
We have several enterprises.
They have different reasons to share data.
Some of the enterprises, they really want to,
they create a lot of data and they want people to use that data, but they have no exposure, you know, to the
researchers, to people, to other enterprises, right? So they decided, well,
the future data is really important, is really valuable. And sometimes they
open up some of the history data on the platform. So the researchers or other enterprises can use
that data to build a prototype. And if they really like the quality of those data, they can,
you know, at the end, connect those companies and, you know, purchase those data or building
other type of collaboration effort between those two enterprises. And sometimes other enterprise
is, you know, sometimes they accumulate a huge amount of data, but they don't have the talent
to use that data. They don't even know, you know, how to deal with the data, what value they can get from the data. They just need talented people from the community to work on the data, see how they can
use the data, what value they can get off the data. And sometimes those companies get inspired
by the work carried out by the community members. And that, in the end, will turn into products
of those corporations. Those are some of the trends we have observed.
Yeah, it almost becomes a marketplace of data, a free or paid marketplace of data, right? I mean,
I could see as well, you could have a new kind of company that
is collecting data sets for the purposes of putting it out there for others to use in hopes
that then they can, you know, kind of build a business out of it, sort of a data farmer
in a way, and that's hoping that somebody else can find something else to do with the data that they're collecting, right? Yeah. So the pity is, a lot of the corporations, they actually
generate a lot of data in their normal operations, right? They don't know how to use that data. They
just throw that away. And I think one of the practices, since we are building this data platform for enterprises, right,
is that the enterprise needs to plan out that even before they
apply real machine learning applications in their organization, they have to accumulate
the data.
Because if they don't accumulate those data ahead of hiring a machine learning engineer, when they onboard that person,
he doesn't really have anything to work on, right? Like he has to wait for months and just
wait for the raw data to come in to build machine learning model on those data. So I think every
organization, they should start to accumulate data. If they produce the data, don't just throw them away.
Accumulate those data.
If they don't know how to use that data, just open up a small portion to the community.
And that will be tremendously useful for the rest of the world.
That will kind of maybe inspire a lot of innovation.
And that will change our life forever.
So this sounds really hopeful, but not to be a naysayer, but isn't there also a possibility that
the quality of the data could be driven downward by these kinds of pressures? So essentially,
if there are errors that those errors could propagate through other applications
and other users of the data, and then that could end up being contributed back into the repository
as a correction that's actually not a correction, that's actually a
wrong-direction correction? We would not really change the data based only on a single vote. Sometimes,
if there are going to be enough models, then we will use the vote of different models to
evaluate whether the data has some errors or has some issues.
To be honest, sometimes we could also have humans
involved, volunteers involved, to kind of give a second check
of the data to make sure it is indeed being corrected,
not getting worse and worse.
So there's ways and definitely we need volunteers and we'll design our product
in the way where, you know, people can come in, volunteers can come in and can give a check.
Right. I mean, I think data quality is also referring to the metadata, right? So I can
upload, you know, thousands of pictures of dogs and I can put in the metadata pictures of cats, right? I
mean, while the quality of the data, the pictures of the dogs, might be right,
you know, the fact that it says cats kind of, you know, ruins it. Do you have any concerns about
copyrights where somebody might upload some data that they don't own the copyright titles to it?
Yeah, that's a really good question.
That's actually a really, really important question
we need to solve.
And it's not just like someone uploads data
they don't own; sometimes it could be a bigger data set
that includes a portion of the data
which is actually under different licenses.
And the license cascading creates,
definitely creates a problem.
So we, in the OpenBytes project,
we actually talk with a lot of the law experts, lawyers,
and we talk with some other efforts,
for example, like the CDLA effort,
also in the Linux Foundation,
which is trying to build MIT-like licenses for open data.
You know, like for open source software,
we have Apache license, we have MIT, we have GPL.
But for, you know, open data sets,
there are a hundred different licenses
that have been used by different people,
and they're not actually designed for data sets.
They didn't really kind of say,
hey, like, who owns the raw data? Who owns the metadata?
Right? Like, for the derivatives, for the model trained using the data,
who has the ownership, who has the copyright of that model? A lot of the licenses don't really have
the necessary information to kind of rule
all those aspects of open data, right?
And in the OpenBytes effort,
we are actually talking with a bunch of different law experts
on the license issues.
And hopefully we will come up with a system
where like a set of new licenses
dedicated for open data sets
and a system where we can track
all the licenses of the data sets
and make sure like they are being properly used.
They are under proper licenses
and we want to make sure like
everyone who want to use
the dataset understand the license and know exactly what they can use the dataset for,
know what they can't really use the dataset for and that's also some of the effort we want to
collaborate with the community. So back to the previous question though,
like about the quality of the metadata,
I think that's exactly why the effort with the community
is really great because people in the communities,
they always identify those problems.
They can pop up those problems
and we will solve that problems
based on the reporting of the member
from the community.
So the question of open data and licensing also kind of begs the question, what about
data that is intentionally licensed or, you know, constrained in some way?
Can you see a future where there are open data sets with open licenses, but also
proprietary data sets with proprietary licenses, maybe even paid data sets that exist in the
Gravity or in some other repository and that are used for specific purposes by or offered for
purposes by a vendor? Yep. I can definitely see that happening. But I think we're still a little bit far away from that.
So for structured data, there's technologies to, you know, to protect the structured data.
But for unstructured data, it's a little bit harder. You know, we have new technologies like federated learning.
We also have a patent called Sandbox. We propose a way to use data in a sandbox in a safe environment
where you can train the model, you can take the model away, but you had to leave the data inside the sandbox. There's several efforts
on that front. We also think transfer learning could help in that context, but we still think
there are certain technologies that need to be developed before we can, you know, trade unstructured data.
And a follow up question to that.
Do you also see people uploading data where they basically say free for all except for governments, military, you know, ethical kind of things?
Do you see those requests also?
We have not yet, to be honest.
Yeah, we have not yet, but that's really a good
question. I can approach the community members and kind of ask them whether they see that
situation, whether they see those scenarios. We have not yet, but I believe, you know, the governments or other agencies, they may have their own private data that's not open to the entire public, to make sure no one is doing bad things using those
data. So we don't see that yet, but that's definitely something we should keep in mind
and approach the community members and kind of talk a little bit more on that.
Well, I mean, no doubt that could be written into the license.
Any arbitrary license can include arbitrary text.
As a reminder, Apple's end user license agreement
forbids any third party from using it to develop, design,
or manufacture nuclear missiles or chemical or biological weapons.
I'm not sure that that has been enforced with iTunes specifically,
but you know, I mean, it is in the license and I suppose anybody could put anything in the license
if they so chose. But yeah, I definitely think when it comes to data science and machine learning,
these kinds of ethical questions might become more pertinent than for a music sharing service.
Well, it could be. Well, you know, data is very similar to software, right? Like,
they are, you know, the building blocks of products. So I think we can learn from the
experience, you know, from the open source software to kind of guide how open data being used.
That's definitely like we learn a lot from the open source software and the entire processes.
So on that note, I think that there is definitely an analogy between what you're doing with open data sets
and with something like GitHub with open source software.
What are the, as somebody who's been in this space and done both open source software and
open data, what are some of the surprising differences between open data sets and open
source software?
What are the areas in which they are not the same?
Yeah, that's a really good question.
So contributing open data is a lot harder
than contributing open source software, right?
If you have a machine, if you understand programming,
you can always contribute to the open source software,
but collecting data sometimes is time consuming
and is also very costly.
So we see right now most of the open data are contributed by either institutions or organizations.
Individual contributors can rarely contribute open data. And also, as for the characteristics of open data,
sometimes contributing open data is important,
but more often contributing the models,
the algorithms associated with, produced,
or derived from the open data, is very important.
So it's not like open source software,
where people just write software together.
But for open data,
we create value for every single AI developer,
not just through the open data itself,
but also through the models produced from the open data
and the comparison of the models.
And that's the differences between open data
and open source software.
Yeah. And I think also that there's many more tools for software development, like linting
tools, for example, right? Where the tool doesn't really need to understand what it's doing, but
it's looking at the syntax to figure out what's going on while for data, it's a lot more challenging. And it's also, you know, what happens if you have PCI data
or social security numbers, right?
How do you figure that out?
There are no, well, actually there are tools to look for that,
but it's very time consuming, right?
Linting a 1 million lines of code happens really fast.
You know, 1 million pictures and analyzing them
takes significantly
more time.
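As a rough illustration of Frederic's point, even a minimal scan of a dataset's free-text metadata for social-security-number-like strings has to touch every record. The pattern and function below are hypothetical examples for illustration, not a complete PII detector or any tool mentioned in the episode:

```python
import re

# Matches SSN-like strings such as 123-45-6789. A real PII scan would
# use a dedicated tool with many more patterns; this is only a sketch.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_ssn_like(records):
    """Return (record_index, matched_string) pairs for SSN-like
    strings found in a list of free-text metadata records."""
    hits = []
    for i, text in enumerate(records):
        for match in SSN_PATTERN.findall(text):
            hits.append((i, match))
    return hits
```

For images rather than text, the equivalent check (faces, license plates, documents in frame) requires running a model over every picture, which is where the cost Frederic mentions really shows up.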
Yeah.
Yeah.
So in the future, we can imagine where, you know, the data are open, but
not seen by any people.
They can only be accessed by machines.
You can train a model on the data, but you cannot actually see the data
itself. So hopefully in that way, or in similar technologies, people can still use the open data,
but they cannot really just take the data away. Especially there's privacy issues with the data. There may be other issues with the data.
So we kind of see there could be technologies
to solve the problems.
Well, before we wrap up here,
I just want to give you one last moment.
Where do you see Graviti going in the future?
You know, what do you think that this is going to look like
five years from now?
Yeah, that's a really good question. So I think we saw how GitHub grew, right? Like
it went from a website just for a bunch of geeks working on Ruby on Rails to the single most popular platform for all software developers.
And we want Graviti to be a similar platform, but for all AI developers, for all the people working in machine learning, in AI.
We want Graviti to be the hub, like GitHub. And also, we know, you know, to support that community, we need a really
successful commercial product. That's why we have a similar model compared to GitHub,
which is pay for privacy, right? We have the model where, you know, if organizations or companies,
they want to manage their unstructured data at scale
inside their own organization,
internally in their own organization
and collaborate on those data,
they can pay for the software.
So we want to be successful
both in terms of the community side,
being that platform providing free services for everyone.
And we also want to be successful
in the commercial space and helping organizations of any size to be able to use their unstructured
data, to be able to use AI to accelerate their internal efforts.
Well, thank you so much. It's been a really interesting
discussion. But now comes the time in our podcast when we shift gears a little bit.
It's time for three questions, a tradition we started in season two. Note to listeners that
our guest has not been prepped on these questions ahead of time. So we're going to get some off the
cuff answers right now. This season, we're also changing the questions up a bit.
I'm going to ask one as well as Frederic, but a third question will come from a special
guest.
So Frederic, why don't you go first with your question?
So are there any jobs that will be completely eliminated by AI in the next five years?
Wow, that's a really good one. Let me think. To be honest, I've worked in the AI industry
for, you know, almost 10 years. So to me, AI is just another tool to help people do their job better. So I couldn't really see a job eliminated by AI in the
five years, but it could be, you know, in 10 years, in 20 years. So five years is not enough,
it's not long enough. And AI is still not powerful enough to, you know, eliminate jobs; rather, it will help people do their jobs better and easier.
So following on to that prediction question, and of course,
since you did mention that you worked in a self-driving car environment in the
past, I need to know,
when will we see a full self-driving car that can go anywhere,
anytime, no limits?
Wow, that's a really good question. So in the self-driving industry, we all know the self-driving car kind of launches in stages, you know, based on capabilities, right? At the very beginning, we kind of set up a zone,
a special zone, operating zone for the self-driving car
to start to operate.
And when the technology becoming more and more mature,
we kind of increase the size of that zone. So it's going to be a really gradual way
to evolve this type of technology.
It's not going to be, you know,
one day you get up in the morning
and then every car is going to, you know,
drive autonomously.
It's not going to happen, you know, in that way.
So I think to answer that question, I really don't know because, you know,
technology is so amazing. Like people working super hard.
It could be in five years, it could be in 10 years, but we'll see there.
There are going to be more and more self-driving cars on the road.
They'll just gradually start from the easy areas of the city
and operate in bigger and bigger areas.
So the third question actually comes from somebody here as well at Gestalt IT.
So Tom Hollingsworth, the networking nerd,
who was a guest on a previous episode
talking about network management with artificial intelligence, asks this question.
Hi, I'm Tom Hollingsworth, the networking nerd of Gestalt IT and Tech Field Day. And
my question is, can AI ever recognize that it's biased and learn how to overcome it? Wow, that's a really good question. I think yes, well, in some way, because, you know,
most of the AI technologies we are using today, they're based on statistics, right? And
the output of a lot of the models is not really a yes-or-no answer.
Most of the time, it's a probability. And to answer that question, I think
sometimes AI is really confident, the output of the model is really confident about something,
and sometimes it's not really confident about something, right?
And oftentimes it's basically the engineers who kind of set up a value where
if the confidence score exceeds that value, then we think it's A, or if it's lower than that value, we think it's not A, right? So based on that probability, we actually know
how confident the model is. So the model itself sometimes knows it could be making
mistakes, by outputting a relatively low probability. So I think if we design the model really carefully, and because most of the models are based on statistics, we can actually know from what the model outputs whether it has a large probability of being biased or not.
So I think the answer to that question is that the model could.
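The engineer-set cutoff Edward describes can be sketched like this. The threshold values here are arbitrary examples for illustration, not anything from the episode:

```python
def decide(prob_a, accept=0.9, reject=0.1):
    """Turn a model's probability for class A into a decision.

    Above `accept` we call it A, below `reject` we call it not-A,
    and anything in between is flagged as low-confidence so a human
    (or further checks) can look for possible bias or error.
    """
    if prob_a >= accept:
        return "A"
    if prob_a <= reject:
        return "not A"
    return "uncertain"
```

Routing the "uncertain" band to review is one simple way a system can act on the model's own signal that it may be wrong.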
Well, thank you very much for those answers. We do look forward to also hear what your questions
might be for a future guest. If you or if any of our listeners want to join this, you can just send
us an email at host at utilizingai.com and we'll record your question.
So thanks for this great conversation, Edward. It was really interesting to learn about the
world of open data sets. Where can people connect with you and follow your thoughts
on these topics and these projects you're working on? Yeah, so you can always visit graviti.com.
So there's an I instead of a Y at the end.
You can also send us email at contact at graviti.com, and also follow
us on LinkedIn or Twitter or Medium.
You can find us there.
Frederic, I've been very busy working on the Utilizing AI podcast, of course, and planning
for our next AI
Field Day event, which is scheduled for April. So what are you working on lately?
Yeah, I'm talking to many enterprises about data management and designing large-scale clusters,
right? As we know, there are many more parameters in AI models and GPUs are the way to go there.
You can find me on LinkedIn and on Twitter
as @FredericVHaren. Well, thanks for listening to the Utilizing AI podcast. If you enjoyed this
discussion, remember to subscribe, rate, and review the show on iTunes or your favorite podcast
application, since that does help. And please do share this with others. This podcast is brought
to you by gestaltit.com, your home for IT coverage from across the enterprise.
For show notes and more episodes,
go to utilizing-ai.com
or find us on Twitter at utilizing underscore AI.