Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x12: Democratizing Unstructured Data at Scale with Edward Cui of Graviti
Episode Date: November 30, 2021. Machine learning applications require massive datasets, but it can be challenging to build and store large amounts of unstructured data. In this episode of Utilizing AI, Edward Cui of Graviti discusses his creation of an open repository for unstructured data with Frederic Van Haren and Stephen Foskett. Coming from Uber's self-driving organization, Cui realized the value of data and the challenge of storing massive amounts of unstructured data, so he created the Graviti platform and made it available for free to open datasets. These datasets enable development of a variety of applications, from agriculture and environmental science to gaming and robotics. To address the challenge of data sharing and quality, Graviti is working with the Linux Foundation on the OpenBytes project. Three Questions Frederic: Are there any jobs that will be completely eliminated by AI in the next five years? Stephen: When will we see a full self-driving car that can drive anywhere, any time? Tom Hollingsworth, Gestalt IT: Can AI ever recognize that it's biased and learn how to overcome it? Guests and Hosts Edward Cui, Founder of Graviti. You can follow Graviti on LinkedIn or Twitter at @graviti_ai. You can also email Graviti at contact@graviti.com. Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren. Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett. Tags: @graviti_ai, @FredericVHaren, @SFoskett
Transcript
I'm Stephen Foskett.
I'm Frederic Van Haren.
And this is the Utilizing AI podcast.
Welcome to another episode of Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
Frederic, in the past, we've talked quite a lot about the growing sizes of AI models.
We've also talked about the growing sizes of the data sets that are underpinning those models and the challenges that people face in storing large volumes of data.
But one of the things we didn't really talk about is the challenges faced by researchers and academics and sort of non-corporate entities who are trying to create massive
unstructured data sets as well for themselves. Right. I mean, the biggest problem is that the
world is filled with unstructured data, which is obviously more difficult than dealing with
structured data. The question really is not just about collecting unstructured data, but where do you go from there? And then
additionally, AI is all about sharing. So how do you share data such that everybody benefits from
this? And I think this is a good topic to talk about and to see how you can share data, but also
how you can scale and do this efficiently. Exactly. And that's the reason that we decided to invite on our podcast,
the guest that we have today.
So I'd like to introduce the founder and machine learning expert, Edward Cui.
Say hello and tell us a little bit who you are.
Hey, everyone.
This is Edward.
Well, I did my undergrad study as a mechanical engineer for three years. And in the last year, you know, I took a machine learning class, and it was totally mind-blowing. I decided that's what I really wanted to do for the rest of my study, and I got my master's degree there. Later, I joined
Uber ATC or Uber ATG, the advanced technology group, which is the self-driving division of Uber.
I started at the very beginning, stayed there for three years. And after that, I worked in another AI startup for about 10 months.
And then I discovered that, you know, building infrastructure for AI, and especially
having a really good one for, you know, managing and using data, especially unstructured data
at scale, is super, super hard.
And that's why we later decided to start our own company, which is Graviti.
We are building the platform to manage unstructured data at scale.
And also we want to kind of give back to the community by, you know, open up the platform to the entire AI community where, you know, people can use our platform to host data for free forever if they want their data to be open to the rest of the world.
And anyone in the AI community
or anyone in the overall community
can, you know, freely access those data
and use those data to, you know,
to, you know, research on more machine learning topics
and work on more machine learning applications
that will in turn benefit the entire human race.
And that's the goal.
Yeah, and I'm looking at the website
and you've got a whole bunch of open data sets
in here already.
You know, I've heard people describe Graviti
as kind of like a GitHub,
but to me, it reminds me a lot of Thingiverse and the
3D printing community where you've got people uploading, you know, their own sets of data,
and then you can kind of download, remix, use, you know, explore these data sets. I imagine that
these could be used for productive purposes, but also for researchers. And it's really exciting to
see things in agriculture and autonomous driving and,
you know, design and all sorts of areas. You know, it's a pretty cool combination of data.
Yep. Yeah. So the reason we compare ourselves to GitHub is basically, well, we are a bunch of engineers,
right? Like every software engineer knows GitHub really well, right? Like we all use open
source software at some point, either in school or at
work. I think if we look at, you know, the past 30 years, right? Like we have a huge advantage.
We have a lot of innovation in the tech sector, right? The reason we can be there, the reason
there are a lot of startups, they can start their own companies is because they have free software available to them, right? And they can build, you know, their new products using those free software.
And open source software basically being the factor which kind of drive the innovation
forward in the last 30 years, right?
And if we think about what can happen in the next 30 years, like AI is definitely one of the biggest technology
that will change the entire landscape,
how people interact with computers
and how people interact with our physical world.
But what will be the similar factor
as open source to software?
And if we're looking at all the innovation in AI,
they all kind of link to open data.
For example, like Dr. Fei-Fei Li,
who is still a professor at Stanford University,
she released ImageNet.
And that became the really famous vision benchmark
for all the machine learning researchers.
They can use that vision benchmark to do studies in computer vision, image recognition.
And because of the existence of such a data set, people have the materials they need to work on new algorithms.
And because of the data set, they can use that data set as a benchmark to compare their
work.
So we know we make progress in those fields.
And to be honest, like right now, we don't have a lot of open data sets yet.
There could be more.
A lot of areas could be really interesting if someone, or some organizations or companies, can share a small portion of their data.
So people, the innovative people, the talented people,
they can use that materials to work on solutions
to solve that specific problems
that will actually benefit that sector a lot.
And we believe if we have this platform and if we have this effort to help people to prepare
open data sets and we have this platform freely available for anyone or any organization who
can host their data set for free and make open data sets more accessible, that will just benefit everyone.
Yeah, I think it's a great concept.
I mean, sharing data is part of the whole concept of AI.
So if I'm an entrepreneur and I want to start, then how do I find data?
I mean, what kind of criteria do I need to use in order to find data?
And then a follow-up question would be, you know, my concern obviously is open source data sets. It's great. I don't have to collect the data, but I also don't have control
of the quality of the data that was collected, right? So do you have metrics or something to kind of understand the quality of the data, how good or how bad the data
is? Yeah, exactly. So to answer the first question,
that's actually the question faced by a lot of the people, right?
Like when they're trying to solve a specific problem,
when they're trying to design the algorithms,
it used to be really hard for them to find such data set, right?
Like they can do Google search,
but oftentimes they really couldn't find the data set that they want to use.
So it ended up being like one person asking another person, who is kind of familiar with these matters, for help, asking what data sets they have used in their previous work.
And that's not efficient at all.
There could be some really great data out there, but people just couldn't find it.
So that's why we came up with the idea
where we have a single platform
where everyone can share the data for free,
and then it will be much easier for the user of the data
to find those data sets right away.
And that's gonna be super helpful, especially for someone who has a specific topic in mind
they want to do research on, right?
So the quality is a really important piece there.
I remember I read a paper about the quality of the open data.
Even the really famous ImageNet has about a 2% error rate in the data set, and some of the other data sets could have bigger error rates.
That's indeed a problem.
That's why we just recently launched a project with the Linux Foundation called OpenBytes.
In that effort, we kind of want to work with people in the community on, you know, bringing standards about the quality of data sets to the entire community,
helping people make procedures for how to produce really high-quality data.
To be honest, at the moment, there's really no metric.
There's no single metric yet to, you know, say, hey,
what's the quality of this data set or that data set? We all know there's
errors in there. And the problem of those errors is when we use those data as the ground truth,
but it's not actually the ground truth, then we couldn't really trust the result produced by those
data. And that creates huge problems. And that's exactly what we're trying to solve there.
And also, you know, in recent years, especially this year, Professor Andrew Ng, they gave a speech
on data centric AI, basically talking about techniques, how you can actually improve the
quality of the data. And improving a model in recent years
is harder and harder because no matter how hard you work,
you can only improve less than 1%.
But improving the quality of the data
will in turn have much better improvement
in the end-to-end settings of the entire model,
even though you don't have to change the model at all,
that the result they show is actually quite amazing.
But just get back to the question,
we are working with the Linux Foundation
on this new project called OpenBytes,
and we are working really hard with the community
to build that quality standards and
build procedures to produce really high quality data sets.
Right.
So to kind of continue a little bit on the topic.
So if I'm using one of those data sets, and if I find data that doesn't fit
the quality criteria, can I go in and maybe mark it as, you know, don't use this anymore?
Or is it really like GitHub, you know, where you have layers of versions and so
data never gets deleted?
Because another concern would be if I use this, a data set for building models, you
know, I don't want that data set to disappear, right?
That would be, that would be problematic.
Yeah. So one thing really nice about this platform
is, you know, when you use the data, right, like you train a model and oftentimes, let me just tell
you a story. Like when I worked at Uber, sometimes I had some of the data, you know, collected
from the self-driving cars. I have people who annotate that data and we train the model
and we evaluate that model
and see whether that model
really produced good result or not.
And then when it says the model
actually gets something wrong,
I look at the images,
I look at the predictions.
It turns out our original annotation was wrong.
And that's why I said sometimes you couldn't really trust the ground truth.
So the really nice thing about this platform is when you use that data to train a model,
and when you use that model to kind of, you can use that model to predict on the training
data, and you will find the data which didn't agree with your model.
And sometimes when you take a look of the data,
it could be the training data was wrong.
And then you can actually contribute
that result back to the data.
And if more and more people using the data set
to train a model,
they all train different models, right?
And those different models will basically find
what's wrong with the original training set.
And everyone can contribute back to the training set, and we can constantly improve the training set and make it
better and better. It's not like you manually mark the data as wrong; rather, the model can automatically mark
there could be some problem with the data and then if we have a lot of models, they basically do a vote
on the data.
And then we can automatically
correct the data.
And then the data can be better and better
with more and more people using it.
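The voting loop Edward describes can be sketched roughly as follows. This is only an illustrative sketch, not Graviti's actual implementation; the `predict` interface and the agreement threshold are assumptions made for the example:

```python
from collections import Counter

def flag_suspect_labels(samples, models, min_agreement=0.8):
    """Flag training samples whose stored label disagrees with a
    consensus of independently trained models.

    samples: list of (features, label) pairs from the training set
    models:  objects exposing a .predict(features) -> label method
    Returns a list of (index, stored_label, consensus_label) tuples.
    """
    suspects = []
    for i, (x, y) in enumerate(samples):
        # Each model votes on what it thinks the label should be.
        votes = Counter(m.predict(x) for m in models)
        consensus, count = votes.most_common(1)[0]
        # Only flag when a strong majority agrees on a *different* label,
        # so a single model's mistake cannot "correct" good data.
        if consensus != y and count / len(models) >= min_agreement:
            suspects.append((i, y, consensus))
    return suspects
```

Flagged samples could then be queued for human review or an automatic fix, which matches the "vote of different models" idea discussed later in the episode.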
I mean, the concept of sharing
is caring, I guess.
Do you see enterprises
reaching out to you
and saying, hey, we have large data sets.
Are you interested?
Is that something that you see happening?
We do.
We have several enterprises.
They have different reasons to share data.
Some of the enterprises, they really want to,
they create a lot of data and they want people to use that data, but they have no exposure, you know, to the
researchers, to people, to other enterprises, right? So they decided, well,
the future data is really important, is really valuable. And sometimes they
open up some of the history data on the platform. So the researchers or other enterprises can use
that data to build a prototype. And if they really like the quality of those data, they can,
you know, at the end, connect those companies and, you know, purchase those data or building
other type of collaboration effort between those two enterprises. And sometimes other enterprise
is, you know, sometimes they accumulate a huge amount of data, but they don't have the talent
to use that data. They don't even know, you know, how to deal with the data, what value they can get from the data. They just need talented people from the community to work on the data, see how they can
use the data, what value they can get off the data. And sometimes those companies get inspired
by the work carried out by the community members. And that, in the end, will turn into products
of those corporations. Those are some of the trends we have observed.
Yeah, it almost becomes a marketplace of data, a free or paid marketplace of data, right? I mean,
I could see as well, you could have a new kind of company that
is collecting data sets for the purposes of putting it out there for others to use in hopes
that then they can, you know, kind of build a business out of it, sort of a data farmer
in a way, and that's hoping that somebody else can find something else to do with the data that they're collecting, right? Yeah. So the pity is, a lot of the corporations, they actually
generate a lot of data in their normal operations, right? They don't know how to use that data. They
just throw that away. And I think one of the practices, since we are building this data platform for enterprises, right,
is that the enterprise needs to plan out that even before they
apply real machine learning applications in their organization, they have to accumulate
the data.
Because if they don't accumulate those data ahead of hiring a machine learning engineer, when they onboard that person,
he doesn't really have anything to work on, right? Like he has to wait for months and just
wait for the raw data to come in to build machine learning model on those data. So I think every
organization, they should start to accumulate data. If they produce the data, don't just throw them away.
Accumulate those data.
If they don't know how to use that data, just open up a small portion to the community.
And that will be tremendously useful for the rest of the world.
That will kind of maybe inspire a lot of innovation.
And that will change our life forever.
So this sounds really hopeful, but not to be a naysayer, but isn't there also a possibility that
the quality of the data could be driven downward by these kinds of pressures? So essentially,
if there are errors that those errors could propagate through other applications
and other users of the data, and then that could end up being contributed back into the repository
as a correction that's actually not a correction, that's actually a
wrong-direction correction? We would not really change the data based only on a single vote. Sometimes,
if there are going to be enough models, then we will use the vote of different models to
evaluate whether the data has some errors or has some issues.
To be honest, sometimes we could also have humans
involved, volunteers involved, to kind of give a second check
of the data to make sure it is indeed being corrected,
not getting worse and worse.
So there's ways and definitely we need volunteers and we'll design our product
in the way where, you know, people can come in, volunteers can come in and can give a check.
Right. I mean, I think data quality is also referring to the metadata, right? So I can
upload, you know, thousands of pictures of dogs and I can put in the metadata pictures of cats, right? I
mean, while the quality of the data, the pictures of the dogs, might be right,
you know, the fact that it says cats kind of, you know, ruins it. Do you have any concerns about
copyrights where somebody might upload some data that they don't own the copyright titles to it?
Yeah, that's a really good question.
That's actually a really, really important question
we need to solve.
And it's not just like someone uploads data
they don't own; sometimes it could be a bigger data set
that includes a portion of the data
which is actually under different licenses.
And the license cascading creates,
definitely creates a problem.
So we, in the OpenBytes project,
we actually talk with a lot of the law experts, lawyers,
and we talk with some other efforts,
for example, like the CDLA effort,
also in the Linux Foundation,
which is trying to build MIT-like licenses for open data.
You know, like for open source software,
we have Apache license, we have MIT, we have GPL.
But for, you know, open data sets,
there are a hundred different licenses
that have been used by different people,
and they're not actually designed for data sets.
They didn't really kind of say,
hey, like, who owns the raw data? Who owns the metadata?
Right? Like, for the derivatives, for the model trained using the data,
who has the ownership, who has the copyright of that model? A lot of the licenses don't really have
the necessary information to kind of rule
all those aspects of open data, right?
And in the OpenBytes effort,
we are actually talking with a bunch of different law experts
on the license issues.
And hopefully we will come up with a system
where like a set of new licenses
dedicated for open data sets
and a system where we can track
all the licenses of the data sets
and make sure like they are being properly used.
They are under proper licenses
and we want to make sure like
everyone who want to use
the dataset understand the license and know exactly what they can use the dataset for,
know what they can't really use the dataset for and that's also some of the effort we want to
collaborate with the community. So back to the previous question though,
like about the quality of the metadata,
I think that's exactly why the effort with the community
is really great because people in the communities,
they always identify those problems.
They can pop up those problems
and we will solve that problems
based on the reporting of the member
from the community.
So the question of open data and licensing also kind of begs the question, what about
data that is intentionally licensed or, you know, constrained in some way?
Can you see a future where there are open data sets with open licenses, but also
proprietary data sets with proprietary licenses, maybe even paid data sets that exist in the
Gravity or in some other repository and that are used for specific purposes by or offered for
purposes by a vendor? Yep. I can definitely see that happening. But I think we're still a little bit far away from that.
So for structured data, there's technologies to, you know, to protect the structured data.
But for unstructured data, it's a little bit harder. You know, we have new technologies like federated learning.
We also have a patent called Sandbox. We propose a way to use data in a sandbox in a safe environment
where you can train the model, you can take the model away, but you had to leave the data inside the sandbox. There's several efforts
on that front. We also think transfer learning could help in that context, but we still think
there are certain technologies that need to be developed before we can, you know, trade unstructured data.
And a follow up question to that.
Do you also see people uploading data where they basically say free for all except for governments, military, you know, ethical kind of things?
Do you see those requests also?
We have not yet, to be honest.
Yeah, we have not yet, but that's really a good
question. I can approach the community members and kind of ask them whether they see that
situation, whether they see those scenarios. We have not yet, but I believe, you know, the governments or other agencies, they may have their own private data that's not open to the entire public, to make sure no one is doing bad things using those
data. So we don't see that yet, but that's definitely something we should keep in mind
and approach the community members and kind of talk a little bit more on that.
Well, I mean, no doubt that could be written into the license.
Any arbitrary license can include arbitrary text.
As a reminder, Apple's end user license agreement
forbids any third party from using it to develop, design,
or manufacture nuclear missiles or chemical or biological weapons.
I'm not sure that that has been enforced with iTunes specifically,
but you know, I mean, it is in the license and I suppose anybody could put anything in the license
if they so chose. But yeah, I definitely think when it comes to data science and machine learning,
these kinds of ethical questions might become more pertinent than for a music sharing service.
Well, it could be. Well, you know, data is very similar to software, right? Like,
they are, you know, the building blocks of products. So I think we can learn from the
experience, you know, from the open source software to kind of guide how open data being used.
That's definitely like we learn a lot from the open source software and the entire processes.
So on that note, I think that there is definitely an analogy between what you're doing with open data sets
and with something like GitHub with open source software.
What are the, as somebody who's been in this space and done both open source software and
open data, what are some of the surprising differences between open data sets and open
source software?
What are the areas in which they are not the same?
Yeah, that's a really good question.
So contributing open data is a lot harder
than contributing open source software, right?
If you have a machine, if you understand programming,
you can always contribute to the open source software,
but collecting data sometimes is time consuming
and is also very costly.
So we see right now most of the open data are contributed by either institutions or organizations.
Individual contributors can rarely contribute open data. And also, as for the characteristics of open data,
sometimes contributing open data is important,
but more often contributing the models,
the algorithms associated with, produced,
or derived from the open data, is very important.
So it's not like open source software,
where people just write software together.
But for open data,
we create value for every single AI developer,
not just through the open data itself,
but also through the models produced from the open data
and the comparison of the models.
And that's the differences between open data
and open source software.
Yeah. And I think also that there's many more tools for software development, like linting
tools, for example, right? Where the tool doesn't really need to understand what it's doing, but
it's looking at the syntax to figure out what's going on while for data, it's a lot more challenging. And it's also, you know, what happens if you have PCI data
or social security numbers, right?
How do you figure that out?
There are no, well, actually there are tools to look for that,
but it's very time consuming, right?
Linting a 1 million lines of code happens really fast.
You know, 1 million pictures and analyzing them
takes significantly
more time.
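As a rough illustration of Frederic's point, even a minimal scan of a dataset's free-text metadata for social-security-number-like strings has to touch every record. The pattern and function below are hypothetical examples for illustration, not a complete PII detector or any tool mentioned in the episode:

```python
import re

# Matches SSN-like strings such as 123-45-6789. A real PII scan would
# use a dedicated tool with many more patterns; this is only a sketch.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_ssn_like(records):
    """Return (record_index, matched_string) pairs for SSN-like
    strings found in a list of free-text metadata records."""
    hits = []
    for i, text in enumerate(records):
        for match in SSN_PATTERN.findall(text):
            hits.append((i, match))
    return hits
```

For images rather than text, the equivalent check (faces, license plates, documents in frame) requires running a model over every picture, which is where the cost Frederic mentions really shows up.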
Yeah.
Yeah.
So in the future, we can imagine where, you know, the data are open, but
not seen by any people.
They can only be accessed by machines.
You can train a model on the data, but you cannot actually see the data
itself. So hopefully in that way, or in similar technologies, people can still use the open data,
but they cannot really just take the data away. Especially there's privacy issues with the data. There may be other issues with the data.
So we kind of see there could be technologies
to solve the problems.
Well, before we wrap up here,
I just want to give you one last moment.
Where do you see Graviti going in the future?
You know, what do you think that this is going to look like
five years from now?
Yeah, that's a really good question. So I think we saw how GitHub grew, right? Like
it went from a website just for a bunch of geeks working on Ruby on Rails to the single most popular platform for all software developers.
And we want Graviti to be a similar platform, but for all AI developers, for all the people working in machine learning, in AI.
We want Graviti to be the hub, like GitHub. And also, we know, you know, to support that community, we need a really
successful commercial product. That's why we have a similar model compared to GitHub,
which is pay for privacy, right? We have the model where, you know, if organizations or companies,
they want to manage their unstructured data at scale
inside their own organization,
internally in their own organization
and collaborate on those data,
they can pay for the software.
So we want to be successful
both in terms of the community side,
being that platform providing free services for everyone.
And we also want to be successful
in the commercial space and helping organizations of any size to be able to use their unstructured
data, to be able to use AI to accelerate their internal efforts.
Well, thank you so much. It's been a really interesting
discussion. But now comes the time in our podcast when we shift gears a little bit.
It's time for three questions, a tradition we started in season two. Note to listeners that
our guest has not been prepped on these questions ahead of time. So we're going to get some off the
cuff answers right now. This season, we're also changing the questions up a bit.
I'm going to ask one as well as Frederic, but a third question will come from a special
guest.
So Frederic, why don't you go first with your question?
So are there any jobs that will be completely eliminated by AI in the next five years?
Wow, that's a really good one. Let me think. To be honest, I've worked in the AI industry
for, you know, almost 10 years. So to me, AI is just another tool to help people do their job better. So I couldn't really see a job eliminated by AI in the
five years, but it could be, you know, in 10 years, in 20 years. So five years is not enough,
it's not long enough. And AI is still not powerful enough to, you know, eliminate jobs; rather, it will help people do their jobs better and easier.
So following on to that prediction question, and of course,
since you did mention that you worked in a self-driving car environment in the
past, I need to know,
when will we see a full self-driving car that can go anywhere,
anytime, no limits?
Wow, that's a really good question. So in the self-driving industry, we all know the self-driving car kind of launches in stages, you know, based on capabilities, right? At the very beginning, we kind of set up a zone,
a special zone, operating zone for the self-driving car
to start to operate.
And when the technology becoming more and more mature,
we kind of increase the size of that zone. So it's going to be a really gradual way
to evolve this type of technology.
It's not going to be, you know,
one day you get up in the morning
and then every car is going to, you know,
drive autonomously.
It's not going to happen, you know, in that way.
So I think to answer that question, I really don't know because, you know,
technology is so amazing. Like people working super hard.
It could be in five years, it could be in 10 years, but we'll see there.
There are going to be more and more self-driving cars on the road.
They'll just gradually start from the easy areas of the city
and operate in bigger and bigger areas.
So the third question actually comes from somebody here as well at Gestalt IT.
So Tom Hollingsworth, the networking nerd,
who was a guest on a previous episode
talking about network management with artificial intelligence, asks this question.
Hi, I'm Tom Hollingsworth, the networking nerd of Gestalt IT and Tech Field Day. And
my question is, can AI ever recognize that it's biased and learn how to overcome it? Wow, that's a really good question. I think yes, well, in some way, because, you know,
most of the AI technologies we are using today, they're based on statistics, right? And
the output of a lot of the models is not really a yes-or-no answer.
Most of the time, it's a probability. And to answer that question, I think
sometimes AI is really confident, the output of the model is really confident about something,
and sometimes it's not really confident about something, right?
And oftentimes it's basically the engineers who kind of set up a value where
if the confidence score exceeds that value, then we think it's A, or if it's lower than that value, we think it's not A, right? So based on that probability, we actually know
how confident the model is. So the model itself sometimes knows it could be making
mistakes, by outputting a relatively low probability. So I think if we design the model really carefully, and because most of the models are based on statistics, we can actually know from what the model outputs whether it has a large probability of being biased or not.
So I think the answer to that question is that the model could.
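The engineer-set cutoff Edward describes can be sketched like this. The threshold values here are arbitrary examples for illustration, not anything from the episode:

```python
def decide(prob_a, accept=0.9, reject=0.1):
    """Turn a model's probability for class A into a decision.

    Above `accept` we call it A, below `reject` we call it not-A,
    and anything in between is flagged as low-confidence so a human
    (or further checks) can look for possible bias or error.
    """
    if prob_a >= accept:
        return "A"
    if prob_a <= reject:
        return "not A"
    return "uncertain"
```

Routing the "uncertain" band to review is one simple way a system can act on the model's own signal that it may be wrong.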
Well, thank you very much for those answers. We do look forward to also hear what your questions
might be for a future guest. If you or if any of our listeners want to join this, you can just send
us an email at host at utilizingai.com and we'll record your question.
So thanks for this great conversation, Edward. It was really interesting to learn about the
world of open data sets. Where can people connect with you and follow your thoughts
on these topics and these projects you're working on? Yeah, so you can always visit graviti.com.
So there's an I instead of a Y at the end.
You can also send us email at contact at graviti.com, and also follow
us on LinkedIn or Twitter or Medium.
You can find us there.
Frederic, I've been very busy working on the Utilizing AI podcast, of course, and planning
for our next AI
Field Day event, which is scheduled for April. So what are you working on lately?
Yeah, I'm talking to many enterprises about data management and designing large-scale clusters,
right? As we know, there are many more parameters in AI models and GPUs are the way to go there.
You can find me on LinkedIn and on Twitter
as @FredericVHaren. Well, thanks for listening to the Utilizing AI podcast. If you enjoyed this
discussion, remember to subscribe, rate, and review the show on iTunes or your favorite podcast
application, since that does help. And please do share this with others. This podcast is brought
to you by gestaltit.com, your home for IT coverage from across the enterprise.
For show notes and more episodes,
go to utilizing-ai.com
or find us on Twitter at utilizing underscore AI.