Epicenter - Learn about Crypto, Blockchain, Ethereum, Bitcoin and Distributed Technologies - Andrew Trask: OpenMined – A Decentralised Artificial Intelligence Platform
Episode Date: January 11, 2018

A significant part of the modern digital economy is underpinned by machine learning models trained to perform tasks such as facial recognition, content curation, and health diagnostics. Data to train machine learning models is the essential commodity of this century – a sentiment captured by epithets such as "Data is the new oil". Today's dominant AI paradigm has companies focus their efforts on gathering data from their users in order to train models and monetise usage of those models. This paradigm has many consequences: loss of privacy for the user, consolidation of data in a handful of large companies, poor access to data for startups, and the fundamental impossibility of collecting sensitive data such as markers for depression. Our guest, Andrew Trask, is building OpenMined – a platform that merges cryptographic techniques such as homomorphic encryption and multi-party computation with blockchain technology to make it possible to train ML models on private user data. OpenMined will allow the AI companies of the future to develop models, have them trained on user data without compromising user privacy, and incentivise users to train their models. We walk through the OpenMined vision and its potential impact on AI business models and AI safety.

Topics covered in this episode:
- Challenges with the current AI paradigm
- OpenMined's vision to allow training of AI models with private user data
- How OpenMined works under the hood
- Applications enabled by OpenMined
- Current state and the OpenMined hackathon

Episode links:
- OpenMined hackathon
- OpenMined website

This episode is hosted by Brian Fabian Crain and Meher Roy. Show notes and listening options: epicenter.tv/217
Transcript
This is Epicenter Episode 217 with guest Andrew Trask.
Epicenter, the show which talks about the technologies, projects and startups
driving decentralization and the global blockchain revolution.
My name is Brian Fabian Crain.
And I'm Meher Roy.
Today we are talking to Andrew Trask of OpenMined.
OpenMined is a very interesting project that uses tools from cryptography and blockchains
in order to create decentralized AI that can use private data from different people around the world to train AI models without compromising
the privacy of the people contributing the data. Andrew, welcome to the show.
Thanks. Great to be here.
So before we start, tell us a bit about your background and how you came to create OpenMined.
Yeah. So I'm a PhD student at the University of Oxford. And before I was a PhD
student, I worked for a really exciting company based in the U.S. called
Digital Reasoning. They specialized in on-prem enterprise deep learning and
AI services, and they still do. And one of the things I came to really appreciate
then was how difficult it was to get access to private data. And so during my
time here as a PhD student I got really involved in AI safety conversations, and
really how OpenMined got started was a blog post looking for the right tools to
build safe AI and to try to
contribute to that conversation. That's how I came across multi-party computation and homomorphic encryption,
and there's a blog post out from last March. And then we opened up the GitHub repositories
in, I guess, July of last year. And we've been writing code ever since. So tell us, like,
why is cryptography, like homomorphic encryption and multi-party computation relevant to safe
AI and AI in general?
Yeah, that's actually a great question.
And for, I think, most of their existence, the conversations around AI and machine learning on the one hand and these cryptography tools on the other have sort of lived in isolation.
I mean, there's been a handful of workshops at like NIPS or ICML where there'll be a few papers that discuss private machine learning.
But it's still a really small field, but growing quickly.
But the reason that it's important is that
Cryptography is all about protecting really valuable digital assets, right?
And also limiting the use of various digital assets.
So in this most simple case with just asymmetric encryption,
I have a data file, I want to be able to put it somewhere,
and I want only me to be able to do stuff with it
and only me to be able to access it.
But in the context of machine learning,
often you want to have something that's a little bit more nuanced.
You want to have the ability to just do training,
or just do prediction.
Or you only want to be able to do prediction if a lot of people are participating.
Or you have some sort of digital asset that you want to interact with, but you don't want
to be able to interact with in arbitrary ways.
And that's where the conversation between kind of cryptography and machine learning really
gets started.
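As one concrete illustration of the nuanced capabilities being described here, additive secret sharing is a basic building block of the multi-party computation mentioned earlier. This is a toy sketch for intuition, not OpenMined's actual code:

```python
# Minimal additive secret sharing over a prime field (illustration only;
# real MPC protocols add authentication, communication, and more).
import random

Q = 2**61 - 1  # a prime modulus; all arithmetic happens mod Q

def share(secret, n_parties=3):
    """Split `secret` into n random shares that sum to it mod Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# No single share reveals anything about a secret, but parties can
# add their shares locally to compute a sum without seeing the inputs.
a_shares = share(42)
b_shares = share(58)
sum_shares = [(x + y) % Q for x, y in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 100
```

Because the shares only reveal anything when combined, a computation can be gated on "a lot of people participating", which is exactly the kind of limited ability discussed above.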
So private machine learning as a field is all about either: I need to put
a model in a place where people might try to steal it,
or I have data that I need to learn things about without actually having a copy of it.
And so the blog post that really kicked off OpenMined was talking about homomorphically encrypted models,
where I can take a model that's fully encrypted and I can give it to you,
and you can make it smarter, but you don't have the ability to predict with it.
And so it's sort of like a one-way gate for intelligence, right?
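The "one-way gate" rests on additive homomorphism: anyone can add to (or scale) an encrypted value without being able to read it. A deliberately insecure toy scheme makes the point; this is not the scheme OpenMined uses, but secure variants such as Paillier behave analogously:

```python
# Toy symmetric additively homomorphic encryption (INSECURE, intuition only).
# Ciphertext = m + KEY * r, so sums and integer scalings of ciphertexts
# decrypt to the corresponding sums/scalings of the plaintexts.
import random

KEY = (1 << 127) - 1  # secret; plaintexts must stay far below KEY

def enc(m):
    return m + KEY * random.randrange(1, 1 << 64)

def dec(c):
    m = c % KEY
    return m - KEY if m > KEY // 2 else m  # recover signed plaintexts

# A party holding only ciphertexts can still apply a weight update:
w_enc = enc(1000)          # encrypted model weight (fixed-point integer)
grad_enc = enc(-30)        # encrypted gradient computed against it
w_enc = w_enc + grad_enc   # homomorphic update: no decryption needed
print(dec(w_enc))          # 970 -- only the key holder can see this
```

This is the sense in which someone can make an encrypted model "smarter" while only the key holder can ever predict with it.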
And there are various long-standing problems in this field, so the biggest one being efficiency.
But that's sort of what the conversation is all about, right?
I want to have limited ability so that the people who are participating in AI machine learning are protected in various ways.
So you mentioned briefly AI safety.
Now, I'm sort of familiar with what that is.
probably many people are, but for those who aren't and aren't so deep into machine learning,
can you define what that is?
Yeah, I'm going to do the best that I can.
I would not call myself primarily an AI safety researcher.
So the folks I know who are real AI safety researchers might correct me on this.
But AI safety really deals with the problem of:
how do we as individuals with, if we're lucky, an IQ of 150,
limit the abilities and the values and the goals of something with an IQ of 15,000?
And it's actually closely related to a lot of concepts,
funny enough, in political science,
where you have a large population,
and you're trying to make it so that these few people at the top
that have control of a lot of resources act in the interest of the population.
And so most of the AI safety conversation is around that.
We're very close to AI safety in that we try to innovate on the tools that you would use to do that.
I think that the most precise term for what we work on is probably AI security.
And to the extent that we're trying to build really good boxes and really good tools that allow you to do different things
and kind of give the AI safety community more knobs that they can turn.
Right.
So in the case of narrow AI, one of those knobs that we're trying
to bring online into general knowledge and ease of use is the ability to train machine
learning models on data that you can't see and don't have
access to. And that's a really good example of us building a new tool that is then valuable
for the AI safety community in the narrow AI space, and obviously we'd like to keep going into the
general AI safety space as well.
Yeah, maybe we can come back to that in a second, but I just sort of would like to understand the big picture now.
You know, we kind of started talking immediately about this idea of how OpenMined approaches these things, but maybe to give a bit more context as well: there's a lot of news these days, right, a lot of discussion and discourse around the issue that a few companies like Google, Facebook, and Amazon are acquiring huge amounts of data
and acquiring a huge amount of power, right?
We also see political backlash against this.
And in Europe, for example, with things like this right to be forgotten
or, like, data protection laws and stuff like that.
So what do you see kind of as the current state, like where the trajectory is,
and how does that fit into OpenMined?
Where the rubber really meets the road, it's a big, hairy issue.
There's a lot of players involved.
There's a lot of conflicting ideas.
And probably the biggest theme of 2017 is regulators really getting their hands around it.
There was big research interest from the U.S. Congress and Parliament here in the UK,
as well as the European Parliament, obviously, with GDPR.
I think that the way that I see it, the general theme is that a new asset came into the marketplace, right, data.
And with tools like for big data becoming widely accessible, companies started collecting it.
And this went largely unregulated, like all new tools.
And now we're starting to learn a little bit about the externalities of how it should be protected.
You know, governments tend to be a little bit more responsive and reactive than proactive.
So I think what we've seen is that this incredibly valuable asset has been aggregated, sort of like maybe oil was when it
first came online, you know, back around Andrew Carnegie, oil and steel, all those kinds of things.
And we're sort of rounding the corner, just like they did back then, to significantly
more regulation. Now, as far as where OpenMined fits in with that, I've watched many of these
regulatory talks. And the one tragic thing is that none of the
regulators seem to be aware of the notion that you can train models
without having access to data.
Like the conversation almost always boils down to,
well, we have this tradeoff between innovation and between privacy.
How do we know how much privacy to give up and innovation to receive?
And the existence of these tools is just not reaching the public discourse.
And so a huge part of what we're trying to do in the OpenMined project is just let people know
that this field, private machine learning, exists, that
many of the tools are ready for practical use,
and we're building software to make that as accessible as possible.
So in the short run, if we're very successful as a community,
you know, that discourse will change
because people will actually realize
that you don't have this nasty trade-off
between privacy and innovation with kind of these new pieces of technology.
Okay, no, that's, I think that's very helpful and makes sense.
Now, maybe one last thing on that, which I don't fully get:
can you elaborate a little bit on the connection between private AI and AI safety?
Because, you know, you explained, right, why multi-party computation and these things obviously have implications for how AI works today and this centralization of data and power.
But how does that relate to safe AI?
So the real difference between those two things is narrow AI versus general AI.
So AI safety is mostly concerned with general AI, meaning they're concerned with something having an exorbitantly high IQ that threatens humanity, right?
But the privacy conversation is mostly about narrow AI.
So it's about, you know, the things that industry has optimized certain business use cases for, that leverage private data, and that may or may not also be defending the interests of the people who own that private data, right?
So they are slightly different, but the overlap is primarily
in the tools that are involved, because the tools that you use to mitigate the
tradeoffs in narrow AI, we hope, will have broad implications
as the conversation around AI safety for general AI unfolds.
Because it's also a matter of what's tractable now and what we
expect to be important in AI in the future.
Yeah, maybe one thought on that.
I mean, I don't know much about AI,
but when I was reading maybe some sort of science fiction novel,
like AI takes over the world and stuff,
it seemed pretty obvious to me how blockchain could play a huge role in that,
that you could have maybe some kind of, you know,
reputation system managed on a chain, right,
where there could be some kind of,
you know, way to turn off access for these AIs or like some kind of economic management layer,
right, that kind of controls AI and puts some sort of governance over it.
Do you see that also as a direction in the long term to handle this AI safety problem?
Yes, but maybe not in the way that is most obvious from science fiction literature.
The thing that blockchain brings to the table in terms of governance
is nothing new on the theoretical governance side, right? You still have the
same problems with value alignment: how do you get these super smart things to have the
same values as us? Which is the same as asking how we get our political leaders to have the same
values as us. So even for the normal kind of political alignment conversation, it's not like
there are new themes. What blockchain really brings, potentially, is tremendous
amounts of liquidity and transparency to that conversation. So whereas right now we vote on
basically everything for the whole year with this one very discrete event, right, which is when
we vote for one political leader to take power. And you could think of, you know,
there could be scenarios in the future for AGI where, you know, you could have very
discretized events for how you make your say and try to have the
AI be aligned to your own values. The really nice thing about blockchain is that you
can make that a lot more nuanced, right? Because you can do lots of
transactions really, really quickly. So particularly with AI, what
I'm really excited about is the ability that, instead of
saying, oh, this AGI did something bad, let's pass a bill and turn it off,
you could say, hey, this AGI is behaving, you know, its values are misaligned a little bit in
this direction, let's curb the resources on that side and give it a little bit more reward on
this side, and sort of direct it towards where we want it to go, right? So when I say
increased liquidity, I mean increased nuance in the way that we can interact with the two
main resources that it lives on, right, which is compute and data. And so that's kind of
why I'm most excited about blockchain participating in the AI safety conversation.
It gives us, in theory, much more fine-grained control than traditional governance
structures based on hierarchical voting schemes do. And I think that's net new. That's very
interesting. And obviously the interplay with blockchain being accessible to consumers, so they can
participate, is also extremely exciting.
So from AGI, let's return to narrow AI and OpenMined.
OpenMined is, I think, one of the first systems
that uses blockchain in order to do something interesting,
in order to deliver a capability that is new. The way I understand it is:
generally with traditional AI and machine learning, the way the industry works is there's a company
in the center that collects data from a lot of users.
It creates a huge data set.
And then this data set is used to train models inside the company.
And once the model is trained, it creates some kind of application or API that others can access,
and it monetizes basically the access to that model in some way.
Now, OpenMined is about flipping this architecture.
So tell us what OpenMined would allow.
How would models be built and monetized on OpenMined?
That was a brilliantly articulated description of how the machine learning industry works
right now. And we do live in a world where this is structurally the case: all of
our machine learning and AI tools assume that you've collected all the data into one place, so there's
kind of not another option, right? And even from a business model standpoint, the machine
learning business model is dependent on this idea, because that's how all the resources and all the
tools and almost even just the ideas in the VCs' heads kind of sit.
And the story for how open mind participates in this really starts with how much tools can define what people actually spend their time doing.
Because ultimately, what we are is we're a community and we're a set of open source tools, right?
And so what we would like to see is people pick up tools like federated learning and really change that dynamic.
So federated learning in particular, right?
This is a piece of technology that was developed internally at Microsoft and Google.
to allow you to train machine learning models on data that's not centralized.
And they originally started working on this in the context of smartphones.
So from a networking standpoint,
you didn't want to actually have to pull data from smartphones
to be able to update the model that recommends the next word for when you're texting.
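The federated learning loop he's describing can be sketched in a few lines of plain Python: the model is shipped to each client, trained locally on private data, and only the resulting weights come back to be averaged. This is a minimal single-parameter sketch, not Google's or OpenMined's actual implementation:

```python
# Minimal federated averaging for a 1-D linear model y ~ w * x.
# Raw data never leaves a client; only trained weights are shared.

def local_train(w, data, lr=0.01, steps=100):
    """One client's update: gradient descent on its private data only."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Each client's private dataset (true relationship: y = 3x, plus noise).
clients = [
    [(1.0, 3.1), (2.0, 5.9)],
    [(3.0, 9.2), (0.5, 1.4)],
    [(4.0, 12.1), (2.5, 7.4)],
]

w_global = 0.0
for _ in range(10):
    updates = [local_train(w_global, data) for data in clients]
    w_global = sum(updates) / len(updates)  # federated averaging

print(round(w_global, 2))  # close to 3.0
```

In the smartphone setting, each element of `clients` would live on a different phone, and only the averaged weights would ever exist on the server.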
And so what we would like to do is be able to make that kind of federated learning technology
widely accessible, right?
So that instead of, if I want to
train a machine learning model right now, I would go to, you know, a VC and say,
hey, I think that I can train a model on Fitbit data that will use your heart rate to predict,
you know, how well you're going to sleep that night, or some
lifestyle goal you're trying to optimize for. Then, once you convince that
VC, they'd write a big check. You would buy the data, you would clean it,
train your model, and then sell the use of that application. This is all contingent on whether
or not you can convince the VC that this model is going to make money. You can kind of forget the
idea of doing it just for social good, right? Where OpenMined changes that story is, if we execute
on the vision that we've laid out: instead of going to a VC and going through this whole rigmarole,
without even getting out of your pajamas, you'll just walk over to your laptop, sit down,
open up your Jupyter notebook, and train a model on private data. It will be
readily accessible, there will be no privacy trade-off as an externality,
and your everyday machine learning developer will have access to these kinds
of datasets, not just for commercial use, but also for building models for social good.
And, you know, the name OpenMined is really inspired by open source, to the extent that
the open source community took something that was built in industry and
only used for making money, and instead allowed large groups of individuals to collaborate
on it in such a way that you can also spend time doing things for social good. And in the machine
learning space, that means access to compute and access to data with tools like federated
learning. So that effectively means that if you're the engineer building the model for Fitbit data,
so let's assume that these Fitbit devices, they collect heart rates and sleep patterns. So you're the
engineer building the model and I am a user of this Fitbit and my Fitbit collects my data,
then you can essentially create a model and I can help train your model with my data
in such a way that my data always stays with me and never needs to be shipped to you.
Correct.
So you get a better model and I get compensated in some form for the service of training your model.
Exactly.
And there's no privacy implication for me.
So yes, there are always privacy implications from a cryptographic security standpoint.
You know, there was a new vulnerability against almost every operating system on the planet that came out three days ago.
But barring events like that, the system is designed to mitigate the need for machine learning engineers to actually see the data points that they're training on.
And that's net new, and making that exist
has really, really nice externalities.
This kind of seems amazing from a, you know, from a crypto mindset, right?
Like crypto people are generally very, very privacy conscious.
And we have people in our community that don't participate in Facebook,
don't have Twitter accounts, and probably encrypt all of the communications
they send out.
They're very, very wary of these large companies collecting all that data.
And somehow there's an unspoken challenge that if the whole world were to be
paranoid about data like that, we just wouldn't build any good machine learning models,
right?
So as the crypto community, we want to make the whole world paranoid about their data.
But then in the current framework of how things happen, if we do succeed as the crypto community, then machine learning in today's paradigm might have a hard time.
So in some sense, OpenMined is the solution, or something in the direction of a solution, where the crypto community's fantasy can be satisfied, while the AI community is still able to build the models that we need in order
to basically upgrade our civilization.
Yeah, absolutely, absolutely.
And really the best part is that like those tools have been invented.
That's like that's sort of the good news, right?
The good news is that it's not like this is some mysterious hocus pocus thing in the future.
These things exist.
Like there are conferences that have been publishing about them for years.
And as computation has gotten faster, there are many of them that are ready for prime time.
we're just not using them.
Like people don't know that that tradeoff is a false choice.
And that's like the real tragedy.
And hopefully we can mitigate that.
So yeah, I'm super excited about that.
There's always like a discussion centered around principles, like we as a crypto community,
like in principle don't like the idea of big corporates collecting all our data.
But then capitalism has its own rhythm, right?
So new technology or a new system will be adopted because it enables new ways of making money that weren't possible before.
That's the logic of capitalism, right?
Principles be damned.
Show us a way to make money.
Can I quote you on that?
Like speaking for the voice of capitalism?
Yeah, definitely, definitely.
So let's assume, yeah, that this is possible, that you can train models to do various things
without compromising the privacy of data.
What kind of applications do you think will be enabled by it
that just aren't there today, and with which people can...
Oh, man.
Companies can make money.
This is my favorite question.
There are so many amazing things that we could be building
if there wasn't this privacy tradeoff, right?
I mean, the first thing that always comes to mind for me is that private data protects
the things in our lives that are personal, right?
Things that are personal and things that we tend to try to hide are also some of our
greatest pain points and some of our greatest weaknesses and some of our greatest vulnerabilities,
right? And so when you open up the potential for people to innovate in that space
without the privacy tradeoff, without the threat that it
takes advantage of the people that are participating,
the types of products and services that become
possible to build are astounding, and they give me goosebumps whenever I talk about them.
Like if you wanted to train a machine learning model that was going to help predict when someone
was going to have some sort of mental disorder or breakdown or depression or the extreme
versions of depression like self-harm. Like these are the things that no one tells anyone, right?
And being able to help predict that in advance and intervene. And not intervene in a bad way.
I mean, intervene as in, like, reach out and be like, hey, are you okay, right?
Or so people even know themselves: hey, you're walking down a road that's likely going to lead to self-harm or to despair.
Like, if you wanted to try to predict that ahead of time and give someone an app that's like, hey, predict when you're headed to a bad place,
you would need to have access to unbelievably private data, right, under the current scheme.
Like, in order for a company to aggregate that, it would be insane.
Like, you would never, ever want to trust a company to aggregate that
detailed an amount of information. That's like, you know, home transcripts or
something. It's spooky stuff, right? But if instead, with federated learning,
you're taking the machine learning model to that person and you're letting them train it, right?
And you don't have this privacy tradeoff. It could be possible for us as a community to work
together to train models that actually help predict and mitigate some of these,
some of the most private and personal aspects of our lives, right? That's the hope. That's the whole thing.
That's the whole thing, right?
We're severely limited in our ability to help people with their most personal and most intimate problems
because we don't know how to do it.
We seem to not know how to do it without aggregating data and putting those people,
the very people we're trying to help at risk.
And the tech OpenMined is building is about getting rid of that and showing another way.
Cool.
That's very articulate and good.
And clearly you're passionate about this point.
And I want to understand one aspect here better.
So that argument makes sense to me, right?
So I have a bunch of private data.
I mean, I would like to learn more, maybe have it be used in some way.
And now it may be possible to do that.
It will be possible to do that with something like OpenMined.
But if you look at this overall transformation of how the industry works that,
you know, Meher and you described, right, where we have this shift
from data and algorithms in a silo controlled by a huge company, to data being
commonly available but encrypted, and algorithms being improved as almost a public good.
So I see that one driver could be individuals saying, oh, privacy, my privacy is better protected
in that world.
But do you think there are like big economic drivers that are also pushing in that direction?
Or is it kind of a conflict, right, where the economic driver is more like reinforcing the centralized
road, and we have a privacy need pulling in the other direction?
Or can you see how you see these forces playing out against each other or supporting each other?
This is where OpenMined becomes real and becomes something more than just, hey, wouldn't it be nice?
Because in this particular configuration, it is more efficient and less costly
to do the same thing that aggregating data does.
And here's why I mean that.
If you have an organization, an entity, anything,
that has to act as a middleman in the creation of value, right?
So data is extracted.
You can think of that like being oil pulled from the ground, right?
And then you have this middleman that cleans it and curates it
and gets it ready for production, right?
And then it gets distributed.
Federated learning from a tech standpoint,
eliminates the need for a middleman
by allowing machine learning specialists
and data owners to interact directly.
So from an economic standpoint,
I mean, depending on how
the computation shakes out,
that's just more efficient.
Data and machine learning practitioners together
can create a certain amount of value.
And right now they're doing it with the middleman.
if you offer a technological ability to not do that,
you have a bigger margin.
And if you have a bigger margin,
it means you can either undercut on cost or grow faster.
Now, we at OpenMined are not a company.
We're not even a nonprofit.
We're a community that's trying to make these tools as accessible as possible.
Like this is a movement that's going to happen.
And we're just trying to be there to help facilitate it.
And to make these tools as accessible as possible
because we believe in the social consequences of this movement.
And, yeah, 10 to 20 years from now,
I hope that people refer to OpenMined like they do open source,
to the extent that it's a classification of tools
that allow for the shared training and ownership of models
based on distributed private assets, namely data.
Yeah, I mean, one thing I can also see
is just the aspect of having this very fine-grained
economic layer to all of this, right, so that you can literally pay people a little bit of
money for providing their data and have very fine-grained mechanisms for improving models.
And I think that's maybe theoretically possible for a centralized company to do, but much harder.
And I just don't think it's naturally how, like with blockchain, you have to build it like
that, right?
Like, this is how it's going to work.
With a centralized company, it may be possible to try to build it like that, but it's not
the natural way it would be built, and thus I don't think it's going to happen in that context.
Yeah, no, I agree.
So let's drill down into what OpenMined is, right?
So what is it?
Is it a protocol?
Is it a system? Describe to us the constituent parts of it.
So it's an ecosystem and it's a community.
So I am the first to say that the most valuable piece of OpenMined is the 2,000 people in our Slack team.
And like this next weekend, there's going to be a hackathon where people in 21 different cities actually go to a cafe and meet together, right?
That's the part that really matters.
It's the awareness and making these tools accessible.
But like so, but you're asking about the software.
So the actual software itself is an ecosystem of
libraries. The main one, if you go to our GitHub repository, you'll see is actually packaged
inside of the Unity game engine. I hate to use the word video game, but it's a piece of software
built with Unity. And we'll just call that a mine, right?
And that is a piece of software that is designed to hold an individual's data and protect it
while allowing them to be able to train machine learning models.
When delivered, you should be able to go to, say, the Xbox gaming store, download a video game,
and have clear instructions on your television or on your phone or wherever you're downloading it to,
to load in your personal data.
And on your behalf, it will earn a passive income stream.
And it's responsible for downloading models from the blockchain,
training them locally, and sending updates to that model back out to the blockchain.
So if you're picturing it in your head, it's kind of this sort of graph, right?
Where at the middle sits the blockchain and all these edges go out to individuals who then, with their private data, send gradients back up to the head node, which are then distributed back to whoever's training the model.
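The graph Andrew sketches, a coordinator in the middle with mines at the edges sending back gradient updates, can be made concrete with a small runnable toy. Everything here (`ToyChain`, `mine_step`, the single-parameter model) is an illustrative stand-in under simplifying assumptions, not OpenMined's actual smart-contract interface:

```python
# Runnable toy of the mine/blockchain graph: a coordinator ("chain") holds the
# model, mines train on private local data, and only weight updates flow back.
# All names are illustrative stand-ins, not OpenMined's actual interfaces.

class ToyChain:
    """Stands in for the smart-contract layer that hosts model state."""
    def __init__(self, w=0.0):
        self.w = w
        self.pending = []

    def download_model(self):
        return self.w

    def submit_update(self, delta):
        self.pending.append(delta)

    def aggregate(self):
        # Apply the average of submitted updates, then clear the queue.
        self.w += sum(self.pending) / len(self.pending)
        self.pending.clear()

def mine_step(chain, private_data, lr=0.05):
    w = chain.download_model()
    # Gradient of mean squared error for y ~ w * x on local data only.
    grad = sum(2 * (w * x - y) * x for x, y in private_data) / len(private_data)
    chain.submit_update(-lr * grad)  # the raw (x, y) pairs never leave

chain = ToyChain()
mines = [[(1.0, 2.0)], [(2.0, 4.2)], [(3.0, 5.8)]]  # private datasets, y ~ 2x
for _ in range(200):
    for data in mines:
        mine_step(chain, data)
    chain.aggregate()
print(round(chain.w, 1))  # close to 2.0
```

The real system replaces `ToyChain` with smart contracts and adds payment for the submitted updates, but the flow of information is the same: models out to the edges, gradients back to the head node.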
Technically, what are the pieces?
Like, what are the different technologies that you're using for it?
Yeah, so there's a deep learning library called Syft.
And the reason we call it Syft is it's like sifting through sand: you're keeping the intelligence that you want, but you're leaving the data behind.
And then there's a blockchain smart contract piece.
It's called Sonar, inspired by sonar, your ability to glean information about something far away without actually having access to it.
And packaged inside of Syft are also the encryption pieces; that's actually part of Syft.
So it's a homomorphically encrypted and multi-party computation deep learning library.
You can think of it as structured a lot like PyTorch or Keras. And actually, we have interfaces now where... so PyTorch and Keras, if you're not familiar, are two of the main deep learning frameworks out there, alongside TensorFlow. And the way that we're setting up our Syft library is that if you use PyTorch or Keras, you have an interface that is indistinguishable, down to the method calls. It's exactly the same, but it can be attached to very large private data repositories in the back end.
So, yeah, deep learning library, smart contracts system.
And I guess the only other piece that's worth mentioning is one we haven't released yet, but it's coming out in the next few weeks: OpenGrid, which is basically the distributed compute network that this will train on. So if you want to use the GPU that's sitting in your game console, your PlayStation or your Xbox, or you have some other Nvidia GPU or something like that, and you want to earn an income passively by training deep learning models, you'll be able to do so. And the whole network will sit on top of that compute layer.
But the monetization. I mean, let's say now I want to do that: I want to provide my computational capacity to this network, to contribute to better models, or give my data or something like that. There's no monetization built into this software yet, right? There's no token to pay people. Do you see that as a layer that would be developed on top? Or do you see different people developing different approaches to solving this problem on top of this basic set of tools that you guys are developing?
Perhaps.
The interesting thing is that we're trying to build technology that allows a marketplace to happen.
And if you want to have a healthy marketplace, it's a lot easier if anyone can bring their own currency.
Right? So we've been kind of attached to this idea that people should be able to trade with whatever they want to trade with. Tokenization for us looks a little bit more interesting when you look at shared ownership of the model afterwards. That's kind of interesting. But even then, that ownership is actually enforced through the cryptography, like the multi-party computation shares, as opposed to a discrete token. So tokenization is a much more difficult conversation in this sense, because first off, whatever we build has to be not just useful, but also more accessible than the next best thing, right? We're all about making private computation and federated learning as accessible as possible. And so when we look at a token model, it's like: okay, we could allow anyone to bring in any token that they want, or we could add an artificial barrier to entry by having our own. We're still evaluating that. But the consensus right now is that we want to make software so that you can show up and hook it up to your Ethereum account, or to Cosmos if you want to use atoms or something like that. You should be able to trade with whatever you want to trade with.
Does that make sense? Yeah. And so are you guys then building a kind of decentralized exchange where you can allow this marketplace to occur? What kind of infrastructure are you building for that actual marketplace and those actual economic transactions?
Yeah. So the software that could support that kind of a marketplace or an exchange is what we're most concerned with. And actually making that exchange occur efficiently is the main problem that we work on day to day. It's all about: okay, if you have a data set and I have a model, and I send you the model and you're sending me incremental updates, how do we do that as efficiently as possible, without executing too many transactions and losing money on gas, and without doing the cryptography wrong and making the gradient aggregation process really, really expensive, homomorphic encryption and multi-party computation being inherently expensive algorithms? The main problem we focus on is making that exchange of value as efficient as possible, which is honestly a very academic pursuit. And that's kind of the main theme: if you hang out in the Slack team, you'll hear us talking about that most of the time.
So the basic idea is obviously: you want to build a model, and there are a hundred of us, Brian and me among them, who have data that's relevant to training your model. So you send us the model. I train the model on my side and ship the update. Then Brian downloads the updated model, trains it, ships the update. Then person number 37 downloads the model, trains it, ships the update. And slowly, as we keep training the model, each of us individually on our side, the model keeps getting better and better. And we are all compensated for basically training the model, in ether or atoms or whatever the currency underlying the system is.
So the compensation mechanism can be a blockchain token, like a standard token.
Yeah.
Or a stablecoin. Because the biggest threat of a token for us is that the volatility is so high that no one actually wants to use the marketplace that we build for what we built it for, right?
I mean, I was extremely excited that one of the ways you introduced the project a few minutes ago
was like one of the first projects that's trying to do something real and new with the blockchain.
Like our community cares a lot about this thing being used for exactly what we built it for.
And for you to put your bounty for training a machine learning model on the blockchain and have it go up and down in value 10% per day is unacceptable, right? No one's going to train a million-dollar healthcare model that way. We need this to be used by real people trying to create real value. And so if someone comes along and solves the stablecoin issue, or one of those other innovations, it's not within our scope; we don't have the time or resources to try to play Federal Reserve with our own cryptocurrency to create that kind of stable market. We kind of need that to be solved elsewhere, to be honest. But that's sort of where we're coming from. And I know most other projects are ICO first and write software later; we're totally inverted from that standpoint, to the extent that we've raised no funding, but still 110 people have contributed code, 2,000 people are in the Slack, and our hackathon next week will be all over the world. People are really, really engaged and involved. We really want it to be used for what we're building it for, and that's kind of how we've set up our priorities.
One thing that I don't understand: in all of these blockchain systems, how to partition a reward among multiple different agents for a service they deliver is an outstanding problem.
Like if you look at Bitcoin, there's a bunch of miners.
Each of them is an economically distinct entity.
And then you have the reward, which is the block reward, and you want to distribute this block reward to these entities in some proportion to the value contributed by them.
So in the case of Bitcoin, it turns out that's pretty easy.
Like you can measure the amount of work contributed by each of these miners and therefore you can partition this reward pretty efficiently.
Now, in your case, you have a central company that wants to train a model, and it wants to distribute a reward to all of these different agents in order to get the model trained. And then you have these different agents, like me and Brian, that will contribute to training this model, and our contributions might have different value to the company, right? So maybe when the model is entirely new, my data set might be very similar to Brian's, but I trained it first, so that first bit of training is very useful. But then a hundred other people trained it, and Brian was the hundred and first, and Brian's training is less valuable. So how do you decide how much to give to each agent that trained?
Yeah, so we're super excited about our solution to this problem.
And the way that we look at it is directly inspired from machine learning research.
So whenever you're going to train a model, the way that you objectively evaluate how accurate it is is using a validation set. That's also the best way to specify what kind of model you want trained, right? If I just say "a sentiment model," I could mean, like we talked about beforehand, a sentiment model for hospital patients, or a sentiment model for movie reviews, or a sentiment model for product reviews. All these things are domain-specific, right? So the best possible way for someone to specify "hey, I want a model that knows how to do this" is to provide a small validation data set. That also gives us the ability to objectively evaluate how valuable each gradient, or each set of gradients, is to the incremental increase in intelligence of that model.
So the most expensive version you could think of works like this: you send in a gradient, we update the model, run the validation set over that model, and now we see that it's 1% smarter, right? Or it's 1% closer to the target goal, and thus 1% of the bounty is allocated to you. Now, obviously, what we're working on are efficient approximations of this, so that you don't have to run the validation set after every gradient, and that's what a lot of our innovation is based on. But as a high-level starting point, that's more or less what it's based on. So incremental increases in the intelligence, as measured by a validation data set, allow me to know: oop, his data's crap; oop, his data's really good, I'm going to pay him more, right, without me ever actually seeing the data.
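To make the bounty idea concrete, here is a minimal Python sketch of the "expensive version": each submitted update is applied provisionally, scored against the buyer's validation set, and paid in proportion to how much closer it moves the model to the target score, while updates that don't improve the model earn nothing. Every name, the toy `evaluate` function, and all numbers are hypothetical illustrations, not OpenMined's actual scheme.

```python
# Sketch of proportional bounty allocation driven by a validation score.
# Updates that fail to improve the model are rejected and unpaid.

def allocate_bounty(base_score, target_score, bounty, submissions, evaluate):
    """submissions: list of (contributor, update); evaluate(updates) -> score."""
    accepted, payouts = [], {}
    score = base_score
    for who, update in submissions:
        new_score = evaluate(accepted + [update])
        if new_score <= score:      # no improvement: reject this gradient
            continue
        gain = new_score - score
        share = bounty * gain / (target_score - base_score)
        payouts[who] = payouts.get(who, 0.0) + share
        accepted.append(update)
        score = new_score
    return payouts, score

# Toy stand-in for a model: its validation score is just a capped sum of
# update sizes, enough to show the accounting.
def evaluate(updates):
    return min(1.0, 0.5 + sum(updates))

payouts, final = allocate_bounty(
    base_score=0.5, target_score=1.0, bounty=100.0,
    submissions=[("alice", 0.3), ("bob", -0.1), ("carol", 0.2)],
    evaluate=evaluate)
print(payouts, final)  # alice and carol split the bounty; bob earns nothing
```

The efficient approximations Andrew mentions would replace the "re-run validation after every gradient" loop, but the payout logic stays proportional to measured improvement.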
The other piece you mentioned, about showing up earlier or showing up later, is solved using a slightly different trick. So on the one hand, we actually do want to incentivize people to be first, because that's what helps the model get trained fast enough. But because these models tend to follow a pretty, you know, downward-sloping curve, you can adjust using a function that offsets it: the first 50% of progress receives this much bounty, the next 30% receives this much, the next 10% receives this much. And you can do that in a continuous way, so that you can control how much you want to incentivize the first guy. Because anyone who's trained a machine learning model knows the last 5% is where you spend most of your work.
Okay, so you can kind of adjust it, and you can basically overpay the later improvements.
Exactly, exactly.
And is that something that you guys would decide, or would each...?
Each person submitting a model can decide that.
Okay, okay.
So depending on how much they care.
Like if they just really need a model right now
and they want it to train really quickly,
well, then I imagine they would skew their rewards
more towards the beginning of training.
But if they're looking for the state-of-the-art model,
then they would reward heavily for the last mile, right?
And they would really try to get everyone to eke out that last 1%
and incentivize them to participate.
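The schedule the two of them just described can be sketched with a single cumulative payout curve: the model builder picks a shape, and the bounty earned for any slice of training progress is the difference of the curve at the endpoints. The function name and the power-law curve below are illustrative assumptions, not OpenMined's actual pricing function.

```python
# Sketch of a continuous bounty schedule over training progress in [0, 1].
# skew > 1 pays more for late progress (the hard last mile);
# skew < 1 pays more for early progress (get trained fast).

def bounty_share(start, end, skew):
    """Fraction of the bounty earned for moving training progress from
    `start` to `end`, under the cumulative curve f(p) = p ** skew."""
    return end ** skew - start ** skew

bounty = 1000.0
late_heavy = 3.0  # a builder chasing state-of-the-art skews toward the end
print(bounty * bounty_share(0.0, 0.5, late_heavy))   # first half of progress
print(bounty * bounty_share(0.95, 1.0, late_heavy))  # final 5% of progress
```

With `skew = 3.0`, the final 5% of progress earns more than the entire first half, which is the "reward heavily for the last mile" behavior; a skew below 1 flips that toward early contributors.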
So this sort of assumes that there is a validation data set. Let's take that Fitbit example again: you have my heart rate during the day, and using that heart rate, or whatever biological data, you're going to predict how well I'm going to sleep at night, right?
Or something like that.
Yeah.
So if you want to make a model like that, then you need to have a data set that is very carefully curated: a set of observations that you have collected in the field showing that, hey, when people have had this heart rate during the day, they slept well this percentage of the time. And that highly curated data set becomes your validation set. And then every time somebody ships an update to the model, you're going to take that curated data set, run it across the new model that comes in, and measure the performance of that model really well. Right?
Yep.
So is it possible that a process like this can be attacked by creating some kind of biases that you aren't optimizing for?
Like, for example, like, you're training your model to be very good at recognizing X,
but I am the person that is training your model.
You don't know me because that's the advantage of the system.
You don't know who I am.
You don't know my data, but I'm training your model.
Can I force some kind of biases to emerge in your model, such that, because those biases are not obvious, you did end up paying me, but actually I have done a disservice to your model because I have introduced, let's say, some kind of racial bias? Maybe the bias is that for some particular group, their heart rates and their sleep patterns are correlated in a particular way, but that's not true across other groups,
and I as a trainer of that data
present more and more data
for one particular racial category
to the detriment that
your model performs really well for that one category
but then there are other categories
where it just doesn't
and I'm doing that
in order to screw with your model basically
A few things. First, this is an active, high-priority topic in the machine learning community, period. I mean, this is something that the machine learning community is actively talking about. It's on Twitter, it's at conferences, it's all over the place: how do we have unbiased models? On the one hand, our system
might add a little extra difficulty to that because you can't see the data. But on the other
hand, because it's a free market for gradients, we've got some really good defenses against it
because there's such a wide variety of sources. So even if one guy showed up with a data set
that was skewed, unless he's responsible for the vast majority of the data, there's going to be
other people that are creating a more robust distribution, right? So it's a lot easier to accidentally
have bias if you're pulling data just from one distribution. But if you're opening up the accessibility
and making it possible to pull from lots of different distributions, it becomes a lot more
difficult to hijack the model. Now, that all being said, what are we doing to help mitigate this particular problem? So the first one is: not all gradients get accepted. In the current plan for how we roll up, it's only gradients that contribute to what ends up being the highest-quality model. So that creates an added challenge for someone trying to add bias to the system, because they don't just need to contribute gradients that are marginally good but also include their bias; they have to contribute gradients that end up participating in the highest-quality model, right? So you can think about having multiple different models being trained, and only the top one will actually get pushed out to the ecosystem. An attacker has to know which one to pick. It creates something like GitHub forks: if you're familiar with GitHub, you create a code repository, and then there's a fork off with a little improvement, and then another fork off with a little improvement. That's sort of how these models end up getting trained. So it becomes significantly more difficult. It's not just one model where all the gradients get aggregated and you get paid for whatever your contributions are. Even if you have good data, you have to participate in the best model to get included at all, and your level of participation is going to be regulated based on how much accuracy you add to the model. So there are already some very significant hurdles that you have to cross. And it's even more difficult in the OpenMined ecosystem because we have such a wide distribution of data. You can provide your piece of it, but unless you can convince the system that you should be the dominant source of gradients, which you can only do if you're also contributing good data, it becomes very difficult. So this is still an active research question, and it's an extremely high priority in the machine learning community, but we like to think that we sit at a pretty strong advantage by allowing data to come from multiple different sources and by not accepting just any gradients, but only gradients that contribute to the best model.
Cool. So what's the current state of the
OpenMined project? How far have you gotten? What are the main things coming up?
Yeah, so we opened our GitHub repositories in July of last year, so we're almost six months old. We launched our first smart contract on a test chain in, I think it was, November, that you could do basic federated learning with. And I guess in the last six months, our community has grown to about 110 code contributors and about 2,000 people on our Slack channel.
The next things to be released are coming in the next few months. We just finished our initial deep learning library interfaces for PyTorch, for Keras, and our standard autograd system. So basically you can chain together any sequence of machine learning computations and it will automatically backpropagate and update.
So the deep learning library is more or less in alpha. We do have a smart contract up on the blockchain. That being said, we're working hard on performance improvements, so I wouldn't necessarily say that it's ready for general availability or prime time.
It's on a test chain.
The thing that we're working hardest on now, and the next thing that we're going to release, is the compute grid. This whole system has to live on a pretty significant amount of compute in order to do distributed machine learning. So we're looking to release that part in alpha in the next two to three months. We'll have it internally before then.
You said you guys have something running on a test chain. So in the end, is the idea that you will run your own blockchain, or run on top of another blockchain? Do you have an idea of what direction that's going to go in?
Yeah, we've got a short list.
So the short list is Ethereum, Tendermint, and then this really cool project called Trillion.
And it might end up being a combination of several.
Once again, we're trying to make this as accessible as possible.
For us, just being on one chain might not really be enough. We've also been approached by lots of enterprises who are interested in doing this on their big data warehouses. And even though we're very excited about consumer data, if enterprises want to be able to protect their data and not send it around as much, but still extract the same value from it, I think that's pretty awesome. So the answer is we'll probably do multiple.
And hopefully, once again, my hope is that OpenMined will be referred to like open source: it's more a community and a class of algorithms than any one particular chain or one particular protocol.
I've been a quiet onlooker on your Slack community.
It's a huge community.
I actually love the way you manage the community.
Like, every day you are commending members of your community for doing something. I get the impression you are a great manager, although I haven't ever worked with OpenMined myself.
And you have this hackathon coming along.
Like, tell us some of what are the interesting projects that are being done in your community,
like actual applications that people are building?
So unfortunately, the most interesting ones are being done by people who kind of want to keep it to themselves for the moment.
So I won't spoil their surprise.
I think that the most interesting one that's public is based on reinforcement learning.
So the Unity machine learning team reached out a few weeks ago,
and they've been particularly excited about us building a deep learning engine that actually sits inside of Unity.
And so they actually commented on one of our GitHub issues you'll find.
And I'll just share what they said.
They've got this really cool project called ML agents.
You might have seen it as in the top of Hacker News not too long ago.
And it's basically the ability to train TensorFlow-based machine learning agents inside of Unity worlds.
So like you can create a world inside of Unity that an agent can interact with, you know, it might be balancing something or might be running.
You've probably seen some of the GIFs running around.
And so Unity has been working on their own project for that.
But in order to do that, to get it to work with TensorFlow, they literally render an image inside of Unity and then shoot it out via a socket to TensorFlow, which then makes a prediction. Then they take that prediction and shoot it back into Unity and iterate the game to the next frame.
And they're not unique in the way they do this. Everyone who does reinforcement learning kind of does it this way, right? There's one rendering engine that sits over here, and then there's whatever deep learning engine, PyTorch, Keras, whatever it is, sitting over there.
So I think the most interesting project that's happening right now is the potential for the machine learning engine to live inside the world, on the same GPUs, in the same namespace, where the RL worlds exist, right? Because then you can do all sorts of crazy stuff. I mean, you can use machine learning models to help with the rendering process, or to make the world different. You could have one AI that makes the world and another AI that runs around in it, right? And also, you get better latency and better performance, which allows you to do new stuff. So one of the most interesting projects going on in the community is rebuilding a lot of their reinforcement learning demos, but with our deep learning engine, based in the Unity game engine, actually being used as the backend. And there are some really cool demos; I highly recommend checking them out. Those Unity guys, they really know how to make Unity look really good, and they know how to build really cool games.
Cool. Well, so if people want to get involved in OpenMined or learn more about it, what's the best way? What should they check out, and how can they participate?
Yeah, so join the Slack and say hi in general discussion. There's a quick start guide that walks you through who we are, what we do, what the community is about, and what we're trying to build. We've got around 200 GitHub issues that are labeled "good first issue," and those are designed to be the first code contribution that you make.
So each issue actually has a tutorial on how to set up your dev environment, how to get Unity installed (maybe you've never built a video game before), and how to actually implement the piece of functionality that you're trying to implement. So it's literally a tutorial for how to contribute code to the system. We try to make that barrier really low. And once you have your first pull request merged, you become an official contributor on OpenMined: your face appears on our home page and you are a member of our GitHub organization. That's kind of the formal process for becoming a member of the GitHub org.
So yeah, come hang out.
It's super fun.
And we've got this really cool team map, which I can send you a link for. It's actually my Twitter cover photo as well; it has basically every person all over the world, with a little dot showing where everyone lives. And it's in, well, not every time zone, but a lot of time zones. So when you hang out in Slack, there's always someone online to talk to and work on really cool stuff with.
And you mentioned that you guys are organizing some kind of hackathon.
When is that happening?
Yeah, a week from Saturday.
So we have hackathons all the time.
We usually have them on Saturdays or Sundays.
And they're usually an in-person component and also a virtual component.
So we'll have a Google hangout that'll have lots of people in it.
But this one, on Saturday the 13th, I believe is going to have 21 physical locations in 21 different cities around the world.
So there's probably one near you.
There's like Florida, Toronto, California, Turkey, Austria, London, obviously, because I'll be there.
But yeah, come to the hackathon, and if you can't come in person, tune in and hang out, and we'll get to know each other.
Cool.
Well, Andrew, thanks so much. It was a pleasure learning about this super awesome project. And I do agree with Meher here: even though I don't have as deep an insight, this seems to be a very vibrant, well-run, and growing community, so I hope we're going to see lots of exciting news coming out of this project.
Me too. Thanks for having me.
So thanks so much, Andrew. And of course, thanks so much to you, our listener, for once again tuning in. If you want to support the show, you can do so by leaving us an iTunes review or sending us a tip as well. Otherwise, we look forward to being back next week. Thanks so much and we'll see you then.
