Orchestrate all the Things - DeepMind wants to reconcile Deep Learning and classical computer science algorithms with Neural Algorithmic Reasoning. Featuring DeepMind's Petar Veličković and Charles Blundell, MILA's Andreea Deac
Episode Date: September 10, 2021. Will Deep Learning really be able to do everything? We don't really know. But if it's going to, it will have to assimilate how classical computer science algorithms work. This is what DeepMind is working on, and its success is important to the eventual uptake of neural networks in wider commercial applications. This work goes by the name of Neural Algorithmic Reasoning. Join us as we discuss roots and first principles, the defining characteristics, similarities and differences of algorithms and Deep Learning models with the people who came up with this. We also cover the details of how Neural Algorithmic Reasoning works, as well as future directions and applications in areas such as pathfinding for Google Maps. Could this be the one algorithm to rule them all? Article published on VentureBeat. Image: Getty
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Will deep learning really be able to do everything?
We don't really know.
But if it's going to, it will have to assimilate
how classical computer science algorithms work.
This is what DeepMind is working on
and its success is important to the eventual uptake
of neural networks in wider commercial applications.
This work goes by the name of Neural Algorithmic Reasoning.
Join us as we discuss roots and first principles,
the defining characteristics, similarities and differences
of algorithms and deep learning models
with the people who came up with this.
We also cover the details of how neural algorithmic reasoning works,
as well as future directions and applications
in areas such as pathfinding for Google Maps.
Could this be the one algorithm to rule them all?
I hope you enjoy the podcast. If you like my work, you can follow
Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
Thanks, everyone, for making the time.
The occasion is basically some work that you recently published
on neural algorithmic reasoning,
which sounds pretty intimidating in a way.
So the idea here is to try and break it down
and make it a little bit more approachable, let's say.
And what caught my attention and the main idea, I guess, is, well, you're trying to
make neural networks and deep learning kind of emulate, you know, usual algorithms, let's
say.
And that seems like a pretty unusual and also pretty ambitious idea.
And I guess people will be interested in that.
And that's why I thought, OK, let's see what this is all about.
So let's start by introductions, basically.
So if you want to just take your turn, say a few words about yourselves
and how you actually all got to work together on this,
which is, I guess, another interesting aspect of this story.
Sounds good. So I guess I'll start.
My name is Petar Veličković. I'm a senior research scientist here at DeepMind.
And I was in many ways, at least within DeepMind, the person who kind of kickstarted the algorithmic reasoning direction alongside Charles. And it came basically from my
background, so to speak. I come from a theoretical computer science education
from my undergrad, and I've participated a lot in competitive programming contests.
Basically, my way into computer science was through algorithmic reasoning and algorithmic thinking in terms of classical algorithms. And since I started doing deep learning research, I really wanted to try to reconcile that with, you know, the classical algorithms that initially got me excited about the field. And I saw what looks like a very strong, even complementarity between the two of them: what one of these methods
tends to do really well, the other one doesn't do that well, and vice versa. And usually, when you
see these kinds of patterns, it's usually a good indicator that if you can do anything to bring
them a little bit closer together, you know, then you could end up with a very awesome way to fuse
the best of both worlds and and make some really strong advances.
So I guess that's a good summary of how I got into this.
I'll hand it over to Charles.
Yeah.
Hey, so I'm Charles Blundell.
I'm a research lead here at DeepMind.
And I'm generally interested in getting neural networks to make much better use of the really large
quantities of data they're
exposed to. There's a few things that I think are quite interesting there. So, you know, getting a
network to tell us what it doesn't know, get it to learn much quicker, or, you know, to use this
vast data to do something more than you might expect that it's able to do. And this latter case, I think, is where neural algorithmic reasoning comes in.
And I guess when Petar joined DeepMind, it was, you know,
our early conversations were a lot of fun because we have a very similar background.
A while ago, I also had a background in theoretical computer science.
And I think, you know, a big question, like a fundamental question in machine learning
for a long time has been how do, you know, how do you generalize? What is a notion of generalization?
How do you work beyond the data examples you've seen? And algorithms are a really good example of
something we use, all use every day to do this. And in fact, there aren't many algorithms out
there. You know, if you look at the standard computer science textbook, there's maybe 50 or 60 algorithms
that you learn as an undergraduate. And everything we're using right now to have this call is
using just a subset of those. And so there's like this very nice basis for very rich computation
that we already know about. But it's completely disjoint from the things we're learning.
And so I think when Petar and I started talking about this,
we saw clearly there's a nice fusion that we can make here
between these two fields that has actually been underexplored so far.
So yeah, that's kind of my interest in that.
Maybe Andreea, you should say something.
Yeah, I have a bit of a different beginning.
I also did theoretical computer science.
I'm currently a PhD student at the University of Montreal, at Mila, and I'm doing an internship
at DeepMind.
I got interested in general graph representation learning through the lens of drug discovery and then discovered some things that we could
generally improve if we combine some sort of inductive biases we know of.
Algorithms are things we've studied, and how do we combine this with general reinforcement learning, or
just general neural networks? I think that's what got me excited lately.
Okay, cool. Thanks. And I have to say that Petar specifically mentioned some of your previous work,
which I wasn't aware of, to be honest, as kind of kickstarting in a way what
he and Charles started working on for neural algorithmic reasoning. So I guess we can get back
to that and you can talk about that in a bit more detail further down as we progress in the
conversation. So thanks, everyone, for the introduction, and I guess we can
well, we can jump straight to the core proposal of your paper, which you kind of outlined
already. And I have to say, from my reading of the paper, that all things
considered, for a typical paper in this space,
it's quite approachable, actually, because it doesn't go deep into the math aspect, basically.
Well, not too much.
It's more like a position paper, if my interpretation is right, at least.
So that's what makes it approachable.
So the key thesis is that algorithms possess fundamentally different qualities to deep learning methods.
And that suggests that if deep learning methods
were better able to mimic algorithms,
then generalization of the sort seen with algorithms
would become possible with deep learning,
which, as you also highlighted in your introduction, is something which is currently not available.
So I thought a good way to approach this for people
who may not necessarily be familiar with the defining properties,
let's say, of algorithms versus deep learning models
would be to actually define those.
Like, in what ways are algorithms and deep learning models different?
So who would like to have a go? I can say something here. I guess,
so with an algorithm, in most cases, it doesn't change. So it's a fixed set of rules that are
executed on some input. And usually, good algorithms have a bunch of well-known properties. So for any kind of input you give, it gives a sensible output.
It gives that sensible output in a reasonable amount of time.
And you can usually change the size of the input and the algorithm keeps working.
And I think those are really desirable qualities.
You can think about any processing that you want to do.
The other thing you can do with algorithms, you can plug them together.
So as I mentioned before, what we're using right now is just a composition of a bunch
of algorithms being strung together.
And the reason they all can be strung together is because of this kind of guarantee that
they have.
Like given some kind of input, they only produce these kinds of outputs.
And that means that we can say, well, those kind of outputs can feed into another algorithm.
You can build a whole stack of these things.
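To make that composition point concrete, here is a minimal sketch in Python; the functions are invented for illustration and are not from the paper, but they show how input/output contracts let classical routines be chained safely:

```python
# A minimal illustration of algorithm composition: each routine's
# output contract matches the next routine's input contract.

def tokenize(text: str) -> list[str]:
    # Contract: any string in -> list of lowercase words out.
    return text.lower().split()

def sort_items(items: list[str]) -> list[str]:
    # Contract: any orderable list in -> sorted list out.
    return sorted(items)

def deduplicate(items: list[str]) -> list[str]:
    # Contract: sorted list in -> sorted list without repeats out.
    return [x for i, x in enumerate(items) if i == 0 or x != items[i - 1]]

# Because each guarantee holds for all valid inputs, the pipeline
# composes safely -- the property deep models typically lack.
vocabulary = deduplicate(sort_items(tokenize("the cat sat on the mat")))
print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
```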
And I think that's maybe, you know, people have been looking at learning algorithms,
you know, for a while, actually, in deep learning.
And it's always been quite difficult.
Simple tasks are actually pretty good, pretty widely used to debug things.
So for example, a simple algorithm would be, I want to produce as an output the input.
It's called a copy task.
I just want to copy the input.
The output is just a copy of the input.
And it turns out that you can learn to do this up to some length, but many networks,
if you just increase the length of the input,
so if I train on the numbers one to ten and test it on the numbers one to a thousand,
many networks will not generalize.
They will not have learned the core idea.
You just need to copy the input to the output.
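As a rough sketch of that length-generalization test, here is a hypothetical harness; the "model" below is an invented stand-in for a network that only learned to copy up to its training length, not a real trained network:

```python
import random

# A hypothetical illustration of the length-generalization test:
# train on short sequences, evaluate on much longer ones.

def copy_task(length):
    seq = [random.randint(0, 9) for _ in range(length)]
    return seq, seq  # the target is an exact copy of the input

def accuracy(model, length):
    inp, target = copy_task(length)
    pred = model(inp)
    return sum(p == t for p, t in zip(pred, target)) / length

# A stand-in for a network that only learned to copy up to length 10:
# beyond its training horizon it emits filler. Real failures are less
# clean-cut, but the qualitative picture is similar.
def trained_up_to_10(inp):
    return inp[:10] + [0] * max(0, len(inp) - 10)

for n in (10, 100, 1000):
    print(n, round(accuracy(trained_up_to_10, n), 3))
# Accuracy is perfect at length 10 and collapses as the length grows.
```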
This gets, you know, as you make the process more complicated, then you can imagine this gets worse.
So if you think about sorting, or various graph algorithms, actually, the generalization is far worse if you just train a network to simulate an
algorithm in a very naive fashion. But there's something very nice about algorithms, which is that
they're basically simulations: you know, you can generate a lot of data with them,
and that makes them very amenable, actually, to being learned by deep neural networks.
But it requires us to then think, you know, from the deep learning side, what changes do we need to make there so that these algorithms can be well represented and actually learned in a robust fashion?
So if you use deep learning for something, usually there isn't a very strong guarantee on what the output is going to be.
So you might say that the output is a number between zero and one, and you can guarantee that,
but you couldn't guarantee something more structural. So for example, very rarely would
you be able to guarantee that, you know, if I show this network a picture of a cat,
and then I take a different picture of a cat, it will definitely classify that as a cat. There's a certain degree of error that's going to leak into it. Whereas if you had
an algorithm for doing something like this, you may very well find that you could develop guarantees
that this wouldn't happen. Now, this is partly because the kind of problems algorithms are
applied to are more amenable themselves to these kinds of guarantees. That's a very interesting
thing in itself, because the question is,
if the problem is amenable to these guarantees,
then maybe there's something we can bring across into the deep neural networks
on classical algorithmic tasks that allows us to develop these kind of guarantees
for the neural networks themselves.
And those guarantees are usually about generalization.
So they're generalization in terms of the size of the inputs,
the kinds of inputs you
have. There are algorithms that generalize over types, for example. If I have a
sorting algorithm, you can sort a list of numbers, but you can also sort anything you can find an
ordering over. So you can do it over letters and words and things like this. That's not the kind
of thing we see at the moment with deep neural networks.
Usually when people learn sorting, they learn on one type of thing like numbers.
And they wouldn't necessarily even consider testing it on words, even though it's just the same algorithm.
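A trivial illustration of that type-generality, using Python's built-in comparison sort:

```python
# The same comparison sort works for any type with an ordering;
# the abstract procedure is independent of the input representation.
print(sorted([3, 1, 4, 1, 5]))           # numbers
print(sorted(["pear", "apple", "fig"]))  # words
print(sorted([(2, "b"), (1, "a")]))      # tuples, lexicographically
```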
So there's this nice abstract representation of processing that's independent of the inputs.
And that's part of the thesis here, is to try and think about what is this abstract processing?
It's a bit more detached from the actual representation of the inputs and can we learn that to some extent and
learn to use that. So that's sort of my little bit there. I don't know if there's anything Petar or
Andreea would like to add to that. Yeah, that was fantastic, Charles. I would maybe add only a few
things that are also important to note on the distinction between deep learning and algorithms, at least classical
algorithms, is that first of all, you know, algorithmic computation can usually be very
concisely and elegantly written down as a pseudocode that explains how you go from your
inputs to your outputs. This makes them trivially interpretable. And because they operate
over these abstractified inputs that conform to some preconditions and postconditions,
this means it's much easier to theoretically reason about them, like make guarantees and stuff
like that. And also makes it much easier to find connections between different problems that you
might not see otherwise. I think max flow and min cut is one of my most loved examples where, you know, there are two problems that are seemingly quite different,
but the solution of one is necessarily the solution to the other. And that's not as obvious
unless you study it from a very abstract lens. And yeah, so there's a lot of the benefits to this
kind of elegance and constraints, but it's also the potential shortcoming of algorithms because if you want to make your inputs conform
to these stringent preconditions,
what this means is that if my data that comes from the real world
is even a tiny bit perturbed and doesn't conform to my preconditions,
I'm going to lose a lot of information
before I can massage it into the algorithm.
And that obviously makes it suboptimal
because even if the algorithm gives you a perfect solution,
it might give you a perfect solution in an environment that doesn't make sense. And therefore, the solutions aren't going to be something you can use. This is
something we will touch upon quite a bit when we talk about Andrea's work as well. But basically,
so that's quite important. And on the other side, deep learning is designed to just rapidly ingest
lots of raw data at scale, and pick up what are interesting rules in that raw data
without any real strong constraints.
And this makes it remarkably robust to these kinds of noisy scenarios.
Like, you can perturb your inputs and your neural network
will still be reasonably applicable; for classical algorithms,
that may not be the case.
And that's also another reason why we might want to find this awesome middle ground
where we might be able to guarantee something about our data,
but not require that data to be constrained to say tiny scalars
when the complexity of the real world might be much larger.
So yeah, that's maybe a comment I would add.
Andreea, do you have some thoughts as well?
I feel like I will continue on both of those threads
when I talk about XLVIN.
So, yes, I think you covered everything other than the exact way we apply this.
Maybe I would add one comment then, which is like, where do algorithms come from?
And I think this is a really important thing for us to think about when we think about applying all of this. So, you know,
usually what happens is you find a very clever theoretical computer scientist, and you tell them
about your problem, and they think really hard about your problem. So let's imagine, you know,
you've got some real world problem you really want to solve, but you're not quite sure how to solve
it efficiently. They'll then often go away and they'll sort of map it
onto a more abstract version of that problem,
derive an algorithm for that,
and then tell you that for everything in this problem class they've got,
this algorithm will execute in this amount of time
and give you the right answer.
And there's a bit of an inductive leap there,
which is that the mapping from your real-world problem
into this abstract space on which the algorithm is derived
isn't always completely aligned. And I think one thing we see with machine learning is kind
of the opposite, which is that it just looks at the data. It doesn't really map it onto an
abstract space, but it does solve the problem based on what you tell it. And so you have these
two extremes. And I think what we're trying to do here is get somewhere in between those two extremes where you have something that's a bit more structured, but it still fits the data.
And it's not necessarily a human in the loop constantly process.
You don't necessarily need to think really hard as a theoretical computer scientist.
And there's some opportunity because often real world problems are not exactly mapped onto the problems that we have algorithms for.
And also for these things that we have algorithms for, the abstract problems,
often there's like a few examples in there that are really, really difficult to solve.
And they're what make the algorithms really, really slow.
And they may not occur in the real world.
And so if we have something that can learn from your data that really just solves the problem you've actually got,
rather than something much more abstract and much more expensive,
there's the potential to get algorithms that are generalized just as well,
but actually are much faster than what we already have.
And so I think one of the things all three of us have been really thinking about here is,
and this is exactly where Andreea's work comes in:
can we have new algorithms
that significantly outperform existing algorithms, but with the same sort of guarantees for the
real world?
Sorry, I just wanted to mention that.
Yeah, it's a good point.
And actually, you touched upon a few points that I made notes on and wanted to ask follow
up questions on.
So one of them has to do with complexity, really. So
one of the defining properties of algorithms is precisely that, that you can study them
theoretically and devise their complexity and therefore make predictions about the time at
which they're going to execute, which is not the case for machine learning and deep learning. So
I guess this is probably one of the aspects
that you're aiming to address.
Yeah, I guess for most deep learning algorithms
and a lot of machine learning algorithms,
once the network is trained,
the runtime is a fixed function of the input size,
because the architecture doesn't change;
it's like linear or
maybe quadratic. There's maybe one exception, which would be something that involves search; there's
exponential time, but that's very rare. Whereas I think when a human sits down to write a program,
it's very easy to get something that's got exponential time. You know, if you try and solve
a problem manually and it's sufficiently hard, you'll often come up with an exponential time solution,
unless you've done a lot of programming competitions like Petar has,
in which case maybe you get to a very quick solution quickly.
But in general, you'd come up with something quite slow.
Neural networks are the opposite. They're extremely lazy.
And that's a very desirable property in learning and coming up with new algorithms.
You really want something that's very fast.
So we actually have to put all the work into making the networks use more computation.
That's actually something that we thought a lot about, and there's still a lot of work to be done
there. So there are people who've looked at networks that can adapt the amount of computation
time to use by changing the amount of pondering they do on a problem. And the idea there is to
work out when to stop. But that's, you know, quite new and not that widely used. And maybe
Petar will mention later, in pointer graph networks, we had a mechanism very much like this
that helps us, you know, when you don't know ahead of time, how much, you know, there's this
whole problem in deep learning of how you design the architecture you use, and it has a huge impact
on how well it works. And there's a strong connection between how much processing you do
and how much computation time you spend
and what kind of architecture you come up with.
They're intimately linked.
I think it's very interesting to think about.
And as you go through more and more complicated problems
and the algorithms associated with them,
you have to then develop networks
that have those same computational expressivity,
which, you know,
isn't always there, actually. And even if it is there, it's not always possible to get a network
that's trained to use it. So those are some of the challenges I think are really interesting
with this particular domain that you don't necessarily find as explicitly in other domains, although it
certainly exists. Something like the game of Go certainly has a lot of computational demands; it's really
massive. But the algorithm isn't necessarily clear, and so it's harder to study from that perspective.
I would maybe just add one quick remark about complexity, which is that in algorithms,
in theoretical computer science, people usually look at, you know, the worst case.
I think what we're doing here, like trying to apply algorithms on really naturally occurring inputs, we have a bit of a redeeming quality, which I like to concisely
state as nature is not an adversary.
So even if the underlying problem you're solving is NP-hard in the worst case and requires
exponential time to solve, the naturally occurring instances of that problem that you're interested
in may well be polynomial time solvable.
So while traveling salesman, for example, is an NP-complete problem and is known to be so,
and we don't know of any polynomial time algorithm for it,
there exists a solution that's 100% correct for the traveling salesman problem for all the towns in Sweden,
all the towns in Germany, all the towns in the USA. And that's because, you know, geographically occurring data actually has nicer properties, right, than an arbitrary graph you
could feed into traveling salesman. So one thing that we do sometimes when we solve natural problems
with algorithms is we try to push them into this framework we've come up with, which is nice and
abstract. But as a result, we might have made the problem more complex than it needs to be.
So in algorithmic reasoning, and I will touch upon this very shortly, we often deal with learning heuristics
that are strictly polynomial time. So they're not going to get to exponential complexity at all.
But they may well be just enough to solve all of the instances you actually care about in the real
world, because it uses the power of deep learning to extract the meaningful properties from the raw data. So I just thought that was a very important point to make here.
And then the other follow-up question I have was a kind of naive one really, but some of the things
you described actually are not specific to deep learning. They apply to machine learning in
general. So I was wondering why did you choose to address deep learning, basically?
Why did you choose to go for that generalization framework,
specifically applied to deep learning algorithms
and not just any machine learning algorithm?
So I guess one reason, which I think will also flow into the next segment
where we'll talk about our blueprint,
is that we want to design solutions that operate over the true raw complexity of the real world.
So something that will take the raw data you get from nature and make decisions that are algorithmic based on that data.
And so far, the best solution we have for processing large amounts of naturally occurring data at scale is generally
deep neural networks. Like that's the current state of the art for massive quantities of naturally
occurring data. And that's why we, I guess we focused our discussion on it. But in terms of
the applicability of the blueprint, I think at least the blueprint we propose is amenable to
any kind of gradient based optimization. So it doesn't have to be deep neural networks.
But like, I can totally see how if somebody wanted to, say,
apply a genetic algorithm to this framework,
they would still be able to do it if they wanted to.
Charles, you might have a comment on this, but yeah.
Yeah, I guess maybe, like in practice for a single problem,
like in a sort of very applied setting,
you might well find like logistic regression
or decision trees work fantastically well.
And I guess with deep learning, as Petar mentioned,
when you start to deal with text or audio
or visual inputs, video or static images,
deep learning starts to really take off.
It's able to deal with those.
And one of the reasons it really takes off there, we think,
is because inside the network,
it has these much richer representations of the data
that's really more elaborate
than what you can get with these other kinds of machine learning methods
like decision trees or logistic regression.
And so when we're thinking about extending the
computational capacity even further and being able to do more computation and richer kinds
of processing of data than we currently see in neural networks, it's natural to think that
that might provide a useful basis for doing that. And I guess what we find is that it does,
but actually we need to look at new neural network architectures that are distinct to ones people have previously thought of to be able to do that kind of processing.
So even inside something that's quite a large model class, it's very rich and complicated.
We find that we need to push the boundaries even further than that to be able to execute algorithms reliably.
And so it is a sort of, you know, it's an empirical science that we're looking at.
And, you know, I just don't think some of the things out there,
you know, as you get richer and richer decision trees,
they can start to do some of this processing.
But we know that it's basically a tree of if this, then that.
And what's missing from that is recursion.
So what's missing, or iteration,
the ability to loop over things multiple times.
Whereas with neural networks, for a long time, really,
people have understood that there's a relationship between
iteration or recursion and recurrent neural networks, for example.
That's maybe one motivation for them.
And in graph neural networks, the same sort of processing arises again,
like the message passing that you see there,
is again something very natural in object-oriented programming,
where you send messages between classes or objects.
You can see it's exactly analogous.
And you can build very complicated interaction diagrams,
and those can then be mapped onto graph neural networks.
So it's from the internal structure that you get a richness that seems,
it might be powerful enough
to learn algorithms you wouldn't necessarily get with more traditional machine learning methods.
Okay, thanks. Thanks. That explains things a little bit and I think it's also a good
occasion to actually go to the next part of the conversation which is
the core basically of what your research proposes,
the blueprint for neural algorithmic reasoning.
It would be nice if people listening to the conversation
could also see the diagram you created for it,
because it's quite elaborate. But the main idea, if I'm able to convey it a little bit, is basically this
kind of mapping between inputs and outputs. And that's why it's related to what you just
mentioned about data representation, basically, and that being the reason for you choosing to
work with neural networks. So it seems like the core idea basically is taking some sort of elaborate real world, let's say, data artifacts or things that happen in the real world, basically, and use some data representation to make it processable by neural network.
So do you want to elaborate on that and how do you envision that happening?
Yeah, I think I will answer this one. So basically, I guess the best way to present
how we arrived at the blueprint is to start from what is the status quo? Like,
where are we right now in terms of processing real-world data with
algorithms and gradually fixing some issues that are placed there by
introducing neural networks at various stages of the pipeline,
I think is a way to kind of nicely, step by step, arrive at our blueprint.
So imagine you have a real world. I'm going to use the running example from the paper, but any other example with real world data works.
So you have, you know, a real world road network over which you need to predict travel times, which is a very standard problem.
Within DeepMind, we've worked on a Google Maps application where we applied graph neural
networks to predict travel times.
And this is now serving queries in Google Maps worldwide.
So it's a very impactful problem that hits a lot of users.
But fundamentally, you cannot just imagine the road network as a graph very easily and
apply Dijkstra's algorithm on it to compute shortest paths, right?
Or even estimate travel times.
But in principle, if a human wants to apply a classical algorithm like Dijkstra's, this is what they would have to do.
They would have to take all the complexity of the real world, you know, roadblocks, weather changes, traffic flows, bottlenecks, God knows what,
and convert that into an abstractified graph of nodes and
edges with some weights, which sort of correspond to the travel time, so that then you can run
Dijkstra's algorithm over this thing and predict the shortest paths, and then somehow map that
back into a route. So this is the standard pipeline, how algorithms were originally proposed.
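For reference, here is what that abstract middle stage looks like as code: a minimal Dijkstra sketch, with the graph and its weights invented for illustration. Note that by this point every road segment has already been collapsed to a single scalar weight.

```python
import heapq

# The abstractified input: one scalar weight per edge (a toy graph).
graph = {
    "A": [("B", 4.0), ("C", 2.0)],
    "B": [("D", 5.0)],
    "C": [("B", 1.0), ("D", 8.0)],
    "D": [],
}

def dijkstra(graph, source):
    # Classic Dijkstra: repeatedly settle the closest unsettled node.
    dist = {v: float("inf") for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, already settled with a shorter path
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

print(dijkstra(graph, "A"))  # {'A': 0.0, 'B': 3.0, 'C': 2.0, 'D': 8.0}
```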
Now, there's a few issues there. And I think the first problematic issue, as I said,
we will keep repeating this issue time and time again throughout this interview,
is that algorithms are very happy to give you perfect answers, regardless of whether or not
the inputs themselves are perfect. So yeah, it's going to give you the shortest path, but who
guarantees that this graph you came up with is an accurate representation of your real world
scenarios, right? So the algorithm might give
you great solutions, but, you know, there's really no way of saying whether this human-devised heuristic
is actually the best way of looking at it. And, you know, that's the first point of attack. So
historically, we found that whenever there's a lot of human feature engineering, and lots of raw data
that we need to work with for that human feature engineering, this is where deep learning shines. Like the initial success stories of deep learning were
exactly in replacing handcrafted heuristics for, say, image recognition with deep neural networks.
So that's exactly what we do to begin with. Replace the human that maps the real world raw
data to a graph input with a neural network that will map the complexity of the raw data
to some abstractified graph
inputs that we can then run Dijkstra's over.
So far, so good.
That's looking like it might help.
And there exists a really good line of research that's actually concurrent to our work that
works exactly in this setting.
And they figured out a way to even propagate gradients through the algorithm.
So you can actually get good neural network optimization in this setting and apply the algorithm, which is awesome.
But there are a few limitations with the setting,
even if you're able to propagate gradients through it.
And that is, first of all,
the algorithm may not be everything you need
to compute the final solution, right?
You might say some part of this problem
requires shortest paths,
but maybe at some point,
I also need to run a flow algorithm.
And I'm not really sure how to combine the results of the two in the best way, because, you know,
the problem is very raw and noisy, right? So just, you know, forcing your representations to rely on
the output of this one algorithm might be too limiting. And also the fact that we compressed,
like, you know, Dijkstra requires one scalar value per every edge in the graph, which means that
you're compressing all that amazing complexity of the real world into just one number for every road segment. And that's
potentially problematic, because if you don't have enough data to estimate that scalar correctly,
you're doomed. Like basically, the algorithm is not going to give you correct solutions once again,
because you just haven't seen enough raw data to estimate that scalar correctly. And this is
basically a bottleneck problem, which in the paper we refer to as the algorithmic bottleneck.
So the bottleneck happens because we're placing all of our bets on this one scalar in every edge,
or basically a very low dimensional representation of the problem. So how do we break bottlenecks?
Well, we stay high dimensional, you know, deep neural networks are deriving a lot of their success from staying high dimensional
and doing all of these like high dimensional regularization techniques, which means that
even if you predict badly some parts of the internal vector in the neural network, the
other parts of that vector can still step in and compensate.
Not everything is lost, especially at lower data regimes.
So that's what we now
have to do. But algorithms weren't designed to run on high dimensional inputs. They're designed
to run on these low dimensional representations. So the key idea in our blueprint is to replace
the algorithm with a neural network, typically a graph neural network, which we will discuss
later on, and which actually takes a high dimensional input and does one step of reasoning to produce another high dimensional input. And then once we've done
enough steps of that reasoning neural network, we predict the final outputs from there. Now we're
completely high dimensional everywhere. So you know, standard neural network optimization can
work much more nicely. Now, the only thing we need to guarantee is how do we make this high-dimensional
neural executor actually imitate the algorithm, like actually do useful stuff with respect to
the algorithm. And that's why we have also the abstract pipeline where that neural network is
actually pre-trained. So we know that we want to use some algorithm. As Charles said, we can
generate usually lots of synthetic data for that algorithm. And we can basically pre-train our high dimensional component to respect the steps of the algorithm.
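Schematically, one pass through the blueprint might look like the following sketch. The dimensions, dense layers, and random weights are invented placeholders; the actual processor in this line of work is a graph neural network, pre-trained on synthetic traces of the target algorithm and trained end to end with gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# A schematic of the encode-process-decode blueprint (shapes invented).
D_RAW, D_HIGH = 8, 128   # raw input size, latent dimensionality

W_enc = rng.normal(size=(D_RAW, D_HIGH)) * 0.1    # encoder: raw -> latent
W_proc = rng.normal(size=(D_HIGH, D_HIGH)) * 0.1  # processor: latent -> latent
W_dec = rng.normal(size=(D_HIGH, 1)) * 0.1        # decoder: latent -> output

def forward(x, n_steps):
    z = np.tanh(x @ W_enc)        # encode raw data into the latent space
    for _ in range(n_steps):
        z = np.tanh(z @ W_proc)   # one latent "reasoning" step, mirroring
                                  # one step of the abstract algorithm
    return z @ W_dec              # decode the final answer

# The processor weights are what you would pre-train on synthetic
# algorithm traces, then reuse on natural data -- staying high
# dimensional throughout, with no scalar bottleneck in the middle.
print(forward(rng.normal(size=(4, D_RAW)), n_steps=3).shape)  # (4, 1)
```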
And a lot of our research prior to writing this paper, primarily Charles and I, we wrote papers like neural execution of graph algorithms, pointer graph networks, persistent message passing, which dealt with just that part.
Like, how do we get reliable and robust neural networks that simulate algorithms in an abstract space?
But then, we hadn't actually looked at the natural part of the pipeline.
Like, can this actually be used in a real natural problem?
This is actually where Andreea's work comes into play, because she has actually demonstrated that this blueprint can be applied within a challenging, noisy input problem.
And I'm assuming that's going to be the next topic of discussion,
but basically, unless you have any follow-ups on this part.
Actually, I was going to suggest that this is probably the right part
for Andreea to talk a little bit about her work
and how it comes into play in the blueprint.
Yes, sounds good.
I guess I'll start a bit by talking about reinforcement learning, because
all this was plugged into the reinforcement learning framework. So, as we've seen,
reinforcement learning has been successfully applied for games, like AlphaStar
and AlphaGo, and for real-world applications,
like reducing the cost of data center cooling.
I think also Loon was using it.
But basically, in all these problems,
we frame our problem to talk about states and actions.
And what we want to do
is to estimate how good our states are.
So we define some value over these states.
So, for example, if you think of a chess game,
this value is, like, how good is this move?
And what we want is to generally make plans,
like, take decisions such as this value is maximized.
So this value is like a core part of taking decisions.
And we have algorithms to get these values.
So if we know the environment dynamics, so like how we can move from one state to another,
and what sort of reward we get from one state given an action, we have an algorithm, the value iteration
algorithm, which can give us perfect estimations of how good the states are.
And yeah, basically this iterates on the value estimates and gets them better and better
until they converge to the ground-truth values.
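For readers who want the target algorithm itself in front of them, here is textbook value iteration on an invented toy MDP; this is the abstract computation that Andreea's work imitates in a high dimensional latent space:

```python
import numpy as np

# Textbook value iteration on a made-up MDP with 3 states, 2 actions.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_actions, n_states, n_states))  # P[a, s, s'] transition probs
P[0] = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
P[1] = [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.0, 0.1, 0.9]]
R = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # R[s, a] rewards

V = np.zeros(n_states)
for _ in range(200):
    # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * E[V(s')] ]
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:
        break  # converged to the ground-truth state values
    V = V_new

print(np.round(V, 3))
```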
Why I'm talking about this is because this is the algorithm
we tried to imitate.
So on synthetic data, simple graphs,
we tried to imitate the value iteration algorithm.
And the idea was that, in previous work,
like DQN, neural networks tried to estimate the values.
And then some works have used these values as numbers
and roughly applied the value iteration algorithm on top of these predicted values.
And what we wanted to do is step away from getting these values, like these precise numbers,
over which we execute the algorithm and keep everything in a high dimension,
because, as will happen at the beginning, these predicted values of the network won't be that good. And what we know is that if we have an actual algorithm,
our algorithm's output will only be as good as its inputs.
It will do what it's supposed to do,
but if your inputs are not what you actually want,
then your outputs are not going to do what you want.
So with the neural network plugging in its estimates of values
into the value iteration algorithm,
the output might not be that good,
at least at the beginning, when the network is training.
So what we did is, we first learned how to imitate the value iteration
algorithm, and then we used the network's estimates
in this high dimensional space.
Yeah, and I guess this allows the network to recover from mistakes. It has
data efficiency benefits. You don't need that much data at the beginning to get off the ground,
because your algorithm can deal with your imperfect estimates when you don't know the environment's dynamics, which
is what usually happens in the cases of applying neural networks in reinforcement learning.
But yeah, it also opens the direction for a lot of other ways we can use algorithms
in neural networks.
For example, in this case, we knew
value iteration is a useful algorithm: if we have the
values, we can make a plan and we can act. But if we know other
things, like that the shortest path is something of interest in this
environment, then we can learn an algorithm which
learns how to find the shortest path, and we can plug it in. So I guess that's
just a general way we can see how we can combine specific algorithms which we know are going to be useful with the reinforcement learning framework in this case. If you want to
add anything I probably skipped over or moved very fast over a lot of parts.
I guess I would just quickly add
to mention the paper,
just to be clear.
This is the executed
latent value iteration networks paper,
or XLVIN,
and it's publicly available on arXiv.
And we looked at,
basically, as Andrea mentioned,
we learned the value iteration algorithm
in the abstract space,
then we plugged it into
a reinforcement learning agent,
and we directly compared it against an identical architecture,
which rather than using our algorithmic component,
predicts the values and runs value iteration on them directly.
And, you know, this second agent actually managed, in some cases,
to catch up to ours if you give it a lot more interactions with the environments.
But in very low data regimes, where people are trying hard to get reinforcement
learning to do better, because for Atari, sometimes you need billions
of frames before you get a good policy, right?
And that's a lot of simulation budget that you may not always have in the real world.
You know, we found that at the beginning, this version that propagates values through a fixed algorithm actually performed worse than a baseline, which didn't have any algorithm in the first place.
Because initially, those value estimates were so bad that the algorithm was giving you useless answers.
Whereas our high dimensional version of the algorithm was able to kind of compensate for any mistakes there with also doing some pure model-free learning. And as a result, it stayed on top for the most,
like basically throughout the low data regime,
it was consistently on top compared to the others.
And to be more concrete about where we applied this,
we applied it on some classical control problems
and on the classical Atari benchmark directly
from pixel representations of the Atari frame.
Okay.
On the topic of dimensionality,
and this is something, well,
you mentioned previously graph neural networks,
and this is something you also cite in the paper,
previous work by yourselves as well as others.
And to be honest with you,
if I can say just a few words about how I came
to know your work, this is precisely how. And I come from, I have a general interest
in graphs, let's say. My background is more kind of like traditional computer
science, I guess same as you. And so I came to graphs more from the data
modeling angle, so knowledge graphs and data modeling.
But I quickly realized that graph machine learning is, you know,
is really kind of booming at the moment and not, you know, to be honest,
not having a background on that myself, trying to distill why that is.
So I came to the conclusion that, well, basically it seems to,
it seems like when you use graph machine learning,
you're basically able to use more information.
So leverage the connections as opposed to just vectors, basically.
And you can correct me on that, but I was wondering if you would like to just say a
few words about the way that graph neural networks are relevant for your blueprint, basically.
Sounds great.
So I think I'd suggest the best way to do this is I'll say a few words
about graph representation learning in general,
and I'll maybe hint a bit at why are they such a good choice within our blueprint.
And then I think Andreea can elaborate a bit more on some specific results
that are already known
that make graph neural networks very tightly linked to the kinds of computation that we would want.
And then maybe, Charles, if you have some philosophical remarks at the end, we can incorporate them in.
But yeah, basically, so my background for my PhD was in graph representation learning.
So I'm intimately familiar with what's going on in the area right now. And I felt like, you know, it gives us a great tool for processing data that
lives on a graph. And, you know, if you squint hard enough, any kind of data that's given to
you can be fit into a graph representation, like images can be seen as graphs of pixels connected
by proximity. Text can be seen as a, you know, a sequence of objects linked together.
And basically, but, you know, more generally,
things that truly come from nature that aren't like engineered to fit within a frame
or within a sequence like humans would do it,
are actually quite naturally usually represented as graph structures.
So like molecules, for example, they have a very nice graph interpretation.
Social networks, transportation networks, connectome structures in the brain, all sorts
of data is very naturally represented as a graph structure.
And I think we're seeing a boom of just very amazing application papers and industrial
and scientific applications of people just realizing that graph neural networks are a
thing and that their data is actually much more naturally represented as a
graph than, say, as an image, and as a result getting some very interesting results out there
and some very impactful outcomes. And just to kind of set the scene for why I
think they're a pretty good class of models to look at, I'll just mention briefly what I think
is the most exciting application of
graph neural networks that I've seen recently. And I think Andreea can comment on that a bit more.
It's like one of the most popularized for sure is the application of graph neural networks in
drug screening and drug discovery. So a team from MIT was, right before the COVID pandemic,
successful in training a graph neural network to predict molecular properties, specifically, whether a molecule would be a good antibiotic. And then they evaluated this on
a huge set of candidate molecules. They sent the top predictions to chemists for further study.
And they ended up discovering, using this graph neural network screener,
a molecule that was previously never considered to be an antibiotic; it was considered for diabetes.
But because of its unusual structure, it actually had a much higher potency against the heavily
resistant strains of microorganisms that you might want to eradicate. So this is one very clear case
that was widely popularized in the media about just how graph neural networks can impactfully
affect a core scientific problem. And within DeepMind, we've also worked on a lot of these very impactful applications.
We've used graph neural networks to closely study the structure of glass and how it evolves,
which is, by the way, a big mystery from a physics perspective in many aspects.
We've used graph neural networks to accurately predict travel times,
which is very much in scope of what we're discussing here.
And that's currently serving billions of queries worldwide in Google Maps
and so on and so forth.
Like generally, there's a lot of very impactful ways
of looking at your data in a graph-structured format
and using a graph neural network to support that.
Now, Andreea, I think you'd be in a good place
to say a few words about just what results are out there
ready to show why graph neural networks
are such an important part of our blueprint.
Yeah, I guess the first thing that comes to my mind is the paper, if I'm not mistaken,
called What Can Neural Networks Reason About?, or something like that, where they showed there's quite a clear alignment
between the graph neural network's message passing update rule
and the dynamic programming update rule.
We can write them so that we can see
how the two of them can do the same computation, basically.
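That alignment is easy to see in miniature: one relaxation round of Bellman-Ford, a classic dynamic programming shortest-path algorithm, is exactly a message-passing step, with messages flowing along edges and a min aggregation at each node. The graph below is invented for illustration:

```python
# One Bellman-Ford relaxation round as a message-passing step:
# each node aggregates candidate distances ("messages") from its
# neighbours with a min, then updates its state.
edges = [("s", "a", 2.0), ("s", "b", 7.0), ("a", "b", 3.0), ("b", "t", 1.0)]
nodes = ["s", "a", "b", "t"]

dist = {v: float("inf") for v in nodes}
dist["s"] = 0.0

for _ in range(len(nodes) - 1):          # GNN depth ~ number of DP rounds
    messages = {v: [] for v in nodes}
    for u, v, w in edges:
        messages[v].append(dist[u] + w)  # message along directed edge (u, v)
    for v in nodes:
        dist[v] = min([dist[v]] + messages[v])  # aggregate, then update

print(dist)  # shortest distances: s=0.0, a=2.0, b=5.0, t=6.0
```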
I don't know if you had something else in mind,
but that was what convinced
me and I guess what got me
to first try to imitate
value iteration, which is a dynamic programming
algorithm with a graph neural network.
Yeah, so exactly.
And to follow up on that, there's a strong, basically
this paper established a strong theoretical
link between dynamic programming
algorithms and the computations of a GNN,
which means,
you know, most polynomial time heuristics can be interpreted as dynamic programming,
which means that graph neural networks might be, you know, the right inductive bias for
this kind of computation, right?
Now, where does our work come into play?
The work that Charles and I have done, for example, is that this initial paper just talked
about the best case scenario, right?
There exists a setting of the weights of my graph neural network
that'll make it nicely behave as that dynamic programming algorithm.
But it doesn't tell you necessarily what's the best way to get there
and how to make it work with the specific data you have
or even extrapolation in these issues that are important for algorithms.
So we actually spent a lot of time coming up with modifications
to the graph neural network architecture,
modifications to the training regime, modifications to various inductive biases that are plugged in there, in papers you can find on arXiv, which actually not just went, let's imitate the iterative computation of dynamic programming, but let's also try to include some data structures, which are crucial to how algorithms
operate nowadays, and also incorporate maybe some aspects of persistent reasoning. So rather than
just being able to support a simple data structure on top of my existing set of nodes, can I maybe
create additional memory, additional nodes? Because many algorithms rely on initializing additional memory aside from the memory required to store their input.
So we've kind of developed models that allow us to align more and
more with computation, while still following that same GNN blueprint,
the link that was already spotted in the first theoretical paper. Charles? Yeah, I guess it's also worth mentioning graphical models,
which are a kind of machine learning model that some people still look at,
but it was kind of dominant before deep learning came to rise.
And a lot of the stuff you see inside GNNs comes from that.
And RL also has a similar flavor: RL is basically a graph algorithm. It's
a dynamic programming update. It's actually related to a shortest path algorithm. You
can show that it's very closely related. And there's a little bit of, you know, tweaking
you have to do to make it online. And that's basically what RL does. It's like an online
shortest path algorithm. And so then when you think about that,
it's not too surprising
if you're trying to find the shortest path in the graph
and then you want to represent your problem as a graph,
that maybe there's a good relationship there.
Message passing itself in graphical models
was just the way you represent
the dynamic programming updates
and propagation information inside a probabilistic model.
And the form of updates and the way things are combined inside graph neural networks is very
much inspired by that, but loses some of the guarantees. So in some sense, graph neural
networks are like a softening up and a generalization of an existing algorithm for doing
propagation of probabilistic information in a graphical model. And now what we're doing is
we're saying, okay, well, maybe that's a bit more general.
Like as Petter mentioned, dynamic programming is a wonderful way
to think about solving any kind of problem.
It's not always, you can't always do it,
but when you can do it, it works really, really well.
And I think that's maybe like one of the deep connections
between certainly graph algorithms and reinforcement learning
and graph neural networks.
And now the question is, when you go to things beyond that,
what's the basis for thinking that that might be represented by a graph?
And I think object-oriented programming tells us a lot about how humans build the structures
around how their data structures look and how they communicate between different parts of it.
And that's a kind of message passing, a very different kind of message passing
for a completely different reason.
And that gives another idea
that there might exist another graph there.
But as we get more and more complicated,
the question is,
what is the structure of the graph?
And that's really what's required a lot of thought.
And that's what will continue,
I think, to make us think about
what is the right graph for a particular problem.
And can you start to learn those graph structures?
Okay. So the last part of the conversation, which I guess we have like a few minutes to wrap up,
I was going to ask you about roadmap basically and applications and I have to add that to my
surprise, in our previous exchange Petar mentioned that there are some cases in which neural algorithmic reasoning has already worked.
And so I guess it's a good idea to start by asking you to refer to those cases, and then from there you can also talk about the roadmap.
Sure. So one case we already have, as discussed, is deploying an abstract algorithm like value iteration within an agent that solves, say, the Atari reinforcement learning environment of video games purely from pixels.
So this is a case of eventually like taking pixel data, processing it in many different ways,
and eventually being able to apply a neural algorithm on top of that.
And this actually outperforms, say, that same architecture with a fixed algorithm in there
and also a baseline without any kind of algorithm, but otherwise the same hyperparameters.
So that's like one example that we already have of the blueprint already working in practice.
We do have some work that is coming out very soon that shows that we are able to use the algorithmic reasoning blueprint
to support unsupervised learning and self-supervised learning.
Specifically, you know, often unsupervised learning is just,
let's take a massive quantity of data and try to extract the most meaningful properties out of it.
But very often, that's not all the information you have. You might have some knowledge about
how your data came to be. For example, if you're working with estimating some representations from some physics simulations, like a bunch of balls bouncing around or n-body systems, you don't just see a bunch of snapshots of that system.
You know that it has to obey certain laws of physics.
And we think the neural algorithmic reasoning blueprint is an excellent way to take those laws of physics, package them up into a neural network
that you can splice in as part of your unsupervised architecture,
and as a result, get better representations.
And we're starting to see some really nice results
on a variety of environments that this blueprint is actually promising.
We should have a paper on arXiv within the next few weeks detailing this.
So hopefully by the time this article is out,
people will already be able to take a look.
I'll send you a link.
But yeah, basically, that's one thing that we have in the near future that shows that this can help just learn better the properties of data.
And we are also just building off of Andreea's great work.
We would want to also see where else can we splice algorithms within reinforcement learning?
Because obviously reinforcement learning is an area of really high interest for DeepMind and I think also for the broader deep learning community generally.
And we think there's algorithms left, right, and center inside the reinforcement learning pipeline.
So basically a wealth of opportunity for just various places we can splice algorithmic reasoners.
So I guess that would be kind of the immediate applications
of the blueprint, ones that we already have
and ones that we were thinking of in the near future
that we might want to explore.
But I'm sure Charles and Andreea
have very interesting thoughts of their own.
So I'd like to give them a chance to fill in.
Sure. So who would like to go first?
Charles, you can go first. Yeah, sure. I guess I mentioned at the beginning,
there aren't that many algorithms out there that we use. And the question is, can we learn all of
them? There's not many of them. And I think that's the thing that I'm really excited to work out.
Because if you can have a single network that is able to execute any one of the algorithms that you know already,
then if you get that network to plug those algorithms together,
you start to form really quite complicated processing pipelines or programs.
And if it's all done with gradients going through it, then you start to learn programs.
So there's an increasing interest in program synthesis and these kinds of things
where you explicitly produce a program that's then executed to solve some task.
But I think it's really interesting to think of an alternative approach
where you never explicitly produce the program,
the code for it, but you still have something that can actually execute it. This is like the original
neural execution work that Petar and I did; the sort of motivation for it is that you could
already, in some latent space, execute a complicated program, where you've learned what goes into that
program by composing together stable algorithms. And so that's, I think, another area that's very exciting.
And, you know, of course, you know, machine learning itself is just an algorithm.
So we have gradient descent.
We have automatic differentiation, which, you know,
you can think of as a kind of dynamic programming, actually.
And, you know, in RL, there's a slightly more complicated loss
that involves another form of dynamic programming.
So then the question is, you know, if you really take this to its limit,
then you start to really learn algorithms that do learning. And I think that becomes very
interesting, because obviously, one of the limitations in deep learning is the algorithms
that we have for learning, you know, there hasn't been that much change in the best optimizers we
use, or how we update the weights in a neural network during training for quite a long time.
And, you know, there's been a lot of search over different architectures and so forth,
but they haven't always found the next breakthrough.
And the question is, you know, is this a different way of looking at that where we can start
to find new learning algorithms?
Because learning algorithms are just algorithms and maybe what's missing from them are, you
know, this whole basis that we have for other
algorithms that we use in all computers. And so we can get a slightly more universal algorithm
executor and use that as the basis for better methods for machine learning.
Thank you. And Andreea, any closing thoughts from you? Yeah, I really like
Charles's remark about a network which runs multiple algorithms, all algorithms if possible.
I think that's something I would personally like to investigate.
Together with Mila colleagues, we've made some steps in this direction,
trying to learn a couple of algorithms together and seeing if learning one algorithm transfers, making it easier to learn a separate, related algorithm. So yeah, generally,
I think that direction is quite interesting to me, seeing if we can learn multiple algorithms and then have an agent
which can choose which algorithm is relevant in specific instances.
That sounds quite cool to me.
One algorithm to rule them all.
Yeah, that's what it sounds like.
Well, thank you. Thank you all.
It's been really
a learning experience for me.
So I learned a bunch of things today.
It's one of the good things
about doing this work, I guess.
I keep learning.
And I appreciate your time,
especially knowing,
well, you all have deadlines coming up.
So many thanks.
Thanks so much for the opportunity.
It was a really,
really delightful discussion. Thank you. Thank you. I hope you enjoyed the podcast. If you like my work,
you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.