Orchestrate all the Things - DeepMind wants to reconcile Deep Learning and classical computer science algorithms with Neural Algorithmic Reasoning. Featuring DeepMind's Petar Veličković and Charles Blundell, MILA's Andreea Deac

Episode Date: September 10, 2021

Will Deep Learning really be able to do everything? We don't really know. But if it's going to, it will have to assimilate how classical computer science algorithms work. This is what DeepMind is working on, and its success is important to the eventual uptake of neural networks in wider commercial applications. This work goes by the name of Neural Algorithmic Reasoning. Join us as we discuss roots and first principles, the defining characteristics, similarities and differences of algorithms and Deep Learning models with the people who came up with this. We also cover the details of how Neural Algorithmic Reasoning works, as well as future directions and applications in areas such as path finding for Google Maps. Could this be the one algorithm to rule them all? Article published on VentureBeat. Image: Getty

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. Will deep learning really be able to do everything? We don't really know. But if it's going to, it will have to assimilate how classical computer science algorithms work. This is what DeepMind is working on and its success is important to the eventual uptake
Starting point is 00:00:22 of neural networks in wider commercial applications. This work goes by the name of Neural Algorithmic Reasoning. Join us as we discuss roots and first principles, the defining characteristics, similarities and differences of algorithms and deep learning models with the people who came up with this. We also cover the details of how Neural Algorithmic Reasoning works, as well as future directions and applications in areas such as
Starting point is 00:00:49 path finding for Google Maps. Could this be the one algorithm to rule them all? I hope you enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook. Thanks, everyone, for making the time. The occasion is basically some work that you recently published on neural algorithmic reasoning, which sounds pretty intimidating in a way. So the idea here is to try and break it down
Starting point is 00:01:20 and make it a little bit more approachable, let's say. And what caught my attention and the main idea, I guess, is, well, you're trying to make neural networks and deep learning kind of emulate, you know, usual algorithms, let's say. And that seems like a pretty unusual and also pretty ambitious idea. And I guess people will be interested in that. And that's why I thought, OK, let's see what this is all about. So let's start by introductions, basically.
Starting point is 00:01:52 So if you want to just take your turn, say a few words about yourselves and how you actually all got to work together on this, which is, I guess, another interesting aspect of this story. Sounds good. So I guess I'll start. My name is Petar Veličković. I'm a senior research scientist here at DeepMind. And I was in many ways, at least within DeepMind, a person who kind of kickstarted the algorithmic reasoning direction alongside Charles. And it came basically from my background, so to speak. I've come from a theoretical computer science education in my undergrad, and I've participated a lot in competitive programming contests. And I've
Starting point is 00:02:41 basically, my way into computer science was through algorithmic reasoning and algorithmic thinking in terms of classical algorithms. And since I started doing deep learning research, I really wanted to try to reconcile that with, you know, the classical algorithms that initially got me excited about the field. like very strong, even complementarity between the two of them, like what one of these methods tends to do really well, the other one doesn't do that well, and vice versa. And usually, when you see these kinds of patterns, it's usually a good indicator that if you can do anything to bring them a little bit closer together, you know, then you could end up with a very awesome way to fuse the best of both worlds and and make some really strong advances. So I guess that's a good summary of how I got into this. I'll hand it over to Charles. Yeah.
Starting point is 00:03:32 Hey, so I'm Charles Blundell. I'm a research lead here at DeepMind. And I'm generally interested in getting neural networks to make much better use of the really large quantities of data they're exposed to. There's a few things that I think are quite interesting there. So, you know, getting a network to tell us what it doesn't know, getting it to learn much quicker, or, you know, getting it to use this vast data to do something more than you might expect it's able to do. And this latter case, I think, is where neural algorithmic reasoning comes in. And I guess when Petar joined DeepMind, it was, you know,
Starting point is 00:04:13 our early conversations were a lot of fun, because we have a very similar background. A while ago, I also had a background in theoretical computer science. And I think, you know, a big question, like a fundamental question in machine learning for a long time, has been: how do you generalize? What is a notion of generalization? How do you work beyond the data examples you've seen? And algorithms are a really good example of something we all use every day to do this. And in fact, there aren't many algorithms out there. You know, if you look at the standard computer science textbook, there's maybe 50 or 60 algorithms that you learn as an undergraduate. And everything we're using right now to have this call is
Starting point is 00:04:54 using just a subset of those. And so there's like this very nice basis for very rich computation that we already know about. But it's completely disjoint from the things we're learning. And so I think when Petar and I started talking about this, we saw clearly there's a nice fusion that we can make here between these two fields that has actually been underexplored so far. So yeah, that's kind of my interest in that. Maybe Andreea, you should say. Yeah, I have a bit of a different beginning.
Starting point is 00:05:25 I also did theoretical computer science. I'm currently a PhD student at the University of Montreal, at Mila, and I'm doing an internship at DeepMind. I got interested in general graph representation learning through the lens of drug discovery, and then discovered some things that we could generally improve if we combine some sort of inductive biases we know of. Algorithms are things we've studied, so how do we combine them with just general reinforcement learning, or just general neural networks? I think that's what got me excited lately. Okay, cool. Thanks. And I have to say that Petar specifically mentioned some of your previous work,
Starting point is 00:06:21 which I wasn't aware of, to be honest, as kind of kickstarting in a way what he and Charles started working on for neural algorithmic reasoning. So I guess we can get back to that and you can talk about that in a bit more detail further down as we progress in the conversation. So thanks, everyone, for the introduction, and I guess we can jump straight to the core proposal of your paper, which you kind of outlined already. And I'm quoting here from my reading of the paper, which I have to say is, all things considered, for a typical paper in this space, quite approachable, actually, because it doesn't go deep into the math aspect, basically.
Starting point is 00:07:14 Well, not too much. It's more like a position paper, if my interpretation is right, at least. So that's what makes it approachable. So the key thesis is that algorithms possess fundamentally different qualities to deep learning methods, and this suggests that if deep learning methods were better able to mimic algorithms,
Starting point is 00:07:35 then generalization of the sort seen with algorithms would become possible with deep learning, which, as you also highlighted in your introduction, is something which is currently not available. So I thought a good way to approach this for people who may not necessarily be familiar with the defining properties, let's say, of algorithms versus deep learning models
Starting point is 00:07:59 would be to actually define those. Like, in what ways are algorithms and deep learning models different? So who would like to have a go? I can say something here. I guess, so with an algorithm, in most cases, it doesn't change. So it's a fixed set of rules that are executed on some input. And usually, good algorithms have a bunch of well-known properties. So for any kind of input you give, it gives a sensible output. It gives that sensible output in a reasonable amount of time. And you can usually change the size of the input and the algorithm keeps working. And I think those are really desirable qualities.
Starting point is 00:08:43 You can think about any processing that you want to do. The other thing you can do with algorithms, you can plug them together. So as I mentioned before, what we're using right now is just a composition of a bunch of algorithms being strung together. And the reason they all can be strung together is because of this kind of guarantee that they have. Like given some kind of input, they only produce these kinds of outputs. And that means that we can say, well, those kind of outputs can feed into another algorithm.
Starting point is 00:09:08 You can build a whole stack of these things. And I think that's maybe why, you know, people have been looking at learning algorithms for a while, actually, in deep learning. And it's always been quite difficult. Simple tasks are actually pretty widely used to debug things. So for example, a simple algorithm would be: I want to produce as an output the input. It's called a copy task. I just want to copy the input.
Starting point is 00:09:35 The output is just a copy of the input. And it turns out that you can learn to do this up to some length, but many networks, if you just increase the length of the input, so if I train on lengths one to ten and test on lengths one to a thousand, many networks will not generalize. They will not have learned the core idea: you just need to copy the input to the output.
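To make the copy-task probe concrete, here is a minimal, hypothetical sketch of how such a length-generalization test is usually set up; `model` is a stand-in for any trained sequence model, and all the names are illustrative, not from the work discussed:

```python
import random

def make_copy_batch(min_len, max_len, vocab_size=10, batch=32):
    """Inputs and targets are identical sequences; only the length varies."""
    xs = []
    for _ in range(batch):
        n = random.randint(min_len, max_len)
        xs.append([random.randrange(vocab_size) for _ in range(n)])
    return xs, [list(x) for x in xs]  # target == exact copy of the input

def exact_match_accuracy(model, seqs, targets):
    """Fraction of sequences the model copies perfectly, end to end."""
    hits = sum(model(x) == y for x, y in zip(seqs, targets))
    return hits / len(seqs)

# Train on lengths 1-10, then probe far outside the training range.
train_x, train_y = make_copy_batch(1, 10)
test_x, test_y = make_copy_batch(900, 1000)

# A model that truly learned "copy" passes both splits; most trained
# networks pass the first split and fail the second.
identity = lambda x: list(x)
print(exact_match_accuracy(identity, test_x, test_y))  # 1.0
```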
Starting point is 00:10:02 This gets, you know, as you make the process more complicated, then you can imagine this gets worse. So if you think about sorting, or various graph algorithms, actually the generalization is far worse if you just train a network to simulate an algorithm in a very naive fashion. But there's something very nice about algorithms, which is that they're basically simulations: you know, you can generate a lot of data with them, and that makes them very amenable, actually, to being learned by deep neural networks. But it requires us to then think, you know, from the deep learning side, what changes do we need to make there so that these algorithms can be well represented and actually learned in a robust fashion? So if you use deep learning for something, usually there isn't a very strong guarantee on what the output is going to be. So you might say that the output is a number between zero and one, and you can guarantee that, but you couldn't guarantee something more structural. So for example, very rarely would
Starting point is 00:10:57 you be able to guarantee that, you know, if I show this network a picture of a cat, and then I take a different picture of a cat, it will definitely classify that as a cat. There's a certain degree of error that's going to leak into it. Whereas if you had an algorithm for doing something like this, you may very well find that you could develop guarantees that this wouldn't happen. Now, this is partly because the kind of problems algorithms are applied to are more amenable themselves to these kind of guarantees. That's a very interesting thing itself, because the question is, if the problem is amenable to these guarantees, then maybe there's something we can bring across into the deep neural networks
Starting point is 00:11:32 on classical algorithmic tasks that allows us to develop these kinds of guarantees for the neural networks themselves. And those guarantees are usually about generalization. So there's generalization in terms of the size of the inputs, the kinds of inputs you have. There are algorithms that generalize over types, for example. If I have a sorting algorithm, I can sort a list of numbers, but I can also sort anything I can find an ordering over. So you can do it over letters and words and things like this. That's not the kind
Starting point is 00:12:03 of thing we see at the moment with deep neural networks. Usually when people learn sorting, they learn on one type of thing, like numbers. And they wouldn't necessarily even consider testing it on words, even though it's just the same algorithm. So there's this nice abstract representation of processing that's independent of the inputs. And that's part of the thesis here: to try and think about what this abstract processing is, a bit more detached from the actual representation of the inputs, and whether we can learn that to some extent, and learn to use it. So that's sort of my little bit there. I don't know if there's anything Petar or Andreea would like to add to that. Yeah, that was fantastic, Charles. I would maybe add only a few
Starting point is 00:12:43 things that are also important to note on the distinction between deep learning and algorithms, at least classical algorithms. First of all, you know, algorithmic computation can usually be very concisely and elegantly written down as pseudocode that explains how you go from your inputs to your outputs. This makes them trivially interpretable. And because they operate over these, like, abstractified inputs that conform to some preconditions and postconditions, this means it's much easier to theoretically reason about them, like make guarantees and stuff like that. And it also makes it much easier to find connections between different problems that you might not see otherwise. I think max flow and min cut is one of my most loved examples, where, you know, there are two problems that are seemingly quite different,
Starting point is 00:13:30 but the solution of one is necessarily the solution to the other. And that's not as obvious unless you study it through a very abstract lens. And yeah, so there are a lot of benefits to this kind of elegance and constraint, but it's also the potential shortcoming of algorithms, because if you want to make your inputs conform to these stringent preconditions, what this means is that if my data that comes from the real world is even a tiny bit perturbed and doesn't conform to my preconditions, I'm going to lose a lot of information before I can massage it into the algorithm.
Starting point is 00:14:01 And that obviously makes it suboptimal, because even if the algorithm gives you a perfect solution, it might give you a perfect solution in an environment that doesn't make sense. And therefore, the solutions aren't going to be something you can use. This is something we will touch upon quite a bit when we talk about Andreea's work as well. But basically, that's quite important. And on the other side, deep learning is designed to just rapidly ingest lots of raw data at scale, and pick up what the interesting rules in that raw data are, without any real strong constraints. And this makes it remarkably resilient in these kinds of noisy scenarios.
Starting point is 00:14:32 Like, you can perturb your inputs and your neural network will still be reasonably applicable; for classical algorithms, that may not be the case. And that's also another reason why we might want to find this awesome middle ground, where we might be able to guarantee something about our data, but not require that data to be constrained to, say, tiny scalars, when the complexity of the real world might be much larger. So yeah, that's maybe a comment I would add.
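As a tiny illustration of the type-generality Charles described a moment ago: one and the same sorting procedure works unchanged on anything you can find an ordering over, which is exactly the abstract quality a network trained only on numbers never gets to see. A trivial sketch:

```python
def insertion_sort(items, key=lambda v: v):
    """One abstract procedure; an ordering on the keys is the only requirement."""
    out = []
    for item in items:
        i = 0
        while i < len(out) and key(out[i]) <= key(item):
            i += 1
        out.insert(i, item)
    return out

print(insertion_sort([3, 1, 2]))                  # numbers
print(insertion_sort(["pear", "apple", "fig"]))   # words, same algorithm
print(insertion_sort([(2, "b"), (1, "a")], key=lambda t: t[0]))  # records
```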
Starting point is 00:14:56 Andreea, do you have some thoughts as well? I feel like I will continue on both of those threads when I talk about XLVIN. So, yes, I think you covered everything other than the exact way we apply this. Maybe I would add one comment then, which is, like, where do algorithms come from? And I think this is a really important thing for us to think about when we think about applying all of this. So, you know, usually what happens is you find a very clever theoretical computer scientist, and you tell them about your problem, and they think really hard about your problem. So let's imagine, you know,
Starting point is 00:15:37 you've got some real world problem you really want to solve, but you're not quite sure how to solve it efficiently. They'll then often go away and they'll sort of map it onto a more abstract version of that problem, derive an algorithm for that, and then tell you that for everything in this problem class they've got, this algorithm will execute in this amount of time and give you the right answer. And there's a bit of an inductive leap there,
Starting point is 00:15:59 which is that the mapping from your real-world problem into this abstract space on which the algorithm is derived isn't always completely aligned. And I think one thing we see with machine learning is kind of the opposite, which is that it just looks at the data. It doesn't really map it onto an abstract space, but it does solve the problem based on what you tell it. And so you have these two extremes. And I think what we're trying to do here is get somewhere in between those two extremes, where you have something that's a bit more structured, but it still fits the data. And it's not necessarily a process with a human constantly in the loop. You don't necessarily need to think really hard as a theoretical computer scientist.
Starting point is 00:16:38 And there's some opportunity, because often real world problems are not exactly mapped onto the problems that we have algorithms for. And also, for these things that we have algorithms for, the abstract problems, often there's like a few examples in there that are really, really difficult to solve. And they're what make the algorithms really, really slow. And they may not occur in the real world. And so if we have something that can learn from your data, that really just solves the problem you've actually got, rather than something much more abstract and much more expensive, there's the potential to get algorithms that generalize just as well,
Starting point is 00:17:13 but actually are much faster than what we already have. And so I think one of the things all three of us have been really thinking about here, and this is exactly where Andreea's work comes in, is: can we have new algorithms that significantly outperform existing algorithms, but have the same sort of guarantees for the real world? Sorry, I just wanted to mention that. Yeah, it's a good point.
Starting point is 00:17:35 And actually, you touched upon a few points that I made notes on and wanted to ask follow-up questions on. So one of them has to do with complexity, really. So one of the defining properties of algorithms is precisely that: you can study them theoretically and derive their complexity, and therefore make predictions about the time in which they're going to execute, which is not the case for machine learning and deep learning. So I guess this is probably one of the aspects that you're aiming to address.
Starting point is 00:18:08 Yeah, I guess for most deep learning algorithms, and a lot of machine learning algorithms, once the network is trained, the runtime is a fixed function of the input size, because the network doesn't change, and it's like linear or maybe quadratic. There's maybe one exception, which would be something that involves search, where there's exponential time, but that's very rare. Whereas I think when a human sits down to write a program,
Starting point is 00:18:34 it's very easy to get something that's got exponential time. You know, if you try and solve a problem manually, if it's sufficiently hard, you'll often come up with an exponential time solution, unless you've done a lot of programming competitions like Petar has, in which case maybe you get to a very quick solution quickly. But in general, you'd come up with something quite slow. Neural networks are the opposite. They're extremely lazy. And that's a very desirable property in learning and coming up with new algorithms; you really don't want something that's very slow.
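As a toy illustration of that point: the "obvious" hand-written solution to a problem like subset sum branches exponentially, while a little algorithmic thinking, dynamic programming over reachable sums, brings it down to pseudo-polynomial time. A generic sketch, not tied to the work discussed:

```python
def subset_sum_naive(nums, target):
    """The 'sit down and write it' solution: O(2^n) branching."""
    if target == 0:
        return True
    if not nums:
        return False
    return (subset_sum_naive(nums[1:], target) or
            subset_sum_naive(nums[1:], target - nums[0]))

def subset_sum_dp(nums, target):
    """Dynamic programming over reachable sums: pseudo-polynomial time."""
    reachable = {0}
    for n in nums:
        reachable |= {s + n for s in reachable if s + n <= target}
    return target in reachable

nums = list(range(1, 19))
assert subset_sum_naive(nums, 150) == subset_sum_dp(nums, 150)
```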
Starting point is 00:19:01 So we actually have to put all the work into making the networks use more computation. That's actually something that we thought a lot about, and there's still a lot of work to be done there. So there are people who've looked at networks that can adapt the amount of computation time they use by changing the amount of pondering they do on a problem. And the idea there is to work out when to stop. But, you know, that's quite new and not that widely used. And maybe Petar will mention later, in pointer graph networks we had a mechanism very much like this that helps us when you don't know ahead of time how much computation to use. You know, there's this whole problem in deep learning of how you design the architecture you use, and it has a huge impact
Starting point is 00:19:41 on how well it works. And there's a strong connection between how much processing you do and how much computation time you spend and what kind of architecture you come up with. They're intimately linked. I think it's very interesting to think about. And as you go through more and more complicated problems and the algorithms associated with them, you have to then develop networks
Starting point is 00:20:00 that have that same computational expressivity, which isn't always there, actually. And even if it is there, it's not always possible to get a network that's trained to use it. So those are some of the challenges I think are really interesting with this particular domain, which you don't necessarily find as explicitly in other domains, although it certainly exists. Like, something like the game of Go certainly has a lot of computational demands. It's really massive. But the algorithm isn't necessarily clear, and so it's harder to study from that perspective.
Starting point is 00:20:32 I would maybe just add one quick remark about complexity, which is that in algorithms, in theoretical computer science, people usually look at, you know, the worst case. I think in what we're doing here, like trying to apply algorithms on really naturally occurring inputs, we have a bit of a redeeming quality, which I like to concisely state as: nature is not an adversary. So even if the underlying problem you're solving is NP-hard in the worst case and requires exponential time to solve, the naturally occurring instances of that problem that you're interested in may well be polynomial time solvable. So while traveling salesman, for example, is an NP-complete problem and is known to be so,
Starting point is 00:21:10 and we don't know of any polynomial time algorithm for it, there exists a solution that's 100% correct for the traveling salesman problem for all the towns in Sweden, all the towns in Germany, all the towns in the USA. And that's because, you know, geographically occurring data actually has nicer properties, right, than any possible graph you could feed into a traveling salesman solver. So one thing that we do sometimes when we solve natural problems with algorithms is we try to push them into this framework we've come up with, which is nice and abstract. But as a result, we might have made the problem more complex than it needs to be. So in algorithmic reasoning, and I will touch upon this very shortly, we often deal with learning heuristics that are strictly polynomial time. So they're not going to get to exponential complexity at all.
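As a toy aside on strictly polynomial-time heuristics: even something as simple as a nearest-neighbour tour, an O(n²) heuristic for the traveling salesman problem, tends to do respectably on geographically clustered points, which is the "nature is not an adversary" point in miniature. A generic sketch, not the learned heuristics discussed here:

```python
import math, random

def nearest_neighbour_tour(points):
    """O(n^2) TSP heuristic: always hop to the closest unvisited city."""
    unvisited = set(range(1, len(points)))
    tour = [0]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(points, tour):
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

cities = [(random.random(), random.random()) for _ in range(200)]
tour = nearest_neighbour_tour(cities)
print(f"tour length: {tour_length(cities, tour):.2f}")
```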
Starting point is 00:21:54 But they may well be just enough to solve all of the instances you actually care about in the real world, because it uses the power of deep learning to extract the meaningful properties from the raw data. So I just thought that was a very important point to make here. And then the other follow-up question I have was a kind of naive one really, but some of the things you described actually are not specific to deep learning. They apply to machine learning in general. So I was wondering why did you choose to address deep learning, basically? Why did you choose to go for that generalization framework, specifically applied to deep learning algorithms and not just any machine learning algorithm?
Starting point is 00:22:38 So I guess one reason, which I think will also flow into the next segment where we'll talk about our blueprint, is that we want to design solutions that operate over the true raw complexity of the real world. So something that will take the raw data you get from nature and make decisions that are algorithmic based on that data. And so far, the best solution we have for processing large amounts of naturally occurring data at scale is generally deep neural networks. Like that's the current state of the art for massive quantities of naturally occurring data. And that's why we, I guess we focused our discussion on it. But in terms of the applicability of the blueprint, I think at least the blueprint we propose is amenable to
Starting point is 00:23:20 any kind of gradient based optimization. So it doesn't have to be deep neural networks. But like, I can totally see how if somebody wanted to, say, apply a genetic algorithm to this framework, they would still be able to do it if they wanted to. Charles, you might have a comment on this, but yeah. Yeah, I guess maybe, like in practice for a single problem, like in a sort of very applied setting, you might well find like logistic regression
Starting point is 00:23:46 or decision trees work fantastically well. And I guess with deep learning, as Petar mentioned, when you start to deal with text or audio or visual inputs, video or static images, deep learning starts to really take off. It's able to deal with those. And one of the reasons it really takes off there, we think, is because inside the network,
Starting point is 00:24:09 it has these much richer representations of the data that's really more elaborate than what you can get with these other kinds of machine learning methods like decision trees or logistic regression. And so when we're thinking about extending the computational capacity even further and being able to do more computation and richer kinds of processing of data than we currently see in neural networks, it's natural to think that that might provide a useful basis for doing that. And I guess what we find is that it does,
Starting point is 00:24:42 but actually we need to look at new neural network architectures that are distinct from ones people have previously thought of, to be able to do that kind of processing. So even inside something that's quite a large model class, which is very rich and complicated, we find that we need to push the boundaries even further than that to be able to execute algorithms reliably. And so it is a sort of, you know, an empirical science that we're looking at. And, you know, with some of the things out there, as you get richer and richer decision trees, they can start to do some of this processing. But we know that it's basically a tree of if this, then that.
Starting point is 00:25:19 And what's missing from that is recursion, or iteration, the ability to loop over things multiple times. Whereas with neural networks, for a long time, really, people have understood that there's a relationship between iteration or recursion and recurrent neural networks, for example. That's maybe one motivation for them. And in graph neural networks, the same sort of processing arises again, like the message passing that you see there,
Starting point is 00:25:43 like the message passing that you see there, is again something very natural in object-oriented programming, where you send messages between classes or objects. You can see it's exactly analogous. And you can build very complicated interaction diagrams, and those can then be mapped onto graph neural networks. So it's from the internal structure that you get a richness that seems, it might be powerful enough
Starting point is 00:26:06 Okay, thanks. That explains things a little bit, and I think it's also a good occasion to actually go to the next part of the conversation, which is the core, basically, of what your research proposes: the blueprint for neural algorithmic reasoning. It would be nice if the people listening to the conversation could see the diagram you created for it, because
Starting point is 00:26:43 it's quite elaborate. But the main idea, if I'm able to convey it a little bit, is basically this kind of mapping between inputs and outputs. And that's why it's related to what you just mentioned about data representation, basically, and that being the reason for you choosing to work with neural networks. So it seems like the core idea basically is taking some sort of elaborate real-world, let's say, data artifacts, or things that happen in the real world, basically, and using some data representation to make it processable by a neural network. So do you want to elaborate on that, and how do you envision that happening? Yeah, I think I will answer this one. So basically, I guess the best way to present how we arrived at the blueprint is to start from what is the status quo? Like, where are we right now in terms of processing real-world data with
Starting point is 00:27:39 algorithms, and gradually fixing some of the issues there by introducing neural networks at various stages of the pipeline; I think that is a way to kind of nicely, step by step, arrive at our blueprint. So imagine you have a real world. I'm going to use the running example from the paper, but any other example with real world data works. So you have, you know, a real world road network over which you need to predict travel times, which is a very standard problem. Within DeepMind, we've worked on a Google Maps application where we applied graph neural networks to predict travel times. And this is now serving queries in Google Maps worldwide.
Starting point is 00:28:14 So it's a very impactful problem that hits a lot of users. But fundamentally, you cannot just easily imagine the road network as a graph and apply Dijkstra's algorithm on it to compute shortest paths, right? Or even estimate travel times. But in principle, if a human wants to apply a classical algorithm like Dijkstra's, this is what they would have to do. They would have to take all the complexity of the real world, you know, roadblocks, weather changes, traffic flows, bottlenecks, God knows what, and convert that into an abstractified graph of nodes and edges with some weights, which sort of correspond to the travel time, so that then you can run
Starting point is 00:28:51 Dijkstra's algorithm over this thing and predict the shortest paths, and then somehow map that back into a route. So this is the standard pipeline, how algorithms were originally proposed to be used. Now, there's a few issues there. And I think the first problematic issue, as I said, and we will keep repeating this issue time and time again throughout this interview, is that algorithms are very happy to give you perfect answers, regardless of whether or not the inputs themselves are perfect. So yeah, it's going to give you the shortest path, but who guarantees that this graph you came up with is an accurate representation of your real world scenario, right? So the algorithm might give
Starting point is 00:29:25 you great solutions, but, you know, there's really no way of saying whether this human device heuristic is actually the best way of looking at it. And, you know, that's the first point of attack. So historically, we found that whenever there's a lot of human feature engineering, and lots of raw data that we need to work with for that human feature engineering, this is where deep learning shines. Like the initial success stories of deep learning were exactly in replacing handcrafted heuristics for, say, image recognition with deep neural networks. So that's exactly what we do to begin with. Replace the human that maps the real world raw data to a graph input with a neural network that will map the complexity of the raw data to some abstractified graph
Starting point is 00:30:05 inputs that we can then run Dijkstra's over. So far, so good. That's looking like it might help. And there exists a really good line of research that's actually concurrent to our work that works exactly in this setting. And they figured out a way to even propagate gradients through the algorithm. So you can actually get good neural network optimization in this setting and apply the algorithm, which is awesome. But there are a few limitations with the setting,
Starting point is 00:30:28 even if you're able to propagate gradients through it. And that is, first of all, the algorithm may not be everything you need to compute the final solution, right? You might say some part of this problem requires shortest paths, but maybe at some point, I also need to run a flow algorithm.
Starting point is 00:30:43 And I'm not really sure how to combine the results of the two in the best way, because, you know, the problem is very raw and noisy, right? So just, you know, forcing your representations to rely on the output of this one algorithm might be too limiting. And there's also the fact that we compressed: like, you know, Dijkstra requires one scalar value per edge in the graph, which means that you're compressing all that amazing complexity of the real world into just one number for every road segment. And that's potentially problematic, because if you don't have enough data to estimate that scalar correctly, you're doomed. Like basically, the algorithm is not going to give you correct solutions once again, because you just haven't seen enough raw data to estimate that scalar correctly. And this is
Starting point is 00:31:24 basically a bottleneck problem, which in the paper we refer to as the algorithmic bottleneck. So the bottleneck happens because we're placing all of our bets on this one scalar in every edge, or basically a very low dimensional representation of the problem. So how do we break bottlenecks? Well, we stay high dimensional. You know, deep neural networks are deriving a lot of their success from staying high dimensional and doing all of these, like, high dimensional regularization techniques, which means that even if you predict some parts of the internal vector in the neural network badly, the other parts of that vector can still step in and compensate. Not everything is lost, especially at lower data regimes.
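For concreteness, the classical algorithm in this running example looks like the standard textbook sketch below; note that it consumes exactly one scalar weight per edge, which is precisely the algorithmic bottleneck being described:

```python
import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbour, weight), ...]} with one scalar per edge.

    All of the real world's complexity (traffic, weather, roadblocks)
    must be compressed into that single weight before this can run.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

roads = {"a": [("b", 4.0), ("c", 2.0)], "c": [("b", 1.0)], "b": []}
print(dijkstra(roads, "a"))  # {'a': 0.0, 'c': 2.0, 'b': 3.0}
```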
Starting point is 00:32:04 So that's what we now have to do. But algorithms weren't designed to run on high dimensional inputs. They're designed to run on these low dimensional representations. So the key idea in our blueprint is to replace the algorithm with a neural network, typically a graph neural network, and we will discuss this later on, which actually takes a high dimensional input and does one step of reasoning to produce another high dimensional input. And then once we've done enough steps of that reasoning neural network, we predict the final outputs from there. Now we're completely high dimensional everywhere, so, you know, standard neural network optimization can work much more nicely. Now, the only thing we need to guarantee is, how do we make this high-dimensional
Starting point is 00:32:45 neural executor actually imitate the algorithm, like actually do useful stuff with respect to the algorithm. And that's why we also have the abstract pipeline, where that neural network is actually pre-trained. So we know that we want to use some algorithm, and as Charles said, we can usually generate lots of synthetic data for that algorithm, so we can basically pre-train our high dimensional component to respect the steps of the algorithm. And a lot of our research prior to writing this paper, primarily Charles's and mine, we wrote papers like neural execution of graph algorithms, pointer graph networks, persistent message passing, which dealt with just that part. Like, how do we get reliable and robust neural networks that simulate algorithms in an abstract space? But we hadn't actually looked at the natural part of the pipeline. Like, can this actually be used in a real natural problem?
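A minimal sketch of the encode-process-decode shape just described, in plain numpy; all the weights here are untrained random placeholders, whereas in the blueprint the processor would be pre-trained on synthetic traces of the abstract algorithm and then reused on natural data:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent width: stays high-dimensional end to end

# Placeholder weights, purely illustrative.
W_enc = rng.normal(size=(3, D))       # raw node features -> latent
W_proc = rng.normal(size=(2 * D, D))  # one message passing step
W_dec = rng.normal(size=(D, 1))       # latent -> task output

def encode(raw_features):              # f: natural inputs -> latents
    return np.tanh(raw_features @ W_enc)

def process(h, edges):                 # P: one latent reasoning step
    msgs = np.zeros_like(h)
    for u, v in edges:                 # aggregate messages from neighbours
        msgs[v] += np.tanh(np.concatenate([h[u], h[v]]) @ W_proc)
    return h + msgs                    # residual update, still D-dimensional

def decode(h):                         # g: latents -> natural outputs
    return h @ W_dec

x = rng.normal(size=(4, 3))            # 4 nodes, 3 raw features each
edges = [(0, 1), (1, 2), (2, 3)]
h = encode(x)
for _ in range(3):                     # "enough steps" of the processor
    h = process(h, edges)
print(decode(h).ravel())
```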
Starting point is 00:33:34 This is actually where Andreea's work comes into play, because she has actually demonstrated that this blueprint can be applied within a challenging, noisy input problem. And I'm assuming that's going to be the next topic of discussion. But basically, that's it, unless you have any follow-ups on this part. Actually, I was going to suggest that this is probably the right part for Andreea to talk a little bit about her work and how it comes into play in the blueprint. Yes, sounds good. I guess I'll start a bit by talking about reinforcement learning, because
Starting point is 00:34:08 all this was plugged into the reinforcement learning framework. So, what we've seen is that reinforcement learning has been successfully applied in games, like AlphaStar and AlphaGo, and for real-world applications, like reducing the cost of data center cooling. I think also Loon was using it. But basically, in all these problems, we frame our problem to talk about states and actions. And what we want to do
Starting point is 00:34:48 is to estimate how good our states are. So we define some value over these states. So, for example, if you think of a chess game, this value is, like, how good is this move? And what we want is to generally make plans, like, take decisions, such that this value is maximized. So this value is like a core part of taking decisions. And we have algorithms to get these values.
Starting point is 00:35:17 So if we know the environment dynamics, so like how we can move from one state to another, and what sort of reward we get from one state when taking an action, we have an algorithm, the value iteration algorithm, which can give us perfect estimations of how good the states are. And yeah, basically this iterates on the value estimates and gets them better and better until they converge to the ground-truth values. Why I'm talking about this is because this is the algorithm we tried to imitate.
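For readers who haven't seen it, value iteration itself is only a few lines when the environment dynamics are known; this is the abstract algorithm being imitated, in its standard tabular form, not the paper's code:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a, s, s'] are transition probabilities, R[a, s] immediate rewards.

    Repeatedly improves the value estimates until they converge to the
    ground-truth values, exactly as described above.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V    # Q[a, s]: value of action a in state s
        V_new = Q.max(axis=0)    # greedy backup over actions
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new

# Tiny 2-state, 2-action environment with known dynamics.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(value_iteration(P, R))
```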
Starting point is 00:35:50 So on synthetic data, simple graphs, we tried to imitate the value iteration algorithm. And the idea was that in previous work, like DQN, neural networks tried to estimate the values. And then some works have used these values as numbers and roughly applied the value iteration algorithm on top of these predicted values. And what we wanted to do is step away from getting these values, like these precise numbers, over which we execute the algorithm, and keep everything in a high dimension,
Starting point is 00:36:34 so that we can handle the case, which will happen, like, at the beginning, where these predicted values of the network won't be that good. And what we know is that if we have an actual algorithm, our algorithm's output will only be as good as its inputs. It will do what it's supposed to do, but if your inputs are not what you actually want, then your outputs are not going to do what you want. So with the neural network plugging in its estimates of values into the value iteration algorithm, the output might not be that good,
Starting point is 00:37:04 at least at the beginning, when the network is still learning. So what we did is we first learned how to imitate the value iteration algorithm, and then we used the network's estimates in this, like, high dimensional space. Yeah, and I guess this allows the network to recover from mistakes. It has data efficiency benefits. You don't need that much data at the beginning to get off the ground
Starting point is 00:37:37 because your algorithm can deal with your imperfect estimates when you don't know the environment's dynamics, which is what usually happens in the cases of applying neural networks in reinforcement learning. But yeah, it also opens the direction for a lot of other ways we can use algorithms in neural networks. For example, in this case, we knew value iteration is a useful algorithm: if we have the values, we can make a plan and we can act. But if we know other things, like that the shortest path is something of interest in this environment, then we can learn an algorithm which learns how to find the shortest path, and we can plug it in. So I guess that's, like, just a
Starting point is 00:38:17 general way we can see how we can combine specific algorithms, which we know are going to be useful, with the reinforcement learning framework, in this case. If you want to add anything, I probably skipped over or moved very fast over a lot of parts. I guess I would just quickly add to mention the paper, just to be clear.
Starting point is 00:38:49 This is the eXecuted Latent Value Iteration Network paper, or XLVIN, and it's publicly available on arXiv. And we looked at, basically, as Andreea mentioned, we learned the value iteration algorithm in the abstract space,
Starting point is 00:39:02 then we plugged it into a reinforcement learning agent, and we directly compared it against an identical architecture, which, rather than using our algorithmic component, predicts the values and runs value iteration on them directly. And, you know, this second agent actually managed, in some cases, to catch up to ours if you give it a lot more interactions with the environment. But consider very low data regimes, where people are trying hard to get reinforcement
Starting point is 00:39:28 learning to do better, because for Atari, sometimes you need billions of frames before you get a good policy, right? And that's a lot of simulation budget that you may not always have in the real world. You know, we found that at the beginning, this version that propagates values through a fixed algorithm actually performed worse than a baseline which didn't have any algorithm in the first place, because initially those value estimates were so bad that the algorithm was giving you useless answers. Whereas our high dimensional version of the algorithm was able to kind of compensate for any mistakes there by also doing some pure model-free learning. And as a result, it stayed on top for the most part; basically throughout the low data regime, it was consistently on top compared to the others.
Starting point is 00:40:10 And to be more concrete about where we applied this, we applied it on some classical control problems and on the classical Atari benchmark directly from pixel representations of the Atari frame. Okay. On the topic of dimensionality, and this is something, well, you mentioned previously graph neural networks,
Starting point is 00:40:34 and this is something you also cite in the paper, previous work by yourselves as well as others. And to be honest with you, if I can say just a few words about how I came to know your work, this is precisely how. And I come from, I have a general interest in graphs, let's say. My background is more kind of like traditional computer science, I guess same as you. And so I came to graphs from more from the data modeling angle, so knowledge graphs and data modeling.
Starting point is 00:41:06 But I quickly realized that graph machine learning is, you know, really kind of booming at the moment, and, to be honest, not having a background in it myself, I tried to distill why that is. So I came to the conclusion that, well, basically, it seems like when you use graph machine learning, you're able to use more information. So you leverage the connections, as opposed to just vectors, basically. And you can correct me on that, but I was wondering if you would like to just say a
Starting point is 00:41:38 few words about the way that graph neural networks are relevant for your blueprint, basically. Sounds great. So I think the best way to do this is, I'll say a few words about graph representation learning in general, and I'll maybe hint a bit at why they're such a good choice within our blueprint. And then I think Andreea can elaborate a bit more on some specific results that are already known that make graph neural networks very tightly linked to the kinds of computation that we would want.
Starting point is 00:42:11 And then maybe, Charles, if you have some philosophical remarks at the end, we can incorporate them in. But yeah, basically, my background for my PhD was in graph representation learning, so I'm intimately familiar with what's going on in the area right now. And I felt like, you know, it gives us a great tool for processing data that lives on a graph. And, you know, if you squint hard enough, any kind of data that's given to you can be fit into a graph representation. Like, images can be seen as graphs of pixels connected by proximity. Text can be seen as, you know, a sequence of objects linked together. But, you know, more generally, things that truly come from nature that aren't like engineered to fit within a frame
Starting point is 00:42:54 or within a sequence like humans would do it, are actually quite naturally usually represented as graph structures. So like molecules, for example, they have a very nice graph interpretation. Social networks, transportation networks, connectome structures in the brain, all sorts of data is very naturally represented as a graph structure. And I think we're seeing a boom of just very amazing application papers and industrial and scientific applications of people just realizing that graph neural networks are a thing and that their data is actually much more naturally represented as a
Starting point is 00:43:25 graph than, say, as an image, and as a result getting some very interesting results out there and some very impactful outcomes. And, to kind of set the scene for why I think they're a pretty good class of models to look at, I'll just mention briefly what I think is the most exciting application of graph neural networks that I've seen recently. And I think Andreea can comment on that a bit more. One of the most popularized, for sure, is the application of graph neural networks in drug screening and drug discovery. So a team from MIT was, right before the COVID pandemic, successful in training a graph neural network to predict molecular properties, specifically, would a molecule be a good antibiotic. And then they evaluated this on
Starting point is 00:44:09 a huge set of candidate molecules. They sent the top predictions to chemists for further study. And they just ended up discovering, using this graph neural network screener, a molecule that was previously never considered to be an antibiotic; it was considered for diabetes. But because of its unusual structure, it actually had a much higher potency against the heavily resistant strains of microorganisms that you might want to eradicate. So this is one very clear case that was widely popularized in the media about just how graph neural networks can impactfully affect a core scientific problem. And within DeepMind, we've also worked on a lot of these very impactful applications. We've used graph neural networks to closely study the structure of glass and how it evolves,
Starting point is 00:44:53 which is, by the way, a big mystery from a physics perspective in many aspects. We've used graph neural networks to accurately predict travel times, which is very much in scope of what we're discussing here. And that's currently serving billions of queries worldwide in Google Maps and so on and so forth. Like generally, there's a lot of very impactful ways of looking at your data in a graph-structured format and using a graph neural network to support that.
Starting point is 00:45:17 Now, Andreea, I think you'd be in a good place to say a few words about just what results are out there already that show why graph neural networks are such an important part of our blueprint. Yeah, I guess the first thing that comes to my mind is the paper, if I'm not mistaken, called What Can Neural Networks Reason About?, or something like that, where they showed there's quite a clear alignment between the graph, like the message passing update rule,
Starting point is 00:45:51 and the dynamic programming update rule. We can write them so that we can see how the two of them can do the same computation, basically. I don't know if you had something else in mind, but that was what convinced me and I guess what got me to first try to imitate value iteration, which is a dynamic programming
Starting point is 00:46:12 algorithm, with a graph neural network. Yeah, so exactly. And to follow up on that, basically, this paper established a strong theoretical link between dynamic programming algorithms and the computations of a GNN, which means, you know, most polynomial time heuristics can be interpreted as dynamic programming,
Starting point is 00:46:29 which means that graph neural networks might be, you know, the right inductive bias for this kind of computation, right? Now, where does our work come into play? The work that Charles and I have done, for example: this initial paper just talked about the best case scenario, right? There exists a setting of the weights of my graph neural network that will make it nicely behave as that dynamic programming algorithm. But it doesn't tell you necessarily what's the best way to get there
Starting point is 00:46:53 and how to make it work with the specific data you have, or even extrapolation, these issues that are important for algorithms. So we actually spent a lot of time coming up with modifications to the graph neural network architecture, modifications to the training regime, modifications to various inductive biases that are plugged in there, in work that is on the arXiv, which actually not just went, let's imitate the iterative computation of dynamic programming, but also, let's try to include some data structures, which are crucial to how algorithms operate nowadays, and also incorporate maybe some aspects of persistent reasoning. So rather than just being able to support a simple data structure on top of my existing set of nodes, can I maybe create additional memory, additional nodes? Because many algorithms rely on initializing additional memory aside from the memory required to store their input.
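To make that alignment concrete with the classic example: one round of Bellman-Ford relaxation for shortest paths is exactly a message passing step, where every node aggregates "distance" messages from its in-neighbours with a min. A sketch of the correspondence, not a GNN implementation:

```python
def bellman_ford_step(dist, edges):
    """One dynamic programming update == one round of message passing.

    Each node v receives the message dist[u] + w from every in-neighbour u
    and aggregates with min, the DP analogue of a GNN's aggregation.
    """
    new = dict(dist)
    for u, v, w in edges:
        new[v] = min(new[v], dist[u] + w)
    return new

INF = float("inf")
edges = [("s", "a", 1.0), ("a", "b", 2.0), ("s", "b", 5.0)]
dist = {"s": 0.0, "a": INF, "b": INF}
for _ in range(len(dist) - 1):  # n-1 rounds suffice for shortest paths
    dist = bellman_ford_step(dist, edges)
print(dist)  # {'s': 0.0, 'a': 1.0, 'b': 3.0}
```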
Starting point is 00:47:50 So we've kind of developed models that tackle more and more, that allow us to align more and more with computation, while still following that same GNN blueprint, the link that was already spotted in that first theoretical paper. Charles? Yeah, I guess it's also worth mentioning graphical models, which are a kind of machine learning model that some people still look at, but which was kind of dominant before deep learning came to rise. And a lot of the stuff you see inside GNNs comes from that. And RL also has a similar, like, RL is basically a graph algorithm. It's a dynamic programming update. It's actually related to a shortest path algorithm. You
Starting point is 00:48:32 can show that it's very closely related. And there's a little bit of, you know, tweaking you have to do to make it online. And that's basically what RL does. It's like an online shortest path algorithm. And so then when you think about that, it's not too surprising if you're trying to find the shortest path in the graph and then you want to represent your problem as a graph, that maybe there's a good relationship there. Message passing itself in graphical models
Starting point is 00:48:58 was just the way you represent the dynamic programming updates and propagation information inside a probabilistic model. And the form of updates and the way things are combined inside graph neural networks is very much inspired by that, but loses some of the guarantees. So in some sense, graph neural networks are like a softening up and a generalization of an existing algorithm for doing propagation of probabilistic information in a graphical model. And now what we're doing is we're saying, okay, well, maybe that's a bit more general.
Starting point is 00:49:25 Like, as Petar mentioned, dynamic programming is a wonderful way to think about solving any kind of problem. You can't always do it, but when you can do it, it works really, really well. And I think that's maybe like one of the deep connections between graph algorithms and reinforcement learning and graph neural networks.
Starting point is 00:49:45 And now the question is, when you go to things beyond that, what's the basis for thinking that that might be represented by a graph? And I think object-oriented programming tells us a lot about how humans build the structures around how their data structures look and how they communicate between different parts of it. And that's a kind of message passing, a very different kind of message passing for a completely different reason. And that gives us kind of message passing, a very different kind of message passing for a completely different reason. And that gives another idea that there might exist another graph there.
Starting point is 00:50:08 But as we get more and more complicated, the question is, what is the structure of the graph? And that's really what's required a lot of thought. And that's what will continue, I think, to require us to think about: what is the right graph for a particular problem? And can you start to learn those graph structures?
Starting point is 00:50:26 Okay. So the last part of the conversation, which I guess we have like a few minutes to wrap up, I was going to ask you about the roadmap, basically, and applications. And I have to add that, to my surprise, in our previous exchange, Petar mentioned that there are some cases in which neural algorithmic reasoning has already worked. And so I guess it's a good idea to start by asking you to refer to those cases, and then from there, you can also mention future directions. Sure. So one case we already have, from Andreea's work, is using an abstract algorithm like value iteration within an agent that solves, say, the Atari reinforcement learning environment of video games purely from pixels. So this is a case of eventually, like, taking pixel data, processing it in many different ways, and eventually being able to apply a neural algorithm on top of that. And this actually outperforms, say, that same architecture with a fixed algorithm in there
Starting point is 00:51:38 and also a baseline without any kind of algorithm, but otherwise the same hyperparameters. So that's, like, one example that we already have of the blueprint already working in practice. We do have some work that is coming out very soon that shows that we are able to use the algorithmic reasoning blueprint to support unsupervised learning and self-supervised learning. Specifically, you know, often unsupervised learning is just, let's take a massive quantity of data and try to extract the most meaningful properties out of it. But very often, that's not all the information you have. You might have some knowledge about how your data came to be. For example, if you're working with estimating some representations from some physics simulations, like a bunch of balls bouncing around or n-body systems, you don't just see a bunch of snapshots of that system.
Starting point is 00:52:34 You know that it has to obey certain laws of physics. And we think the neural algorithmic reasoning blueprint is an excellent way to take those laws of physics, package them up into a neural network that you can splice in as part of your unsupervised architecture, and as a result, get better representations. And we're starting to see some really nice results on a variety of environments showing that this blueprint is actually promising. We should have a paper on the arXiv within the next few weeks detailing this. So hopefully by the time this article is out,
Starting point is 00:53:06 people will already be able to take a look. I'll send you a link. But yeah, basically, that's one thing that we have in the near future that shows that this can help just learn the properties of data better. And we are also just building off of Andreea's great work. We would want to also see where else we can splice algorithms within reinforcement learning, because obviously reinforcement learning is an area of really high interest for DeepMind, and I think also for the broader deep learning community generally. And we think there's algorithms left, right, and center inside the reinforcement learning pipeline. So basically, a wealth of opportunity for just various places we can splice algorithmic reasoners.
Starting point is 00:53:48 So I guess those would be kind of the immediate applications of the blueprint, ones that we already have and ones that we were thinking of in the near future that we might want to explore. But I'm sure Charles and Andreea have very interesting thoughts of their own, so I'd like to give them a chance to fill in. Sure. So who would like to go first?
Starting point is 00:54:11 Charles, you can go first. Yeah, sure. I guess I mentioned at the beginning, there aren't that many algorithms out there that we use. And the question is, can we learn all of them? There's not many of them. And I think that's the thing that I'm really excited to work out. Because if you can have a single network that is able to execute any one of the algorithms that you know already, then if you get that network to plug those algorithms together, you start to form really quite complicated processing pipelines or programs. And if it's all done with gradients going through it, then you start to learn programs. So there's an increasing interest in program synthesis and these kinds of things
Starting point is 00:54:54 where you explicitly produce a program that's then executed to solve some task. But I think it's really interesting to think of an alternative approach where you never explicitly produce the program, the code for it, but you still have something that can actually execute it. This is like the original neural execution work that Petar and I did; the sort of motivation for it is that you could already, in some latent space, execute a complicated program, where you've learned what goes into that program by composing together stable algorithms. And so that's, I think, another area that's very exciting. And, you know, of course, machine learning itself is just an algorithm.
Starting point is 00:55:31 So we have gradient descent. We have automatic differentiation, which, you know, you can think of as a kind of dynamic programming, actually. And, you know, in RL, there's a slightly more complicated loss that involves another form of dynamic programming. So then the question is, you know, if you really take this to its limit, then you start to really learn algorithms that do learning. And I think that becomes very interesting, because obviously, one of the limitations in deep learning is the algorithms
Starting point is 00:55:56 that we have for learning. You know, there hasn't been that much change in the best optimizers we use, or how we update the weights in a neural network during training, for quite a long time. And, you know, there's been a lot of search over different architectures and so forth, but they haven't always found the next breakthrough. And the question is, you know, is this a different way of looking at that, where we can start to find new learning algorithms? Because learning algorithms are just algorithms, and maybe what's missing from them is, you know, this whole basis that we have for other
Starting point is 00:56:26 algorithms that we use in all computers. And so we can get a slightly more universal algorithm executor, and use that as the basis for better methods for machine learning. Thank you. And Andreea, any closing thoughts from you? Yeah, I really like the remark Charles made, with, like, a network which runs multiple algorithms, all algorithms if possible. I think that's something I would personally like to investigate. Together with Mila colleagues, we've made some steps in that direction, trying to learn a couple of algorithms together and seeing if we can transfer, so that one algorithm makes it easier to learn a separate, related algorithm. So yeah, generally, I think that direction is quite interesting to me, seeing if we can learn multiple algorithms and then have an agent
Starting point is 00:57:25 which can choose which algorithm is relevant in specific instances. That sounds quite cool to me. One algorithm to rule them all. Yeah, that's what it sounds like. Well, thank you. Thank you all. It's been really a learning experience for me. So I learned a bunch of things today.
Starting point is 00:57:50 It's one of the good things about doing this work, I guess. I keep learning. And I appreciate your time, especially knowing, well, you all have deadlines coming up. So many thanks. Thanks so much for the opportunity.
Starting point is 00:58:03 It was a really, really delightful discussion. Thank you. Thank you. I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
