Signals and Threads - The Network as a Program with Nate Foster

Starting point is 00:00:00 Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky. It's my great pleasure to introduce Nate Foster. Nate Foster is both an old friend and an accomplished researcher at the intersection of programming languages and networking. He's both a professor at Cornell and a visiting researcher here at Jane Street, where he spends a day a week, mostly focusing on work with our networking team. So thanks for joining me. Thanks. I'm a big fan of this podcast. It's great to be here. To start out, I'd love to hear a little bit more about your origin story. Like, how did you actually get into this whole computer science world in the first place and also into the kind of place that you've gotten over time in this area of research that you think about? Yeah, so I didn't actually get into computer science really until partway through college. I started out as a physics major, and I like to say I was a failed physics major. I was taking the sort of standard sequence.

Starting point is 00:00:55 You take some mechanics, some electricity and magnetism, you do quantum. And as I was going through that sequence, I found I was kind of liking physics a little bit less. And on the side, I was taking CS classes, and I just loved them. They were, I was kind of fully engaged. I was, you know, doing well. And I really liked how all the pieces kind of fit together. You know, I could understand after a couple years, like how a program, you know, gets implemented down through a compiler, down through the ISA, through the chip, all the way down to gates or even lower.

Starting point is 00:01:25 And that was very pleasing. And I also really liked the idea that, you know, unlike physics, say, there's a lot of creativity. A lot of these abstractions that we have in computing are really designed by people. So things like lambda calculus and functional programming, like, who would think of that as a basis for computation? And yet it runs on a computer. I just found that really beautiful. Yeah, it's kind of an amazing thing about computer science. It's not like a science in a traditional sense, or at least not totally like a science,

Starting point is 00:01:51 in that you're not studying the natural world. You're studying human creations. And yet those human creations are very bounded by both all sorts of things about, like, the physical world of what you can reasonably implement, and bounded in important ways by mathematics. Right. There's something kind of amazing about the field. So I can say a bit about how I got into research. That was a bit of an accident.

Starting point is 00:02:12 I think many things in my career were sort of an accident. I had the idea that I might want to go to grad school, but I was maybe a little too shy to articulate that. And there was an ad when I was in my junior year, an ad for a summer research project working on a Java compiler. And I thought, well, I might want to get a master's or maybe even a PhD. I'll spend a summer doing research. And my mentor was a guy named Kim Bruce,

Starting point is 00:02:38 who's a programming languages researcher, and spent the summer hacking on that Java compiler, working on type systems, and just fell in love. Actually, what were you doing in Java? What were you trying to do with the type system? Yeah, so this is way back, Java originally was sort of a big advance. You know, it sort of brought things like garbage collection

Starting point is 00:02:54 to the mainstream for the first time. I mean, garbage collection had existed, but, you know, in a language that was being used really widely, that was new. Right. And just to be clear, shockingly, 40 years after the invention of garbage collection? That's right. Garbage collection existed, you know, decades before.

Starting point is 00:03:09 It's like the single biggest advance in like programming language productivity and something like 40 years between the time of an invention and the time of mainstream use, which I've always found like a really shocking fact. Yep. And Java also had this type system. It had a static, believed to be, you know, sound or mostly sound type system. And that was very exciting to the academic programming languages crowd because they, you know, we love type systems. And yet, again, the mainstream languages at the time, you know, C and C++, don't really have sound static type systems. So the research we were actually doing was adding polymorphism to Java.

Starting point is 00:03:41 Now Java has generics, but at the time it didn't. And there was a process by which the community was allowed to sort of propose, you know, ways of bringing generics to Java. So my advisor, Kim Bruce, had such a proposal. And, you know, he's at a small teaching college, so he didn't have grad students. So he found undergrads to work on implementing this type system. So is the one that you worked on the one that was in the end adopted? No. So the one that was adopted was a very nice design by Phil Wadler and Martin Odersky called GJ, which was the name of the original paper. And that's the one that eventually made it through the community process. But there were actually lots of teams working on different proposals. There was one out of MIT from Barbara Liskov and Andrew Myers, one out of Rice by Corky Cartwright. Kim Bruce had his. And they all varied in their notation and how expressive they were and different features. Oh, that's super cool.

Starting point is 00:04:34 And I don't think I've ever heard of another language process that quite worked like that. Yeah, I don't actually, I was too young to know the reason why Sun Microsystems, who was sort of controlling Java at the time, why they did it through this community process. But perhaps the feeling was, you know, type systems are kind of this thing that seems very simple, but actually can be quite subtle. You know, it's very easy to design a type system that has a soundness bug or has some maybe, you know, runtime cost. So they tried to leverage the smarts of the community to help them design it.

Starting point is 00:05:04 Okay, so that's how you got your first taste of programming languages. What happened then? So then after undergrad, I spent a couple of years in England doing not computer science, actually, but I knew I wanted to come back and do a PhD. And I ended up at Penn, where I worked with Benjamin Pierce, who's a friend of yours as well. And, you know, I thought that I would work on, I don't know, more type systems or maybe something in semantics, you know, the kind of really macho kind of big topics in programming languages.

Starting point is 00:05:35 And when I got to Penn, Benjamin said, I'm working on this data synchronizer. That might be a good project for you to start with. Oh, Unison. Yeah, so Benjamin has this tool built in OCaml called Unison. It's a file synchronizer. And at the time, he was trying to make it generic so that you could synchronize data that was in different formats.

Starting point is 00:05:55 And the project was called Harmony. So I was, I've never told this story publicly. I was actually a little disappointed because I thought, you know, I'm going to work on a really crazy type system or some, you know, new fancy denotational model. And instead, I was working on a data synchronizer, which sounded like a kind of grubby systems problem. But embedded in this project was this really beautiful programming languages problem. And Benjamin and his postdoc at the time, Alan Schmitt, had discovered that in order to synchronize data in different formats, you need to convert between those formats in two directions. So say you're synchronizing maybe a version of your calendar in the standard iCal format and maybe another version, at the time

Starting point is 00:06:39 XML was all the rage, so, you know, in XML. So it has somehow the same information, but there's some differences. And if you're going to synchronize them, you need to basically convert from one into the other and vice versa. And so they'd been writing these conversions, you know, writing sort of functions that go from, you know, version A to version B and from version B to version A, and very quickly realized these two functions that they were writing separately, they're like really close to each other. They're almost inverses. And as programming languages researchers, they said, well, maybe we should have some kind of abstractions for writing these two mappings.

Starting point is 00:07:12 And that became what's called lenses. So the idea was to have, you know, one abstraction from which you can derive both of these functions. Right. And actually, you worked on this lens idea early on, and in fact it has had a whole big life afterwards. There's all sorts of people in the Haskell community who have all sorts of variations on the lens idea

Starting point is 00:07:32 and use it kind of all over the place and seem to be very excited about it. Yeah, so lenses are one of these ideas that I think was sort of in the air. Benjamin and Alan were not the first to discover it. There was prior work in the database community. There was prior work actually in the functional programming community. Someone named Lambert Meertens had worked on a very similar idea.

Starting point is 00:07:51 But their definition sort of was particularly sort of clean and elegant, and so it kind of caught people's attention. And yeah, I worked on it for my PhD. We worked on a variety of lenses playing with, you know, what we could express, you know, how complicated can we make the functions, using types to make sure that the lenses had good properties. For example, you generally want to know that, you know, if you're making changes to one side and then you use a lens to push the changes to the other side, somehow those changes

Starting point is 00:08:21 should be reflected accurately. And you can try to characterize that using some kind of laws or some kind of specification. And then, you know, building this out and trying to kind of figure out how this could turn into useful, useful tools. So I spent six years working on lenses, had a lot of fun. And you mentioned sort of that lenses have kind of caught fire. Other people, much smarter than us, you know, took the idea and sort of generalized it. So in Haskell, there are versions of lenses that are not exactly our definition.
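To make the lens idea concrete, here is a minimal sketch in Python of one abstraction that packages both directions, together with the two round-trip laws commonly stated for lenses (often called GetPut and PutGet). It is not the Harmony implementation. The calendar-entry fields and the names Lens, get_summary, put_summary, get_put_holds and put_get_holds are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # source: the full data on one side
V = TypeVar("V")  # view: the part of it the other side cares about


@dataclass(frozen=True)
class Lens(Generic[S, V]):
    """A pair of functions: extract a view, and push an edited view back."""
    get: Callable[[S], V]
    put: Callable[[V, S], S]


# Hypothetical example: the source is a calendar entry (a dict), and the
# view is just its (title, start) pair.
def get_summary(entry: dict) -> tuple:
    return (entry["title"], entry["start"])


def put_summary(view: tuple, entry: dict) -> dict:
    title, start = view
    # Fields that are not part of the view (e.g. location) are preserved.
    return {**entry, "title": title, "start": start}


summary = Lens(get_summary, put_summary)


def get_put_holds(lens: Lens, source) -> bool:
    """GetPut: putting back an unmodified view leaves the source unchanged."""
    return lens.put(lens.get(source), source) == source


def put_get_holds(lens: Lens, view, source) -> bool:
    """PutGet: after putting a view, getting returns exactly that view."""
    return lens.get(lens.put(view, source)) == view


entry = {"title": "Standup", "start": "09:00", "location": "Room 4"}
updated = summary.put(("Standup", "09:30"), entry)
```

Here put changes only the fields exposed by the view, so an edit made on one side is reflected on the other without losing the data the view does not mention. The two checks are one way to phrase "changes should be reflected accurately" as testable properties.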

Starting point is 00:08:54 They've sort of loosened some of the mathematical definitions, but in ways that allow them to have lenses for all kinds of different structures. And it's really cool. Okay, but that was still you doing like kind of a little bit systems-y, but still like what feels like pretty kind of straight ahead PL research. Yeah. But that's not what you do today. So how did that happen?

Starting point is 00:09:13 So I have this funny kind of right turn in my career. As I was finishing my PhD, I decided to apply for faculty jobs, got hired at Cornell, where I spent 15 years on the faculty. And then I decided to take a kind of a gap year, really for personal reasons. My wife was finishing grad school. I wanted to stay near her. But also, I think, although programming languages and functional programming is my home, I kind of felt like, you know, I've thought about lenses for six years.

Starting point is 00:09:40 I've kind of had all my good ideas. I want to have something else to work on. And so I decided to take a leap and go to Princeton where I worked with Dave Walker and Jen Rexford on problems in networking. And I described this as a leap of faith for me. It was really a leap of faith, I think, for both sides. Jen in particular kind of hired me almost blind and thought that it might be fun to have a project

Starting point is 00:10:04 at the intersection of programming languages and networking and hired me as a postdoc with really no background in networking. And that was the start of my current career push. So maybe this is a good moment to stop and just talk about what is actually your current research like? What's the overall thrust of your research? What are the kind of core ideas that you're trying to explore there? Yeah.

Starting point is 00:10:25 So I've spent really the last 15 years working at the intersection of programming languages and networking. Maybe before I answer with my own goals, it's worth giving a little bit of context. Starting around 15 years ago, there was a big change that happened in the networking community. It's become known as software-defined networking. And it was driven by lots of factors, some kind of economic, some based on changes in hardware, but the real technical changes were really twofold. Some of the big organizations, like, you know, tech companies, cloud companies, they wanted to have more freedom to change how their networks work.

Starting point is 00:11:01 And that was very difficult for them with how networks were built, say, circa 2005 to 2010. And the second was their networks were getting really, really big. You know, they're building these huge data centers and managing the complexity of all the different protocols and control algorithms and such was getting out of hand. So they wanted a different way to do this. And so the sort of ambient, you know, background thing that happened is a bunch of people, both in industry and in academia, proposed sort of a new model for networks.

Starting point is 00:11:32 And it was based on the idea that you could really specify the behavior of your network somehow in software. So you could write programs or somehow generate, you know, the behavior of the network from these higher level descriptions. So to me, as a programming language researcher, it's kind of obvious that, you know, there's many ways you could sort of do this. You could have maybe databases of network structure that you somehow push down into the infrastructure. But wouldn't it be better to have a language, a language that lets you maybe operate at a higher level than what the hardware lets you do? So you can sort of design different data structures, different algorithms, and write those in ways that are more natural for humans. You might want to be able to compose programs. You might have like a network that's doing multiple things.

Starting point is 00:12:18 and you'd like to be able to think about each of these things separately and then somehow compose them together. Conversely, you might have a network that's doing multiple things and you don't want them to interact. So you want to keep them separate. And so being able to express, you know, different forms of composition or isolation using language abstractions is very natural. And then third, to tame this, to sort of manage this complexity and these operational challenges, you know, having tools that could understand the semantics of what the network is doing and then kind of turn the crank and understand, okay, here are all the behaviors of the network. And some of them maybe are good and some of them might be bad. You know, being able to verify networks or reason about networks was a third thing. So just to make sure I'm getting this right.

Starting point is 00:12:55 So in some sense, the situation ex ante was there were a bunch of switch manufacturers, right? There was a kind of pre-existing notion that was kind of built out of like the early internet days of like what does a switch do and what is IP and what are the protocols. And then there were actually programming languages of a kind. There were like configuration languages that you could use to choose the behavior of an individual switch. And then the kind of classic way of managing a big network is you figure out what the layout of this physically should be, how you want to wire things together, you buy a bunch of these devices, you wire them up, and then you hire a bunch of network engineers whose job is to kind of very carefully configure all of this stuff so that it

Starting point is 00:13:35 doesn't break the network and you get the properties that you want. And then SDN in some sense is like, no, no, no, we're going to like, I guess you still have to do a lot of the physical layer work. But then the configuration part is very different. Now, instead of configuring each individual piece, you try and have an overall program that tells you how the network as a whole works. That's right.

Starting point is 00:13:57 And then, like, the advantages, like, one advantage you're saying is it's much more configurable so that you can, like, get more behaviors out of the hardware than you could have gotten out of, like, the stock thing that you would get from the vendor before. So there's, like, a kind of faster cycle to be able to iterate on new things that you want to do.

Starting point is 00:14:13 And then maybe, like, another big thing is the ability to reason about the thing that you're doing at a higher level, all the things you said about composition, like, the thing I hear from that is like, oh, I can actually like predict how the thing is going to behave and understand based on a relatively small program what the behavior of this like sprawling network is going to be and make sure that properties that I care about are enforced. Yep. And I think like one thing that's different though, like you should, we should understand that, you know, what counts as a program that you might want to run in a network is going to be different than the kinds of programs that you might

Starting point is 00:14:44 run on a server. So there still are some things about, you know, about hardware. For example, due to the speeds and, you know, the sort of scale that networks operate at, you're not going to write an algorithm that sort of, you know, has a heap and is sort of, you know, allocating memory on every packet at every hop or something. That's going to be crazy. So you are in somewhat of a special domain. There's also, you know, some differences, you know, just from the fact that networks are part of the infrastructure. So even in, say, a single organization, there's usually a desire to know things about how different pieces of the network are isolated from each other. Or, again, even in a single organization, you may have multiple units that are responsible for controlling the network.

Starting point is 00:15:28 And so you still have, in some form, you know, federated or distributed control. So this is what makes it, you know, not just let's take lambda calculus or let's take Java and, you know, that's our way of writing network algorithms. There's still some interesting domain-specific structure that needs to be explored. And in a lot of ways, this echoes to me the story around hardware synthesis, right? Where, again, it's like there's a big graph structured computation thing, and you need some kind of language for expressing it. And then there's a lot of play there of, like, how sophisticated is the language and how much power does the language give you to kind of reason about the thing you're doing and to

Starting point is 00:16:04 flexibly compose bigger designs out of smaller designs. But again, like, there are these very profound limitations on what you can do in a hardware design, like you're not going to write the same kind of code at all. I guess it's a question of, like, how much of traditional PL theory applies in this world, right? Because it's like the constraints are very different. Yeah. So maybe I can tell you about a couple of projects so we can work through them. But one thing that's been very kind of exciting and kind of cool for me is despite all these differences, there are some things from classical PL theory that match up quite well. So one of the languages that we discovered early on is a framework we call NetKAT.

Starting point is 00:16:42 And this is a language that we've designed for describing not the whole network. So we're not describing the behavior of the end host. We're not describing TCP and congestion control. We're not even describing what's called the control plane, which is the sort of brains of the network that decides, you know, which routes we're going to use and how to respond to failures. So we're just describing really the, you know, how the network processes packets at the forwarding level.

Starting point is 00:17:09 And we've been working on a sort of series of DSLs, for several years and kind of following on from my postdoc at Princeton and we had, you know, sort of, I would say, ideas about how to make these languages, you know, what features we might need,

Starting point is 00:17:23 how to do it in an elegant way. But there was this exciting moment when we realized that all of these languages lined up really well with a system called KAT, which is shorthand for Kleene algebra with tests, which had been around for a couple of decades.

Starting point is 00:17:37 In fact, Dexter Kozen at Cornell had discovered this framework and had worked on it for a long time. So KAT is really, it's a pretty high-level mathematical abstract framework that's meant to be kind of a model of kind of standard imperative programming. We can talk more about it, but that actually lines up really well with at least how you can think of the forwarding behavior of packets through a network. And so the reason I say this is exciting is, you know, although you start from a domain that seems to have all these, you know, weird primitives and weird constraints, you can sort of extract this thing that looks a lot like, you know, finite automata and state machines and all the things you learn about as a second year computer science student. And that alignment is quite

Starting point is 00:18:21 cool to discover. Yeah, I sort of think of this as like the one weird trick of programming language theory, which is that like, there's this like complicated and messy and very human process of writing programs. And then it turns out a lot of the best ideas in programming languages come from relating the thing you're doing to very simple mathematical models. And there's something nice about languages that have this tight relationship. Not all of them do, right? There are languages that are kind of much messier mathematically. But the ones that have a kind of tighter and simpler mathematical foundation, at least my sense is, tend to be better at generalizing. Like the features that you add to like solve one problem turn out to solve lots of different problems and compose nicely with other ideas. Having this like foundation gives you this kind of nice playground where the different ideas that you come up with end up integrating better with each other, because they kind of all fit into this relatively simple worldview. Yeah, I completely agree. And actually, in our academic paper on NetKAT,

Starting point is 00:19:19 that's exactly what we said in the introduction. We sort of said, the value of this remark is not so much that we can do something you couldn't do before, but we've now aligned ourselves with this, you know, with this theory that is actually backed by, you know, going all the way back to Kleene to like the 1950s. And that gives us both kind of some confidence that what we're doing is maybe right or at least, you know, sensible.
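As a rough illustration of how forwarding behavior lines up with a Kleene algebra with tests, here is a small Python sketch in the spirit of NetKAT. It is heavily simplified and is not the NetKAT implementation: a policy is modeled as a function from one packet to a set of packets, packet histories are ignored, and all the names (packet, filter_eq, modify, seq, union, star) are hypothetical.

```python
from typing import Callable, FrozenSet, Tuple

# A packet is a sorted tuple of (field, value) pairs so that it is hashable.
Packet = Tuple[Tuple[str, int], ...]
# A policy maps one input packet to the set of packets it produces.
Policy = Callable[[Packet], FrozenSet[Packet]]


def packet(**fields: int) -> Packet:
    return tuple(sorted(fields.items()))


def filter_eq(field: str, value: int) -> Policy:
    """Test: pass the packet through if field == value, otherwise drop it."""
    return lambda pk: frozenset([pk]) if dict(pk).get(field) == value else frozenset()


def modify(field: str, value: int) -> Policy:
    """Action: overwrite one header field."""
    def run(pk: Packet) -> FrozenSet[Packet]:
        updated = dict(pk)
        updated[field] = value
        return frozenset([tuple(sorted(updated.items()))])
    return run


def seq(p: Policy, q: Policy) -> Policy:
    """Sequential composition: run q on every packet that p produces."""
    return lambda pk: frozenset(out for mid in p(pk) for out in q(mid))


def union(p: Policy, q: Policy) -> Policy:
    """Parallel composition: the union of both behaviors."""
    return lambda pk: p(pk) | q(pk)


def star(p: Policy) -> Policy:
    """Kleene star: apply p zero or more times, collecting every result."""
    def run(pk: Packet) -> FrozenSet[Packet]:
        seen = {pk}
        frontier = {pk}
        while frontier:
            frontier = {out for mid in frontier for out in p(mid)} - seen
            seen |= frontier
        return frozenset(seen)
    return run


# Hypothetical forwarding table: send dst 1 out port 1 and dst 2 out port 2.
forward = union(
    seq(filter_eq("dst", 1), modify("port", 1)),
    seq(filter_eq("dst", 2), modify("port", 2)),
)
result = forward(packet(dst=2, port=0))
```

The operators are the usual regular-expression ones (union, sequencing, star) plus tests, which is why tools from formal language theory and automata can be brought to bear on policies written this way.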

Starting point is 00:19:41 It also gives us a whole bunch of constructions and tools that we can pull from formal language theory. And in our work, we've actually used a bunch of those tools to build compilers, to build verification tools. But especially, it's what you said. So as we've extended NetKAT with new features, you know, now we're kind of back into being, you know, confused researchers, just playing with examples and trying to make things work. But having this structure of, you know, Kleene algebra with tests has provided guidance to us. So, for example, we've worked on a probabilistic version of this language, which is very useful in networking because you have unexpected things. You have traffic and you don't know how much traffic there's going to be. You have failures.

Starting point is 00:20:19 You don't know when the failures are going to happen. And you may also have like randomized algorithms that are used to load balance across the network. And so reasoning about the behavior of all these things requires, you know, reasoning about probabilities. So we extended NetKAT with, you know, probabilities, worked out the semantics. And that was very, you know, not obvious, and a bit subtle. And I think if we didn't have the structure of KAT and other associated theories,

Starting point is 00:20:45 we would have very easily ended up with a language that was kind of incoherent. We've been talking about a mix of like, some about what software-defined networks is and kind of in the broader sense. And then also about your research. Like, how would you explain, like, the particular take that you and the other people

Starting point is 00:20:59 that you work with have on this field? Yeah, I mean, I think the main kind of slogan is the network should just be thought of as like another program. And that's a short sentence that maybe sounds very trivial. But to networking people, it's a really different way of thinking about things. You know, for many years, innovation in networks has been driven either by things that happen at the hardware level.

Starting point is 00:21:26 You know, a vendor comes along with a new router that has a hardware pipeline that has some extra feature. And then that turns into, you know, something new that you do in load balancing or congestion control or queuing or something else. Or, you know, driven by, you know, standards bodies for a long time. You know, if you want to do something new, you have to have a good problem, have a solution, and then go convince a vendor to implement it, go convince a bunch of other users of networks that, you know, we should standardize this, get it ratified by the standards bodies.

Starting point is 00:21:58 It's a very long, slow process. And so thinking of networks just as programs is like, well, just as we don't go ask permission from, you know, Intel or NVIDIA when we want to deploy a new algorithm on hardware, we just write a new program. We should have the same freedom with networks. How much of this is basically gated on the sort of existence of the hyperscalers, of, like, huge companies like Google and Amazon and stuff

Starting point is 00:22:21 that have enormous networks that they can configure this way? Like, in some sense, it's an issue around the domain of administrative control. Like, if I need to primarily build a thing that interacts with things I don't have any control over, then I can sort of see how I'm gated by the standards bodies because I have to get everyone to agree to share the language with which we communicate. Whereas if I have my own enormous network that I'm going to configure,

Starting point is 00:22:44 I can just think of the whole thing as a program and then I maybe have to think about the standards bodies on the edges, but within the network, I get to do that. Yeah, I think that's right. I mean, if you, again, go back and sort of tell the, you know, intellectual history of software-defined networking, it very much did emerge when these really large private networks

Starting point is 00:23:01 became a thing, and the companies that were building them wanted to have the freedom to, you know, basically define new features at, you know, software timescales. And it's true, the thing I was just sort of, you know, throwing shade at of, you know, vendors and standards bodies, well, that's what built the internet. And the internet worked because you took, you know, tens of thousands of autonomous systems, tens of thousands of, you know, networks built by different organizations, different people on different hardware, and originally, like even all the way down to the physical layer, they were using, you know, different ways of moving bits around. And you connect it all and make it all work and interoperate. So that's, you know, the internet was designed for really, you know, connecting up all these different networks and making it work worldwide. And that's why, you know, we ended up with certain solutions. But in these really large networks with, you know, hundreds of thousands or even millions of computers and a comparable number of switches and routers, you may also want to, you know, have the ability to define those networks and to optimize them and to have them implement certain features. And so that's a big part of the story, you know, more economic than technical. But I think it's an important change that, you know, you do have sort of, again, even in a large organization, you do have often multiple teams or multiple units that are involved in this, but still there's sort of, you know, one ultimate unit of control that gets to define the goal.

Starting point is 00:24:31 Do you think the idea of the network as a program generalizes to the open internet? This is one of the big challenges that the community has been thinking about for decades. So one of the kind of paradoxes of the internet is it's so successful that you can't change it. And this is something that the community has been very worried about going back quite a number of years, at least to the 90s. So there have been a bunch of efforts to think about, well, you know, the internet works really well at today's scale, for today's applications. But there are things that come along that we'd like it to do. And how are you going to change, you know, these tens of thousands of ASs?

Starting point is 00:25:10 How are you going to, how would you, you know, decide to move to a different routing protocol for the internet or, you know, a different way of moving packets around? You can't, you know, turn it off and turn it back on tomorrow with a big flag day. And so there was this sense, and if you go back and read sort of papers from 20 years ago, people would talk about ossification, the idea that the internet structure and its kind of scale were kind of setting in and it was impossible to change. So there's a whole other community that's been thinking about how we could design an internet

Starting point is 00:25:40 that is extensible and evolvable. And that's a sort of very rich, cool space. And there are people with cool ideas. Some of them involve programming languages, but a lot of them also involve, you know, different architectures, different ways of getting extensibility. And to what extent do you think this idea of your network as a program has caught on, has been influential, has kind of changed how people build networks in practice, like, both among, like, there's like an academic audience for this, but there's also lots of practitioners

Starting point is 00:26:10 and companies and also all of these hardware vendors. Like, how has this idea kind of propagated over time? Yeah, it's funny. You know, if you track sort of ideas or these trends, I mean, my view is this idea has become, you know, just the way things are done. In fact, I can back this up with a little bit of evidence. A few years ago, some of my collaborators, including Jen Rexford and Nick McKeown and some others, we wrote a paper basically arguing that, you know, the network as a program was here and it was a sort of vision paper for a short conference. And it got rejected, and we were a little bit miffed. I mean, we get papers rejected all the time, but, you know, we were really proud

Starting point is 00:26:52 of this paper. But the reviews actually said, you're describing the way the world works. Like, your ideas aren't spicy enough. Like, this is how things work. So I think to some extent, you know, the network as a program or software-defined networks just is how things work. Now, for folks who are familiar with the sort of original articulation of software-defined networking, of course, there are some ideas that 15 years ago people were saying, well, you know, we should build, you know, centralized algorithms or logically centralized algorithms. And that idea has not so much caught on in practice. Or another example is, you know, there was a big push, I played a small part in it, in making network routers sort of truly programmable. Almost every piece of their functionality could be specified in a program.

Starting point is 00:27:38 And again, for mostly economic reasons, that idea has not caught on. But at the same time, you know, the sort of ability to change the, say, hardware pipelines of these routers is coming, just not in sort of the way that it was originally articulated. So, again, I'm biased, but I think it really is the way the networks are going, and the path towards this vision of, like, you sort of can write code and get the rich behaviors you want. It's not exactly smooth in the way that was predicted at all times, but the general trend is in that direction.

Starting point is 00:28:11 So the SDN idea is interesting to me because I feel like it falls along another theme that's like a little different from the way you framed it, where you're kind of talking about this basic, you know, being able to specify things in cleaner ways, getting better abstractions out of it. And I think that's all part of it. But another thing that I think has been very valuable in this kind of work is just adopting the culture of software

Starting point is 00:28:35 and the kinds of tools of software. Like, there's all these domains in computer science that just have picked up different approaches and techniques for building things. And if you look at the way in which people think about management of databases or doing hardware synthesis or networks or building traditional software, like the techniques are actually all really different. And there's a bunch of really good ideas that have come up in software that I think aren't as clearly expressed in the other domains, things like the way that you do version control and code review and testing and things like that. And in like the old world of networking where, you know, you just go and configure the switches

Starting point is 00:29:16 to do the thing that you want, you kind of don't have this centralized place where you can do all of these pieces. And so sort of in some ways separate from the kind of like nice semantic improvements, which I think are like super important, this thing of like allowing the kinds of tools that people use and the kind of engineering approaches that people use in software to apply to domains like networking seems like another advantage of this whole thing, which I guess is like maybe hard to summarize nicely in an academic paper, but I feel like in practice is a big part of where the advantage comes from. Yeah, I completely agree. And the other, so every time I'm talking about

Starting point is 00:29:54 sort of trends that got, you know, a bit of maybe hype and a lot of attention. And maybe the hype wasn't quite deserved or, you know, things didn't quite play out. This area, I think, well, you're talking about something more broad than just formal reasoning. I think you're just talking about adopting kind of modern software practices, you know, keeping things in databases or repositories, having, you know, humans not just log into a router and sort of, you know, yolo a change. They run it through a process. Maybe there's even some checking. Absolutely. And you can actually, I mean, you can go back. People were thinking about this

Starting point is 00:30:30 even at the ISPs in the 90s. Large ISPs were sort of the equivalent of the hyperscalers back then. They operated sort of big networks that were complicated, expected to work, and they had started to experiment with some of these ideas of, you know, having at least sort of centralized,

Starting point is 00:30:45 you know, specification of the functionality at companies like AT&T and then how to realize that. But the other piece is verification. And this, I think, is also, it's maybe not quite as far along, but it's something that's becoming quite commonplace. All the hyperscalers are doing it. There's also some startups.

Starting point is 00:31:03 There's, of course, many academics who are interested in this idea. And here it's that, you know, well, if you have some representation, maybe it's not a beautiful representation in NetKAT, but you at least have some program that describes how the network is supposed to behave or how it is configured, you could start to apply, you know, all the tools of software engineering, testing tools, validation tools, even verification tools. And this could then become part of your workflow.

Starting point is 00:31:29 So before someone decides to push down a change that might change the routes between two data centers, you could check, is this going to break connectivity anywhere else in my network? And that's a good idea. Yeah, and I guess you actually hear lots of stories of large companies managing large networks where config changes break things and cause huge outages.

Starting point is 00:31:48 It's actually like one of the biggest problems you run into is people having a config change that unexpectedly has some semantic behavior that you didn't expect. Yeah. I mean, I think to be clear, this area is really not done. And it's in part because although this, I think I'll take network verification, you know, it sort of took this one layer that became exposed.

Starting point is 00:32:07 So the idea that there's a centralized either database or program that is defining the behavior of the network, and then that gets pushed down to the routers, who then, you know, realize it. That gives you a place where you could sort of interpose and, you know, you could intercept snapshots of the network and start to test them or reason about them. And so that's what people mostly have done. But networks are much more than that. And they're, you know, they're distributed systems. So they have all the complexity of distributed systems where you can have failures that you didn't expect and interactions that weren't part of your model. And so we're

Starting point is 00:32:41 definitely not done. I mean, you still see outages due to human error or, you know, flaws in a model all the time. And I think this is, you know, this will get better, but it is really hard to reason about these complex systems that have components you didn't even really think about or know about, you know, multiple control loops, funny interactions. It's a, it's a true puzzle that, you know, requires some new ideas to make progress on. Do you have a good example of a kind of problem in this space that is like now pretty well solved, like a thing that people would pretty routinely get wrong in the past and that now there are at least in some places good verification checks to help people not make those mistakes. Yes. I'm going to twist the question slightly and not

Starting point is 00:33:26 say that people got it wrong, but that there was also sort of conservatism. Like people were afraid to make changes because they weren't sure what the impact of those changes would be. This is, by the way, just like a huge part of the network engineering story as I've experienced it. Like I think part of the job of a good network engineer is saying no. Exactly. That's too complicated. We're not going to do that. Yeah.

Starting point is 00:33:48 One example that's, I think, it's no longer research. It's sort of been fully reduced to practice is reasoning about these snapshots of, you know, the so-called forwarding plane of a network. So, of course, networks have changes happening all the time. There's failures. There's different controllers that are maybe, you know, monitoring the system and making changes. But you can pretend that there's a snapshot, you know, a consistent snapshot maybe, that you can sort of extract and then reason about. And that snapshot can be modeled using tools like model checkers or SAT solvers, these automated theorem provers that understand first-order logic, or custom tools. NetKAT is such a tool.

Starting point is 00:34:25 You could put it into NetKAT and then ask questions about the model. So this is something that is pretty widely done and, you know, mostly just works. I mean, it does what it's supposed to do. And so, you know, it does catch certain errors in the sense that, you know, you could write a specification like, you know, these two hosts should always be connected, no matter what routes are being used, and I want these two hosts to be able to send traffic to each other, or these two hosts should be isolated. And if something goes wrong and somehow there's a path between them, that's bad. So these kinds of properties you can check. And, you know, if your control plane, okay, the brains

Starting point is 00:35:03 in the network, makes a change that would violate that property, you get some kind of signal or exception. So, and that's, you know, I think that's useful in, say, cloud companies. But what it doesn't solve, of course, is what do you do? So if you're getting a failure of a property because the control plane is doing something bad, then what? So it's not like we've sort of made networks perfectly reliable or perfectly able to satisfy their specifications.
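
As a rough illustration of the kind of check described above, here is a minimal Python sketch. It is not NetKAT or any production verifier: the snapshot format (each node maps a destination host to a next hop), the host and switch names, and the helper names `delivers` and `violations` are all assumptions made up for this example. Real tools reason about packet headers and sets of packets rather than one next hop per destination.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot format: node -> {destination host -> next hop}.
Snapshot = Dict[str, Dict[str, str]]


def delivers(snapshot: Snapshot, src: str, dst: str) -> bool:
    """Follow next hops from src toward dst; True if the packet arrives."""
    node, seen = src, set()
    while node != dst:
        if node in seen:  # forwarding loop
            return False
        seen.add(node)
        next_hop = snapshot.get(node, {}).get(dst)
        if next_hop is None:  # no route, so the packet is dropped
            return False
        node = next_hop
    return True


def violations(
    snapshot: Snapshot,
    must_reach: List[Tuple[str, str]],
    must_isolate: List[Tuple[str, str]],
) -> List[str]:
    """Return a description of every property the snapshot breaks."""
    broken = []
    for a, b in must_reach:
        if not (delivers(snapshot, a, b) and delivers(snapshot, b, a)):
            broken.append(f"{a} and {b} should be connected")
    for a, b in must_isolate:
        if delivers(snapshot, a, b) or delivers(snapshot, b, a):
            broken.append(f"{a} and {b} should be isolated")
    return broken


# Made-up topology: hosts h1, h2, h3 attached to a single switch s1.
current: Snapshot = {
    "h1": {"h2": "s1"},
    "h2": {"h1": "s1"},
    "h3": {},
    "s1": {"h1": "h1", "h2": "h2"},
}
# A proposed change that opens a path to h3 and drops the route back to h1.
proposed: Snapshot = {
    "h1": {"h2": "s1", "h3": "s1"},
    "h2": {"h1": "s1"},
    "h3": {},
    "s1": {"h2": "h2", "h3": "h3"},
}

must_reach = [("h1", "h2")]
must_isolate = [("h1", "h3")]

for name, snap in [("current", current), ("proposed", proposed)]:
    print(name, violations(snap, must_reach, must_isolate))
```

Running the same properties against the snapshot before and after a proposed change gives the "signal" mentioned above. What to do once a violation is reported is a separate question that the sketch does not answer.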

Starting point is 00:35:32 Well, if the control plane wants to do something and you know it's bad, can't you just, like, I don't know, like in a software context, be like, oh, we won't merge that PR, right? Like, if you can catch it at the time where the change is proposed, before it's accepted, then there's something you can do about it, which is just, again, like the network engineer, you can say no. Yes, although there are cases where that may be sort of the wrong move, you know, if it has some other effect. Like, you made this change because of a failure. Do you, you know, keep the failure unsolved because you decided to reject this change that violated your spec? This is where things get a little murky. Oh, interesting. So like, maybe. Basically, there's a control system sitting on top and, you know, merely saying, I reject your change, if the control system is not going to then do something better, you may not have actually improved life. Got it.

Starting point is 00:36:21 That makes sense. So a lot of the things that you're talking about seem like they involve a pretty rich connection between a bunch of ideas that you can develop in an academic context, and then a bunch of industrial use cases, some of these at places like the hyperscalers, some of them at the actual network switch vendors. And I know that you spent some of your career with various kinds of engagements with the kind of industrial side. Can you say a little bit more about how that worked and kind of how you've integrated that into your kind of career and research

Starting point is 00:36:51 and approach to thinking about the space? Yeah. So this is something I actually love about the networking research community. And I say the research community, not the academic community, because it truly involves, you know, the hyperscalers and the switch vendors. Somehow, you know, the particular community that identifies as researchers in networking is not just driven by universities and PhD students. It involves all these different entities. And so it has a really nice mix of, you know, you have people doing pure theory, but you also have people who have, you know, designed the, you know, wide area backbone for a giant cloud company. And they're all coming together to share ideas. So that's really cool as a researcher because you have this, you know, relatively small group of people who all know each other.

Starting point is 00:37:36 and they're working on related ideas and you therefore have this kind of, you know, you can have quick transitions of research ideas getting into practice. I think, you know, my sort of original home community of programming languages also has this, but as we already said,

Starting point is 00:37:52 the timescales are often much longer. We talked about garbage collection from, you know, the 1950s till the 1990s or, you know, type systems, similarly many decades. And actually, you know, I mean, Jane Street's a very prominent company in the functional programming space. So people understand that, you know, functional programming is being used industrially. But

Starting point is 00:38:12 Jane Street is a little, not unique, but, you know, there's not as much of a conversation between sort of mainstream programming languages as used by, you know, the millions of developers in the world and, you know, the academic community. So it's something I really love about the networking research community. Although maybe there's like more now, like Rust is another example of a language where there's been a lot of very rich connection between academic and industrial. Yeah, yeah. I do feel like it's been changing and yeah, similar kind of short time scales to cool academic idea, you know, appearing in some mainstream language and then being broadly used. There's another piece of this that's kind of interesting, which is when these ideas like software-defined networking first came out, a lot of the companies decided deliberately to sort of engage a broader ecosystem, a broader community. Quite famously, sort of, you know, Google was sort of very interested in ideas like OpenFlow, which was the early SDN sort of standard, but chose to do it in open source for, you know, strategic reasons.

Starting point is 00:39:13 But that sort of created an opportunity to build a community around these things. And were you involved in thinking through and helping set any of those standards? No, I was not involved in OpenFlow at all. That was sort of, that was already pretty baked by the time I did my postdoc with Jen and David at Princeton. I did get involved in this second phase of trying to design languages and associated hardware for describing the behavior of individual routers, switches, network interface cards. We worked on a language called P4, and that was a similar sort of community effort. Right, and like OpenFlow was more like you get to set the routing table, like a little bit more general than that. But it was like there are like tables that you get to configure there.

Starting point is 00:39:56 and then like P4 was more like you kind of get to write the whole switch. Yeah. OpenFlow people like to bash on because it was kind of cartoony. I think its designers did not intend it to be a cartoon, but there's sort of a big gap between OpenFlow's model of how a router works,

Starting point is 00:40:12 which is basically there's one big lookup table and you're going to cram all your logic into this one big lookup table. And the reality of, you know, high speed routers and switches, which have pretty complex pipelines with specialized units that do certain things. So I think the original hope was that somehow OpenFlow would be realized by, you know, smart compiler teams, you know, people who would sort of

Starting point is 00:40:31 lower it down to these pipelines. That's actually a pretty hard task. And so P4, the benefit of being sort of a second mover, was sort of a, you know, a second attempt. And it just exposes the structure of the pipeline. So you do get to customize what happens, but there are certain things that can be just exposed in the language or in the programs itself. Right. And the work with P4, this actually involves some pretty deep engagement on the industry side for you as well, right? Yeah, so I chose during my first sabbatical at Cornell to go be a part of the company that was developing one of these programmable switches, called Barefoot Networks, and then also the P4 community. This was a choice because I think I felt some maybe, I don't know if imposter syndrome is the right word, but I felt very much like I was, you know, I was sort of the programming languages academic who was sort of going and, you know, cosplaying in networking, and I wanted to understand at a deep level, you know, how does a router really work?
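
The gap described above, between one big lookup table and a pipeline of stages, can be sketched with a toy example. This is plain Python, not OpenFlow or P4 syntax, and the ACL, the routes, and the function names are invented for illustration. The assumption being illustrated is that when two independent decisions are flattened into a single table, that table needs one entry per combination of keys.

```python
from itertools import product

# Hypothetical two-stage logic: an ACL on TCP port, then routing on destination.
acl = {22: "drop", 80: "allow", 443: "allow"}           # port -> verdict
routes = {"10.0.0.1": 1, "10.0.0.2": 2, "10.0.0.3": 3}  # destination -> output port


def pipeline(dst, port):
    """Two small tables applied in sequence, as in a staged pipeline."""
    if acl.get(port, "drop") == "drop":
        return None  # dropped by the ACL stage (unknown ports are dropped too)
    return routes.get(dst)  # None if there is no route


# The same behaviour crammed into one table keyed on (destination, port).
flat = {
    (dst, port): (routes[dst] if acl[port] == "allow" else None)
    for dst, port in product(routes, acl)
}


def single_table(dst, port):
    """One lookup in the flattened table; a miss means drop."""
    return flat.get((dst, port))


print("pipeline entries:", len(acl) + len(routes))
print("single-table entries:", len(flat))
```

In this toy case the two forms behave identically, but the flattened table grows with the product of the two tables' sizes while the staged form grows with their sum. Going the other way, from a flat description to a fixed hardware pipeline, is the compilation problem the speaker calls a pretty hard task.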

Starting point is 00:41:28 And so going to a hardware company seemed like a great way to do that. And then how, like, I'm a little curious, like, both how did it feel making that transition? Like, I feel like going to be a CS academic is like a choice, right? You can go and do computer science in academic context. You can go and do it in an industrial context. And I'm kind of curious, like, why you in the end, like, made that move and what you felt like you got out of it. and how it affected kind of your thinking and research after that. Yeah.

Starting point is 00:41:57 I mean, I can tell you the kind of personal history. There's, again, you know, mentors who sort of provided advice. I was very unsure. My colleague Fred Schneider, who you know well at Cornell. My advisor. Your advisor, yeah. So I remember talking to Fred, you know, should I do this? And he was like, absolutely, you know, if you have a chance to go kind of deepen your knowledge and expertise in a space,

Starting point is 00:42:20 that's going to pay dividends down the road. I had other mentors, George Varghese, who's at UCLA now, he was sort of like, you know, go do this. And then Nick McKeown, who was the co-founder of the company and one of the SDN pioneers, he really sort of, you know, opened the door to me to come be at the company. So it did not feel like a big risk. I confess, of course, you know, really not knowing anything about hardware except for my VLSI class from sophomore year, it was, not that I became a hardware designer, but like, the hardware industry is amazing. I mean, you have just these people who are, you know, the best at what they do at designing circuits, at optimizing them, at, you know, physical layout, at integrating all the pieces from different vendors, going to the fab and getting it manufactured. I mean, it's just amazing what is involved in making a chip.

Starting point is 00:43:11 And the startup had some real veterans and people who really knew what they were doing. It was some alumni of Texas Instruments and then others from around the Bay Area. And I learned a lot. It was really fun to be with those kinds of experts and, you know, learn how hardware works and how it's built. Do you have any concrete examples of ways in which your research after was different than your research before for having had the experience? Well, there's a line of work that came out of the sabbatical that I think I wouldn't have done. I can't take credit for it. My colleague at Barefoot, Changhoon Kim, he's now at Google and does lots of AI infrastructure for them, had this idea that if we have routers that are programmable, like fully, I should say,

Starting point is 00:43:50 on the router chip that Barefoot was designing literally almost the entire behavior end to end, you could specify in the program. There were a few things that were fixed, but pretty much, you know, you sort of receive bits, and then you can write a program that parses those bits into some data structures, and then you can write some code that, you know, interacts with different memories, and you can change the bits, and you can run them through different hash functions and other functions, and then you can, you know, spit them out the other end. The whole thing you could really specify. And so he sort of realized, you know, this is just another kind of processor. It's a processor that looks a lot different than a CPU, but

Starting point is 00:44:24 it has a little bit of memory, a little bit of state, and it's very high throughput. And, you know, if you're thinking about a data center, what would it mean to have a bunch of processors, you know, if you take a typical path in a data center, you go through, I don't know, three to six different hops maybe? What if you could do a little bit of processing on each processor? And so we worked on a number of applications of this idea. It later became known as sort of in-network computing. But the question was like, what could you do? Like what functions could you kind of cram onto one of these weird, quirky,

Starting point is 00:44:59 resource-constrained things? And then I think the deeper question, which was quite controversial, is like, what should you do? Should you do anything in the network? And if so, which functions would make sense? And this is controversial for people who don't remember their undergrad networking class. You know, there's, again, from the Internet's design, and there's a sort of famous paper people often read, that argued that

Starting point is 00:45:19 you really don't want to put functions and services in the network that will only benefit, you know, certain users because then everyone sort of pays the cost. And so there's sort of some orthodoxy around, you know, keep the network very simple. Right. This is like the end-to-end principle or something. Yeah. Yeah. So that's an example. I, you know, really enjoyed working on those projects and this idea is still very controversial, you know, whether we should have in network computing, but I also believe that eventually, you know, some form of this will become inevitable. And we see this a little bit, you know, some of the networks that are supporting ML fabrics have, you know, more than just sort of traditional packet forwarding. And there are some services

Starting point is 00:46:01 that were developed in some of these systems, things like, you know, consensus protocols or ability to make observations about what's happening in the network fabric or, you know, things like caching certain data that I think really are good ideas or could be good ideas in the right context. So that's something I definitely wouldn't have done if I didn't understand hardware. Yeah. And that's, I think, something that we also do some amount of, of, like, finding places where there are, like, critical bits that you want to accelerate that you want to kind of put at particular choke points in the network, sometimes for monitoring reasons, sometimes for, like, filtering and shaping of traffic. And I guess in our context, it in some sense helps that

Starting point is 00:46:41 there are like a small number of kinds of flows that come up all over the place, and like multicast is maybe like the biggest one. Like, you know, it's the way in which exchanges distribute market data. There's actually a lot of complexity around the fact that, like, multicast is not super well supported by all the hyperscalers. This is a thing that I remember from when I was, eons ago, in grad school and learned about networking: multicast was this important thing that was going to be the way that we delivered video to people and the like. That turned out not to happen. In fact, I think there was like a basic confusion.

Starting point is 00:47:16 It wasn't clear until later, which is that, like, it turns out multicast, which is like a mechanism for essentially broadcasting data, which involves laying out trees that you can, like, use for the automatic transmission and, like, take advantage of the network switches' capability of, like, copying data in parallel down multiple paths concurrently. And it seems like a great way to get the same data to lots of different people. But I think the thing that wasn't clear at the time was that the data plane was going to be super cheap

Starting point is 00:47:45 and the control plane was going to be really expensive, meaning you would have a huge amount of bandwidth for sending data around and actually very little space for, like, the control data with which you would, like, lay down the trees. And if, like, lots of people want to consume lots of different data

Starting point is 00:48:02 and sort of these different logically separate multicasts, then like you just weren't going to have space to kind of specify all that. But in our world, there's a relatively small number of channels, a small number of things we want to kind of get to everyone. And so the old multicast idea kind of works in this context, even though it kind of totally failed in the outside world. And now it's just like, you know, in the cloud,

Starting point is 00:48:23 there's just like no multicast. In fact, sometimes cloud vendors will give you things that look like multicast, but they're implemented really, really badly and slowly. And it turns out those are there for if you have some ancient application that you want to run and it thinks it wants multicast, then you can give it that interface and we will do a thing that like kind of delivers the right

Starting point is 00:48:45 packets at totally the wrong time scales, but you can get some like legacy software to work that otherwise wouldn't work at all. Yeah. I think it's well, I want to talk about multicast some more, but I think it's also interesting, you know, you sort of mentioned a way that your grad school

Starting point is 00:49:01 version of networking and certain distributed systems kind of was wrong. And I've described a few cases where I worked on things that kind of didn't end up being as successful as we hoped. And I think that's actually really healthy. I mean, certainly university researchers should be working on things that don't work out, you know, and I don't mean don't work out like you couldn't solve the problem. You couldn't prove the theorem. You couldn't build the system or whatever. But like don't end up being the way the world works. I mean, that's, you know, part of being in a

Starting point is 00:49:27 creative, innovative community is like people are trying wild things. And, you know, not all of them are going to be the right thing to do or, you know, the good thing to do at a particular juncture. And I think when communities get too conservative and, you know, you're supposed to do things the sort of orthodox way, that's actually a recipe for stagnation, and that's a little bit of the ossification that happened in the internet community. It was like there were these principles, which are good principles.

Starting point is 00:49:51 You know, the end-to-end principle is a good principle. But, you know, you should know when to break it and, you know, breaking it may not be the right thing for every scenario. But if we never are allowed to go revisit that rule, you know, somehow the world just got a little smaller. So I love communities that, you know, in some unruly way are, like, you know, advancing and making progress towards

Starting point is 00:50:11 greater and greater things. But along the perimeter, there's just all kinds of chaos and people doing crazy things. I know you watched this video. We just put on a conference called NINeS, which is a new conference for the networking community, devoted to this idea of like, let's explore new ideas. And we're going to be, you know, not going to be wrong, but like, we're going to be very accepting of new ideas. That's going to be the main criterion we use to evaluate papers. And to accompany some great papers that got published, we also invited some sort of, you know, luminaries in the field to lend credibility to our effort and also to tell stories about, you know, about their experience with new ideas. And there's a video from Scott Shenker, what I got

Starting point is 00:50:52 wrong about QoS. So Scott is like, for those who don't know Scott, he's one of the giants in computer networking. He's another, well, Scott's not a failed physicist. He actually is a successful physicist, but he switched to networking. But he was, you know, involved in bringing networks that have support for quality of service in the late 90s and, you know, put his name behind it, wrote lots of papers. And he gave this very thoughtful piece about why things didn't work out. And, you know, what was wrong, but I mean, also what was right. But I just love that piece. I think academics should, A, be doing wild things and then, B, you know, not be so shy about reflecting on, you know, hey, this thing didn't end up catching fire or being, you know, the way the world works, but it was still interesting to explore.

Starting point is 00:51:38 Right. Yeah, I think we've had our own kind of evolution within Jane Street where I think early on in some sense we only did things that worked. Like, you know, it was a small company and there was lots of low hanging fruit and lots of opportunities. And we kind of started out with like a working business. And like, you know, there were lots of small things you could do to make that business better. And almost always make things better in relatively short timeframes. And we still do a lot of that. There's a lot of plucking of low-hanging fruit. But as the organization has grown, we've had more opportunities to try bigger things and to do projects that sometimes take years to bear fruit. And there's something lovely about that as well. And I think academia, like, can and should and does, you know, take an even more extreme version of that where, like, you can take more bets that, you know, some of these bets might not work out for 10 or 20 years. And, like, that's okay. And I think it's an exciting way of moving forward, like, the bounds of knowledge and to be able to do that in a way that isn't as, you know, constrained by the needs

Starting point is 00:52:43 to, like, get, like, the next practical thing up and running. So you've had this kind of industry experience at Barefoot. Now you're here at Jane Street with a different kind of industry experience. Like, I was sort of involved in the story of how you got here. Like, we've known each other for a long time. I think I first met you when you were, I think, a PhD student at Penn in Benjamin Pierce's office talking about the lens work. So we've known each other for a long time. But I'm kind of curious, like, just from your perspective, like, you know, you spent a long time knowing kind of vaguely of Jane Street as like this weird trading firm that uses functional programming languages. And I'm kind of curious how

Starting point is 00:53:19 that process felt to you of like what you thought about Jane Street like in the past and like what in the end led you to think, actually maybe coming and spending some time here and doing work here would be an interesting thing to do. Yeah. I think, you know, we had been talking. I was sort of, I mean, Jane Street, for those who don't know, is sort of quite prominent in the academic functional programming community, which I consider myself a part of. So, you know, Jane Street sometimes publishes papers at ICFP, which is the main conference in functional programming, sends people to the conference.

Starting point is 00:53:46 I was sort of aware of, you know, here's this company that is using a functional language for much of its work. It happens to be my favorite functional language. And also, you know, really takes on hard technical problems and is willing to sort of make these longer-term investments in, you know, systems and tools. And so I was sort of aware of that. And I think, you know, we'd been talking. I'd even had some meetings with some folks in the networking team about some of the efforts that were being made to sort of, I would say, bring SDN-ish ideas to Jane Street's network.

Starting point is 00:54:18 I wasn't around, but my understanding is, you know, Jane Street's network has sort of, you know, emerged from being a sort of smaller, you know, sort of human-managed network to something that's, well, maybe not quite as big as the hyperscalers, getting big enough that you want to use those same ideas of, kind of, you know, top-down specification, having some tools to understand what's going on, being able to really make changes with confidence. So, you know, to me, the chance to kind of work on those kinds of problems at a place that has, and we should talk more about the firm's culture, but we talked about, you know, sort of ideas that fail, Jane Street has a very open culture. There's a wiki page, you know, you can just do things. And I think that really appealed to me as well, the idea that, you know, we're not going to be sort of just doing things, you know, sort of a conventional way.

Starting point is 00:55:07 We're going to think about, you know, cool, maybe new ways to solve these problems and then get smart people to work on it together. The other thing that, I don't know if I've told you, but it was definitely in the back of my mind is although the research community has this lovely sort of tight embrace with the hardware vendors and the cloud companies and now the ML companies, I was very excited about the idea that, well, maybe financial networks are different, right? There's certain things like multicast, like latency and latency at a different time scale than what the cloud companies care about. And, you know, whenever you just like take a problem and you tweak it a little bit,

Starting point is 00:55:47 you add some new assumption or different constraint, often that leads you to a very different kind of solution. And so I thought it might be fun to understand, you know, some of the unique problems that finance and Jane Street have. And then to be at a place where, you know, there's going to be the smart people and resources to go solve some of those problems. Yeah, and for all that academia is a place where you can work on all sorts of wild ideas, it's also the case that in lots of contexts, academic work gets kind of caught up by the industry thing of the moment. And I think in networking, like the hyperscalers have that shape of like,

Starting point is 00:56:19 it's a legitimately big and important problem. And like you see that like networking papers kind of overwhelmingly want to think about that. There's like an older, fun example of this in the garbage collection world where there's like a stretch of years where all the garbage collection papers are about Java garbage collection. And, like, Java was, like, a very particular kind of language with a very particular approach to garbage collection that, like, skewed the way things were done.

Starting point is 00:56:42 Like, one interesting aspect of this is the way in which you tune a garbage collector. Like, you always have to tune a garbage collector. There's a tradeoff between space and time of, like, how much time are you going to spend collecting versus how big you're going to let the heap grow. And then, like, the traditional way of doing this in Java collectors is a kind of roofline model, where you're just going to be, like, how much memory can I use? Well, how much memory do you have?

Starting point is 00:57:02 Right? And I sort of say it's like, sort of, you know, it's like you have this big hulking enterprise application and it will run on a box and it will be able to use every bit of RAM on that box. And then like when it gets close to using it up, then it'll have to like work harder to collect memory.

Starting point is 00:57:16 And that's not like the only, and like for many contexts not really the best way of tuning it. In fact, OCaml has a very different way of tuning its garbage collector, which operates in percentage terms and like a whole different set of heuristics and stuff are in mind. But like for a long time, kind of all the papers

Starting point is 00:57:31 were structured around this roofline model, and then, like, eventually it breaks out of that. And so anyway, I sort of kind of am open to this idea that, like, yeah, it's often useful to kind of break away from the standard thing that everybody is doing. Yeah. So what, now you've been here for a while and worked on some interesting problems here.

Starting point is 00:57:50 Like, what are examples of problems that you have seen that come up in the Jane Street context that actually do look different from the problems you see in the outside? One example, we wrote a short paper on this, just getting at this question of multicast, we've already talked about a little bit. So I don't want to summarize the paper, but the basic story is, you know, support for multicast hasn't gone away, but it's really been sort of diminishing at the hardware level. And although, you know, I think most, if not all trading firms, of course, use multicast because that's what the exchanges are giving us in terms of data, the trend has been that commodity routers have gotten,

Starting point is 00:58:31 they have more features, they have a lot more bandwidth, but they're getting a little bit slower. So the latency is kind of creeping up as you make more complex pipelines that can do more processing of every packet. And then sort of relative to the growth in things like bandwidth, support for multicast has sort of been flat or even getting a little worse. And so we wrote this paper that was just asking, you know, this is not yet, I wouldn't say it's like a looming problem, but if you sort of, you know,

Starting point is 00:59:01 follow the trends out for some years, it could become an issue. And then in the second part of the paper, we sort of asked, well, what are some really different designs that we could think of? Are there different kinds of fabrics that we could build? Maybe things based on, you know, optical networks or circuit switching. And there's interesting tradeoffs. You know, you can actually build a fabric that sort of delivers lots of traffic, you know, simultaneously to lots of places. But then you have sort of a proliferation of traffic everywhere and you have to filter

Starting point is 00:59:30 it. So you sort of have this tradeoff between, yeah, sort of easy, cheap delivery with certain kinds of networks versus the ability to inspect, classify, and then, you know, split and drop. And so that's a paper that I think, you know, sort of

Starting point is 00:59:46 uniquely could be written in this context. Do you think there are lessons to be learned by places that are, you know, more in the hyperscaler mode, from the things that you see in more trading-style networks? Like, I've looked over time at, like, the kind of designs that people have built

Starting point is 01:00:05 for doing all sorts of, you know, kind of standard web-style problems of, like, you know, an example that I remember talking to some of the engineers there about is, like, the way in which Twitter does distribution and analysis and transformation of the sequence of tweets, which is like, you know, now on the modern scale, a pretty small data problem, but at some point it was a bigger one.

Starting point is 01:00:25 And I remember looking at that and thought, multicast would be really useful here. And I wondered to what degree the kind of magic powers of multicast are kind of underappreciated and underused in other kinds of contexts and that, like, maybe they should pick them up more than they actually do. Another thing that shows up a lot in trading contexts and showed up a lot in my own PhD is state machine replication,

Starting point is 01:00:53 which is a kind of core idea for building distributed systems and shows up a ton everywhere and cloud providers are also building things with this. But multicast is like a super nice primitive for building efficient state machine replication systems and it's not one that seems to show up a lot in practice. And that's like another just concrete example where I suspect it could be more useful.

Starting point is 01:01:18 I think it's, you know, multicast gets used for lots of things in a trading context, right? Using it like in a very kind of local environment in the middle of building a certain kind of more or less supercomputer, right? You can have like lots of systems that are hooked up to each other with multicast and it's a way of like giving you an efficient bus for just distributing messages to everyone. And then it can also be used as a way of connecting data across different organizations. And like that's the exchange side of this, right?

Starting point is 01:01:43 They deliver multicast as a way of efficiently and fairly getting their data out to all the many people who are consuming it. And then that same multicast tree kind of extends into the network of the consumer. And maybe that latter one is like very trading specific, right? The kind of cross-institution version of it. But the inside of the company version of it, or the inside of a system, it feels to me dramatically underused.
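The "efficient bus" idea above, where one published message reaches every system and each system applies the same messages in the same order, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any real system: `MulticastBus` and `Replica` are hypothetical names, the fan-out loop stands in for what the network would do in hardware, and delivery is assumed lossless and in order.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A deterministic state machine: here, a tiny key-value store."""
    name: str
    state: dict = field(default_factory=dict)
    next_seq: int = 0  # sequence number this replica expects next

    def deliver(self, seq, command):
        # Apply commands strictly in sequence order; ignore anything else.
        if seq != self.next_seq:
            return
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)
        self.next_seq += 1


class MulticastBus:
    """Hypothetical stand-in for a multicast group: one publish reaches all subscribers."""

    def __init__(self):
        self.subscribers = []
        self.seq = 0

    def subscribe(self, replica):
        self.subscribers.append(replica)

    def publish(self, command):
        # The sender emits one message; the loop models the network's fan-out.
        for replica in self.subscribers:
            replica.deliver(self.seq, command)
        self.seq += 1


bus = MulticastBus()
replicas = [Replica("a"), Replica("b"), Replica("c")]
for r in replicas:
    bus.subscribe(r)

for cmd in [("set", "x", 1), ("set", "y", 2), ("delete", "x", None), ("set", "y", 3)]:
    bus.publish(cmd)

states = [r.state for r in replicas]
```

Because every replica sees the same commands in the same order, all three end in the same state. A real design would also need gap detection and retransmission, which this sketch leaves out.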

Starting point is 01:02:08 Again, I don't have your distributed systems instinct, so I can't kind of spar with you on that. But one thing that I've been pondering recently is, and this is inspired by a paper by Nick McKeown and his student Sundar, they published just last year, you know, most network infrastructure is still based on, you know, the good old packet switching model. And there's reasons that we moved to packet switching in the 60s. You know, it gives us efficiencies.

Starting point is 01:02:37 We don't have to schedule things. We don't have to understand reserve capacities and so on. It's a very, you know, simple building block that has really nice properties. But, and you might wonder, like, what does this have to do with packet switching? Well, sorry, with multicast. The challenge with packet switched routers is that building support for multicast is actually pretty complicated. So doing that with low latency and, you know, building in sort of the heart of a router, you know, a unit that can move packets along, you know, any combination of input and output ports,

Starting point is 01:03:14 at speed, you know, maybe even doing some queuing at that same time. That's pretty complicated. I guess, and it's complicated because, like, if all you had was, you know, a single multicast tree to distribute along, then, like, packets would come in and you'd copy them out to multiple outputs and everything would be cool. But you actually have many different things happening concurrently. So you both want the parallelism of being able to, like, emit out of multiple wires at the same time, but also you have to tolerate all this dynamism. And there's, like, some fundamental tension there of, like, you can't, like, at the physical layer, do things completely in parallel if some of the resources you're trying to address are busy doing something else. And this part of the router is sort of the, you know, the middle part that has to run, you know, the fastest. It's what determines sort of the rest of the performance of your whole router. So that's generally complicated. So what this paper that Nick and Sundar wrote is they were looking at machine learning workloads, in particular training workloads, which are many, and often very regular, you know.

Starting point is 01:04:12 And so why are we doing packet switching at all? Why don't we just understand, you know, what's going to happen when? And then take those schedules, you know, this data here is going to be delivered according to this permutation and this one according to this permutation. And then you can build a much simpler switch that just understands how to implement these, you know, permutations on a schedule. And, you know, Nick has deep hardware understanding. So he, you know, the paper explains sort of why this would lead to, you know, simpler, cheaper, faster switches. But for that use case, it seems like, you know, the right way to do things if you were able to boil the ocean and build all the infrastructure from scratch. So to me, the sort of intellectually interesting question is, you know, we have certain kinds of networks that can do these tasks that are otherwise expensive or slow or hard. We know how to build networks that can do those tasks very well, but we're sort of afraid to build them because, you know, there's so many benefits of packet switching that we can just sort of spray these packets into the network and whatever resources we have will be used efficiently by the herd. Maybe in the future we'll start to think about going back to some hybrids where we do a little bit of both. Yeah, and in some sense we are kind of boiling the ocean or maybe like making several new oceans or something. The whole ML world is creating this enormous revolution in networking.
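The idea of a switch that only implements permutations on a schedule can be illustrated with a small Python sketch. It is a toy model and not the design from the paper mentioned above: the function names, the four-port size, and the shift pattern are all assumptions made for illustration.

```python
def apply_permutation(inputs, permutation):
    """Move the item on input port i to output port permutation[i]."""
    if sorted(permutation) != list(range(len(inputs))):
        raise ValueError("not a permutation of the ports")
    outputs = [None] * len(inputs)
    for in_port, out_port in enumerate(permutation):
        outputs[out_port] = inputs[in_port]
    return outputs


def run_schedule(schedule, traffic):
    """Apply one precomputed permutation per time slot.

    There is no per-packet lookup and no queuing: the switch only needs
    to know which permutation is active in the current slot.
    """
    return [apply_permutation(slot, perm) for perm, slot in zip(schedule, traffic)]


# Hypothetical 4-port example with a regular, known-in-advance pattern.
schedule = [
    [1, 2, 3, 0],  # slot 0: port i sends to port (i + 1) mod 4
    [2, 3, 0, 1],  # slot 1: port i sends to port (i + 2) mod 4
]
traffic = [
    ["a0", "b0", "c0", "d0"],  # what each input port holds in slot 0
    ["a1", "b1", "c1", "d1"],  # what each input port holds in slot 1
]
delivered = run_schedule(schedule, traffic)
```

Since each slot is a permutation, no two inputs ever contend for the same output, which is the property that removes the need for arbitration in this simplified picture.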

Starting point is 01:05:31 And you have both like much higher demands in terms of throughput and latency just because of the, in fact, in part because of this very regular process, right? A lot of the synchronization in machine learning training is this kind of barrier synchronization where like, you know, you have a bunch of hardware in parallel doing a thing. That hardware is actually very deterministic, and so it finishes pretty much at the same time across multiple, and then they need to, like, exchange their tensors really quickly. And all that time where they're exchanging tensors, I mean, you can do some overlapping,

Starting point is 01:06:00 but, like, sometimes you can't do overlapping, and any uncovered, like, communication is just time where these very expensive GPUs are just idle, and so just wasted money. And so there's a huge amount of pressure on these networks, and people are also increasing the heterogeneity of the networks, because now you have, like, the networks on the inside of the

Starting point is 01:06:18 like I guess NVIDIA uses maybe everyone uses this kind of somewhat odd terminology of scale up versus scale out where like scale up is like the really fast little network and the scale out is the network beyond that so that could be a context where you have the freedom to go and try like very different things yeah it's actually

Starting point is 01:06:36 in networking it's a pretty exciting time because people are playing with maybe not ideas quite as radical as like building just a scheduled crossbar and making that be the building block but there is a lot of innovation in transport protocols, collectives, co-optimizing, you know, the low-level CUDA code and the communication code. So things feel very like suddenly like, oh, we can sort of play with all these pieces of the design. And then, you know, because training and serving AI models is sort of the, you know, the central problem of the day for systems, you kind of get immediate feedback.

Starting point is 01:07:09 And when something works, you know, people get very excited. So another thing you've been working on while you're here is BGP. Maybe you could say a few words about what BGP is and then talk about like the problems that we've run into that you're working on making better. Yeah. So maybe one of the areas where, you know, Jane Street was sort of living sort of in the past. We have a by now big worldwide network, and it's been growing a lot, that connects all of our sites. And although we have a lot of tooling and analysis, we were still expressing what we wanted the wide area network to do in terms of configurations for individual BGP routers.

Starting point is 01:07:46 And that very much feels like sort of the dark ages. So maybe I'll quickly explain what BGP is. BGP is what was originally designed as the routing protocol for the internet. So you have the internet with all these tens of thousands of autonomous systems. Every organization is its own system, gets to decide how it routes traffic to other autonomous systems, and you need some protocol that these so-called ASes can use to agree on how traffic flows. What's an AS?

Starting point is 01:08:15 Autonomous system. Oh, okay. And the way that BGP works is, essentially every AS knows who its neighbors are, and it selectively shares information about certain paths it knows to reach certain destinations.

Starting point is 01:08:35 So, for example, our routers that connect to our peers on the internet might say, well, hey, we're Jane Street. If you want to reach any Jane Street IP address, come to us. And then those routers will send to their neighbors, you know, a similar advertisement. You know, if you want to reach Jane Street, I can reach them in one hop. And you can also share other characteristics about the path. So this is sort of the basics of BGP. It's a so-called path vector protocol. It's disseminating information through the internet about paths that reach

Starting point is 01:09:08 certain destinations. And what makes it very rich is there are many so-called attributes that you can add to these advertisements. So you can decorate an advertisement, not just with I know how to reach Jane Street, but I know how to reach Jane Street with this cost. You can add, you know, sort of tags. You can add a whole bunch of information. And now when a router receives this advertisement,

Starting point is 01:09:30 it can sort of compare maybe a whole bunch of advertisements it has, all for reaching Jane Street on different paths, and it can then make a selection and decide which one it thinks is best. So it lets every node kind of make a local choice and express its own preferences, but it also kind of quickly disseminates information about all the paths through the Internet. And in some sense, it's sort of like

Starting point is 01:09:50 the opposite of what we were describing in this kind of, you know, I step back and write a big program that lays out a mostly static graph. This is like, instead, I have like a rich distributed system of individual nodes sharing information and then making local decisions about how to route data,

Starting point is 01:10:09 although hopefully somewhere in there there's like some reason to think that those local decisions actually lead to good global outcomes. That's right. What I've described is basically how BGP works on the internet. And it's actually, it was not known for a long time why BGP seems to work so well. You would think that a bunch of nodes that are just...

Starting point is 01:10:29 That would seem like a very important thing to know. Making independent decisions. Well, you know, in fact, the internet routes are fairly stable. So things sort of converge to... I mean, the Internet's always in motion, of course, but if we could sort of pretend that we could sort of stop the world, you know, the internet sort of converges to the paths that are sort of, you know, at least a local optimum, pretty well. And there's a really nice paper by Jen Rexford, my postdoc advisor,

Starting point is 01:10:59 and Lixin Gao that explains sort of why this is the case. And it turns out that the internet sort of has a kind of structure that comes from the economic relationships. You know, you have sort of ISPs and customers. And because of the kinds of BGP choices that different players in this ecosystem tend to make, it turns out there's sort of latent, you know, properties that cause BGP to behave particularly well. These are the so-called Gao-Rexford conditions. This is like ancient stuff, but it's kind of cool that this like unruly distributed system actually kind of works pretty well for these reasons. Like, why did this present problems for us? So one other thing I haven't said is that BGP is often also used

Starting point is 01:11:39 inside of organizations. And it's a little confusing because it was, again, originally designed for, you know, the internet where the nodes that are participating are an entire organization like Jane Street or an entire university like NYU.

Starting point is 01:11:53 But of course, inside of Jane Street, there are also, you know, many thousands of routers. And they need to understand how to reach certain destinations, both internal and external. And so there are other protocols that have been used in the past,

Starting point is 01:12:06 but for many decades now, it's been really common to use BGP also internally to share knowledge about what paths exist. Is part of the reason for that basically the dynamism that you need? Like, if nothing else, links can fail and you need to be able to recover from link failure? Yeah, I think it's a really expressive protocol. It's got sort of all these ways that you could sort of cram in information about different routes and make choices and selectively disseminate information. It's widely supported by vendors.
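The path-vector behavior described in this part of the conversation (advertisements decorated with attributes, a local choice among candidates, and selective re-advertisement) can be sketched as follows. This is a heavily simplified illustration, not a BGP implementation: it ranks only by local preference and then AS-path length, the final tie-break on the path itself is an arbitrary assumption, and the AS numbers and prefix come from ranges reserved for documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    prefix: str                           # destination being advertised
    as_path: tuple                        # ASes the route traversed, nearest first
    local_pref: int = 100                 # locally assigned preference; higher wins
    communities: frozenset = frozenset()  # free-form tags attached to the route


def select_best(own_as, candidates):
    """Pick one route: highest local_pref, then shortest AS path (simplified)."""
    # Loop prevention: discard any route whose path already contains our AS.
    usable = [a for a in candidates if own_as not in a.as_path]
    if not usable:
        return None
    # The last key component is an arbitrary deterministic tie-break (assumption).
    return min(usable, key=lambda a: (-a.local_pref, len(a.as_path), a.as_path))


def readvertise(own_as, best):
    """What we would tell a neighboring AS: same prefix, our AS prepended."""
    # local_pref is a local decision, so it is reset rather than passed along.
    return Advertisement(best.prefix, (own_as,) + best.as_path,
                         communities=best.communities)


OWN_AS = 64500
PREFIX = "198.51.100.0/24"
candidates = [
    Advertisement(PREFIX, (64501, 64510), local_pref=100),
    Advertisement(PREFIX, (64502, 64503, 64510), local_pref=200),
    Advertisement(PREFIX, (64504, 64500, 64510), local_pref=300),  # contains our AS
]
best = select_best(OWN_AS, candidates)
outgoing = readvertise(OWN_AS, best)
```

Here the longer path wins because the receiving router prefers it locally, which is the sense in which each node gets to express its own preferences.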

Starting point is 01:12:33 All the network engineers know it because, you know, it's been this way for a long time. So, you know, why not? It's a good tool for, you know, disseminating information about the network topology and its paths. Right. I guess in an alternate universe where you're like totally down the SDN route, you could imagine that you could just like look at your overall network and just like decide what you want to lay out.

Starting point is 01:12:55 And then you don't have to think about like the communication part of it. But then it doesn't have a, there's no story there for a dynamic behavior. Yeah, I think there's one of the sort of maybe flaws of the original SDN conception is, although you might want to think about your specifications or your program as kind of being sort of truly one program. I have sort of one objective for my network and I'm going to sort of, you know, check that into a repository and have people review it and argue about it and test it. But then the way you realize that, there are good reasons to have distributed protocols.

Starting point is 01:13:24 They, you know, they detect and respond to changes very quickly. They don't involve lots of coordination. And so if you can map your high level objectives into a distributed implementation, there are good reasons in a large system like one that spans the whole world to do that. Maybe you want to compile down to something, but maybe you don't want to compile down to a static graph. Right.

Starting point is 01:13:45 So the system I've worked on here is a system called Butane, and really we're trying to do exactly this. We're trying to have, what we have, a sort of higher level, and in our case, it actually is centralized. You know, it's like checked into our repositories, and there's, you know, if you want to make a change, you go propose a change,

Starting point is 01:14:03 and it gets reviewed, just like our other software. But then it gets compiled into snippets of BGP, one for every router in the network. And then the behavior of the whole thing is somehow the distributed behavior of all of these routers exchanging BGP messages with each other,

Starting point is 01:14:20 and then ultimately arriving at some graph that forwards packets through the network. So I guess the top level of Butane is some kind of specification of what you want the behavior to be, and then that compiles down to like the actual concrete configs that land everywhere. That's right.

Starting point is 01:14:33 So what can you say in that top level spec? So the top level spec, I guess, you know, one thing that we care a lot about is latency. So we, to a first approximation, you know, would like certain kinds of traffic to absolutely take the fastest paths. And then, you know, there's other traffic that we just want to get to its destination somehow. And we may not actually care that much, you know, if it loops around the world to get there, if it takes, you know, twice as long or three times as long as it has to, as long as it gets there, that's okay. Okay. And so Butane's policy abstractions are really designed to support, you know, sort of the default case just sort of happens. You don't have to specify very much. And then where you want to say, no, this traffic should take a fast path. You can do that. And then there's some other pieces that are kind of a little bit inside baseball. But, you know, we internally have certain structures to the network. And so the policy abstractions sort of expose some of those structures. Things like, you know, where are there sites? We have certain expectations about how traffic may flow or not flow between certain sites.

Starting point is 01:15:40 There's a whole set of ways that we sort of classify and differentiate traffic. So there's some features for doing that as well. And then, like, what is the technically hard part of this story? Like, I could, you know, I could, like, write a program that writes a bunch of BGP configs. But I don't have much of a link between, like, what I wanted to happen and then what's, like, the dynamic behavior of it? Like, is that like the central problem here? So, let me not answer the question. I'm not going to talk about the hard part. I want to tell you

Starting point is 01:16:10 why has Butane been valuable or what have we found is sort of the most important parts of Butane. And to a first approximation, it's a little bit something we talked about before. It's just bringing a sort of software mindset to thinking about the wide-area network. So instead of, you know, operators having some change they want to do, network engineers might want to, you know, move this traffic over to this other path and then go, you know, make a bunch of changes expressed at the BGP config level to, you know, several routers. Now you modify a little bit of Butane config. The compiler generates the actual BGP code and then it gets validated.

Starting point is 01:16:50 We have some tools for visualizing and validating the changes and then it gets pushed to the network. So there's actually not, I would say, giant technical challenges. This is all sort of fairly well understood stuff. But, you know, for Jane Street, moving us from a world where we're making changes to individual routers at the BGP config level to being able to work with these other abstractions has been pretty exciting. And I'll say, I was actually not sure what the network engineers would think. But so far, people seem to really like it. The ability to kind of take these bigger steps has been very exciting. And then the other piece, which maybe surprised me a little bit, is all of the tooling that we've built,

Starting point is 01:17:32 especially tooling for testing and visualizing, that's what they're actually so excited about. You know, the idea that I can make a change. And then we have this UI where you can see what's the expected, you know, change in latency, say, between sites. And this is something that, you know, maybe they could have worked out on paper or could have pushed and then tested. But now we have a model, again, based on some, you know, historical measurements and we have a semantics for both Butane and BGP, so we can sort of, you know, compute what's the change going to be. We can do this both for sort of like, you know, propose one change and then see what might happen. And we're in the midst of making this more powerful.

Starting point is 01:18:08 So you can do sort of what if kinds of things. Like, what if we lose a link? Well, what if we lose N links? Or, you know, I'd really like this hotspot to disappear. What's the, you know, propose to me a set of changes you could make that would, you know, move traffic off here while minimizing other changes. So that's a different kind of, I don't know, edit or, you know, UI that you might like to have where you're not specifying a different, you know, Butane policy abstraction. You're really kind of expressing constraints on what you'd like the system to get to.
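A "what if we lose a link" question of the kind described above can be illustrated with a small latency model. The sketch below is not Butane and ignores BGP policy entirely: it assumes traffic simply follows the lowest-latency path, and the site names and latencies are invented for illustration.

```python
import heapq


def lowest_latency(links, src, dst):
    """Dijkstra over an undirected graph; links maps (a, b) -> latency in ms."""
    graph = {}
    for (a, b), ms in links.items():
        graph.setdefault(a, []).append((b, ms))
        graph.setdefault(b, []).append((a, ms))
    best = {src: 0.0}
    queue = [(0.0, src)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == dst:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, ms in graph.get(node, []):
            candidate = dist + ms
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return None  # destination unreachable


def what_if_link_lost(links, lost, src, dst):
    """Return (latency before, latency after) removing one link."""
    remaining = {k: v for k, v in links.items() if k != lost}
    return lowest_latency(links, src, dst), lowest_latency(remaining, src, dst)


# Hypothetical sites and one-way latencies in milliseconds.
links = {
    ("nyc", "ldn"): 35.0,
    ("ldn", "hkg"): 95.0,
    ("nyc", "chi"): 8.0,
    ("chi", "hkg"): 90.0,
}
before, after = what_if_link_lost(links, ("chi", "hkg"), "nyc", "hkg")
```

In this toy topology, losing the `chi`-`hkg` link moves `nyc` to `hkg` traffic from a 98 ms path to a 130 ms path. Extending the same loop over sets of N links gives the multi-failure version of the question.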

Starting point is 01:18:39 So do you essentially need a kind of solver which does, like, exploration of the space of other configs? The second thing I described is still research. Got it. Yes. So we're actively working on that. But yeah, it's going to look very much like, you know, ideas from program synthesis, or some kind of solver where you can take these constraints and then explore the space of programs that might meet them. This is something that, you know, again, maybe to a programming languages person doesn't sound all that wild.

Starting point is 01:19:06 But to do this to the network infrastructure, I think, is pretty wild. I mean, and what keeps it safe is that we do have this, we have a semantics for both Butane and BGP. It's extensively tested, both kind of mathematically and against the actual hardware. You know, you have to make sure that the way that BGP is realized by the vendors doesn't somehow differ from the internet standards. And so we have quite a lot of confidence that our model of BGP is good, and therefore we can do these analyses and give answers to engineers before them. Right.

Starting point is 01:19:39 And at the end of the day, there's a kind of formal methods piece where it is trying to do something that gives a, like, provable up to the fact that you don't know if the underlying model is quite right because the vendors might be doing something differently, but, like, gives you a kind of provable guarantee of the facts it is analytically telling you about the network. Is that right? Yeah. I think, you know, this use of formal methods actually may become, you know, much more commonplace. So I think formal methods people often think, you know, I build a tool to verify something to like stop bugs. I'm going to, you know, this is going to be the seatbelt of my,

Starting point is 01:20:10 of my complex system. But here, of course, we care about, you know, stopping bugs and preventing us from making mistakes. But the real power is, you know, now we can start to explore. So we can start to take bigger steps and even automate some of the exploration of those steps. And that would be unthinkable if you have to reason about the impact of all these changes, network-wide, let alone on latency, on congestion. And so having models that are backed by some kind of mechanical implementation that you believe at least closely corresponds to what's going to happen, that's what empowers these tools that take bigger steps.

Starting point is 01:20:51 Right. And rather than it giving you a single notion of correctness of like, I've written a spec and does it follow the spec, it's almost like a kind of observability of like, I get to like explore different possible setups and think about how they perform and what their behaviors are so that I can like with more confidence make tradeoffs between different design decisions and figure out how I want to structure the network. So one surprising use of this that is just happening in the last couple weeks is some of the folks in the team are starting to use it for not actually routing, but capacity planning. So deciding, you know, what future links should we buy? And this is

Starting point is 01:21:29 something where there is actually a well-understood mathematical theory. You can model all the demands for bandwidth and the current network and then figure out how to augment it, you know, subject to what fiber is available. But to do that connected to our current Butane policies and our historical workloads and latencies is kind of cool. So to put it into not just how do we route today's traffic, but how do we figure out how to expand the network so that we can do a better job of routing tomorrow's traffic.

Starting point is 01:21:57 And I guess that's just all on the back of essentially having a more complete model of the network that includes these kind of dynamic behaviors. Yeah. I'll maybe say one more thing. You asked about sort of what was hard about Butane, and we benefited from a lot of academic work. People like Zach Tatlock at UW

Starting point is 01:22:11 had written down a formal semantics for BGP in the Rocq proof assistant, and we didn't use their implementation, but their paper did a really nice job of spelling out, you know, here's how BGP configurations should be understood. There was also work on designing higher level abstractions

Starting point is 01:22:28 for BGP, and some of these were really different, you know, sort of the analog of like OCaml instead of x86. And we actually chose not to adopt those. So our Butane policy abstractions are fairly simple, and kind of like OCaml, they compile fairly

Starting point is 01:22:44 straightforwardly into BGP configs for different vendors. And we made that choice early. And, you know, I was, if I'm honest, a little bit disappointed. It would be fun to do something kind of more wild, you know, to have something like OCaml as an analogy. But I think, you know, it was the right choice to actually pick something where the abstraction abstracts, but in a way that you can, when needed, sort of peel back the abstraction and understand how it might map to all the components of a BGP config.

Starting point is 01:23:19 And so, you know, there are hard problems we could have solved. Like, how do you compile some very expressive policy language onto a bunch of distributed router configs? We did not solve that problem. Or we didn't solve it in the fanciest way we could have. Maybe in the future, you know, we will start to explore richer policy abstractions. But in this context, at least, I very much now agree, it was the right choice to sort of pick something relatively simple and then, you know, iterate based on that. Yeah, there's something

Starting point is 01:23:47 powerful about having an abstraction that isn't just simple in terms of its semantics, but is simple in terms of how it elaborates, how it goes from the high-level thing that you want to the actual thing you're running. If you think about people who are trying to, like, in close detail, engineer how a system behaves in all sorts of different aspects, just giving them that kind of vision onto the behavior. And I think the point about OCaml, it's sort of both. It's both a fancy, wild, high-level language and it also has a relatively straight-ahead compilation story, where it's relatively easy to understand from looking at the OCaml code how that code is going to

Starting point is 01:24:22 execute. And so that, you know, I mean, there are tradeoffs here obviously. I think more optimization is good, but also more straight-ahead compilation makes it easier to think about what's happening. Yep. How did the process of taking, like, these ideas that, you know, came in part from you and in part from the team and, like, turning this into, like,

Starting point is 01:24:43 a thing we could actually, like, roll out into the network and use. So it's kind of amazing. It's worked this way. Maybe I could say a bit about kind of how my engagement as a visiting researcher has gone. So, you know, I've been spending about a day a week here for a few years.

Starting point is 01:24:56 I've spent some periods of time where I've spent, you know, more intensely, you know, a whole week or, you know, several days in a row. And so a lot of the design of the system was done by, you know, sitting side by side with some network engineers, some folks on our team that builds networking tools and, you know, trying to understand our

Starting point is 01:25:18 problems, trying to come up with some solutions, and then prototyping them, and then, you know, seeking feedback from others on the team and continuing. So for me, this has been really fun because, you know, in my day job as a professor, I'm teaching and doing research with a team of students, you know, really the goal is to train students. Here I get to work with really great engineers. And, you know, that's a great privilege. Not that my students aren't also great, but, you know, it's fun to work with sort of really smart software engineers who can solve problems really quickly.

Starting point is 01:25:54 For me, it's always a little bit sad, like when my day or week ends, you know, we'll have had a team meeting and maybe I'll have done a little bit of development and synced with the team. And then I know that by next Thursday, when I teleport back in, you know, amazing things will have happened. But that's very much been my experience. So I'm sort of, you know, I get the privilege of being sort of a small player in this team and then I get to work with some really fantastic engineers. Has the process of, like, actually putting it into production been relatively straightforward or complicated? Like, how's that played out? So I think I have had less

Starting point is 01:26:28 a role to play here than in the design of the system. Again, I've had the privilege of working with a program manager and some other people from the more operational side that have really helped roll it out. And there's all kinds of problems that have had to be solved. We had Butane, you know, running, you know, sort of its current set of features were more or less done. And then we did a bunch of testing. And then we did more testing in the lab. And then we started to roll it out.

Starting point is 01:26:55 And the first rollout was on a new piece of infrastructure that Jane Street had stood up somewhere else in the country but wasn't yet using. So we had sort of a living lab that we could roll it out in and sort of, you know, kick the tires and see how it worked. And then, you know, rolling it out more broadly across the firm was yet another step. And, you know, this involves doing firmware upgrades to the whole fleet of routers, making sure everything's on a good version. That's very hard to do. Building all the tooling for automation so that these things can go into the standard workflows that we have for making changes to the infrastructure. So again, I did almost nothing here, but there's an amazing set of people and processes. And maybe I'll just say one thing that's kind of not just,

Starting point is 01:27:39 I mean, every large company has these kinds of processes, but one thing that's really neat about Jane Street is a lot of these tools are really homegrown. And so in some cases, we've gotten to work with the team that's actually building, you know, the deployment tool. And if there's something that we need that the tool doesn't have, you know, they can build it for us. Yeah, it's kind of fun.

Starting point is 01:27:59 For both good and ill, like Jane Street has been on its own, like, you know, 25-year-long software adventure that has been, like, pretty different from other places. And, like, some of that had to do with a language choice and some of it has to do with idiosyncrasies of the kind of business that we're in. But, like, Jane Street's software ecosystem is kind of weird. And, you know, I think there's a lot of great things about that. I think there are some pain points about that. But you certainly get a lot of control all the way through the stack, which is really cool. Yeah. So part of the point of the whole visiting researcher program is that it's a way for us to kind of build good relationships with researchers.

Starting point is 01:28:37 And I think part of the value proposition for the researchers is that connection to the kind of work we're doing internally is a kind of useful way for them to kind of develop ideas and learn about the world in a way that can influence their research beyond our walls. I'm curious to what degree you feel like that has played out. And to what degree, like, you have learned things that affect how you think about research outside of the stuff you do here. I'll confess that part of why I've had so much fun spending a day a week in industry is I actually like, you know, maybe playing software engineer. So I really enjoy, I love my job as a professor.

Starting point is 01:29:15 It's the best job in the world. You know, there are a lot of interruptions, even if you're very careful with your time. You end up spending a lot of time on teaching and service, and working with students is a joy, but it takes a lot of time. So it's really wonderful to have a day where, you know, my calendar is blocked off. I have meetings with folks here, but, like, Jane Street is pretty efficient with meetings. I don't, like, my day doesn't get filled up with, you know,

Starting point is 01:29:37 one-on-one Zoom calls or anything. And I actually get to sit in front of a terminal and, you know, and write some code. I have to be very modest about how much I can do in a day, but for me, that gives me a lot of joy. In terms of what I've brought back, some of the technical ideas that we used in Butane are things that I'm now working on in my lab.

Starting point is 01:29:58 So I didn't talk too much about this, but under the hood, the semantics for BGP and Butane that we built originally was based on just sort of a simple simulator, a little operational model. But there's a more powerful mathematical model that's been studied for some time in the networking community based on a more algebraic approach. So that's become a topic that I'm working on with my students in my group. Can I ask, what's the upside of the algebraic approach? How would it be better than this kind of more operational model? So I think the upside of the algebraic approach is, the original vision was to sort of give you the building blocks for building policy DSLs.

Starting point is 01:30:43 There's a paper by Tim Griffin called Metarouting, and that expresses this idea. So the assumption is choices that you make about how to route traffic through a network are very specific to a particular organization. Every organization has their own policy about how they want traffic to flow, how they want to share information with their neighbors and so on. So you can't really come up with a one-size-fits-all solution. And even what information you choose to share with your neighbors, you know, you might be sharing information about latency or bandwidth or trust.

Starting point is 01:31:17 And so Tim anticipated that, you know, BGP might be the assembly language of many policy languages that might all look really different. And so how could we design some general building blocks for designing these policy languages? And so the algebraic approach is that you sort of abstract, you know, what's happening in BGP. There's basically information being exchanged with your peers. And then there's choices being made about the information you receive from your peers.

Starting point is 01:31:46 You make a selection. And you can model those abstractly in terms of some kind of what he calls a rooting algebra, because he's British, but routing algebra. And once you do this, you could sort of, you know, write down, here's what a routing algebra is. Here's what happens if you take a given routing algebra and you sort of run it in a graph. Here's what, you know, here's the set of paths you'll get. But also, here are constructions you can do on routing algebras. So I could take two routing algebras, maybe one that talks about latency and one that talks about bandwidth.

Starting point is 01:32:19 And I could run them together. I could sort of glue them together. And now I get a routing algebra that shares information about both latency and bandwidth. And it makes choices in some deterministic way about, you know, whether it prefers low latency or more bandwidth. But I can glue them together, glue their preference function together, and now I have a more interesting routing protocol. So you could sort of, this sort of becomes a factory for building DSLs. And then the really interesting part is... So basically more composability in the space of writing policies?
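A minimal sketch of the gluing construction described above, assuming a routing algebra can be reduced to three pieces: how a link label extends a route, which of two routes is preferred, and the route a destination announces for itself. The names `RoutingAlgebra`, `lex_product`, and `best_routes`, the toy graph, and the choice to prefer latency first and break ties on bandwidth are all illustrative assumptions, not Butane's design or the exact formulation in the Metarouting paper.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class RoutingAlgebra:
    """Hypothetical, simplified routing algebra."""
    extend: Callable[[Any, Any], Any]   # (link label, route) -> longer route
    prefer: Callable[[Any, Any], bool]  # True if the first route is at least as good
    origin: Any                         # route the destination has to itself


def lex_product(first: RoutingAlgebra, second: RoutingAlgebra) -> RoutingAlgebra:
    """Glue two algebras: compare on `first`, break ties with `second`."""
    def extend(label, route):
        return (first.extend(label[0], route[0]), second.extend(label[1], route[1]))

    def prefer(a, b):
        a_ok = first.prefer(a[0], b[0])
        b_ok = first.prefer(b[0], a[0])
        if a_ok and b_ok:  # tie on the first component
            return second.prefer(a[1], b[1])
        return a_ok

    return RoutingAlgebra(extend, prefer, (first.origin, second.origin))


def best_routes(algebra, edges, dest, max_rounds=100):
    """Run the algebra in a graph: iterate until no node can improve its route."""
    routes = {dest: algebra.origin}
    for _ in range(max_rounds):
        updated = dict(routes)
        for (u, v), label in edges.items():
            if u == dest or v not in routes:
                continue
            candidate = algebra.extend(label, routes[v])
            strictly_better = u not in updated or (
                algebra.prefer(candidate, updated[u])
                and not algebra.prefer(updated[u], candidate)
            )
            if strictly_better:
                updated[u] = candidate
        if updated == routes:
            return routes
        routes = updated
    raise RuntimeError("no fixed point reached within max_rounds")


# Latency adds along a path (lower is better); bandwidth is the bottleneck (higher is better).
latency = RoutingAlgebra(lambda l, r: l + r, lambda a, b: a <= b, 0)
bandwidth = RoutingAlgebra(lambda l, r: min(l, r), lambda a, b: a >= b, float("inf"))
glued = lex_product(latency, bandwidth)

# Edge (u, v) means u can forward to v; labels are (latency, bandwidth).
edges = {
    ("a", "b"): (10, 100),
    ("b", "d"): (10, 100),
    ("a", "c"): (5, 10),
    ("c", "d"): (15, 40),
}
result = best_routes(glued, edges, dest="d")
print(result["a"])  # (20, 100)
```

In this toy graph both of `a`'s paths have latency 20, so the latency algebra alone cannot separate them; the glued algebra picks the path through `b` because its bottleneck bandwidth is 100 rather than 10.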

Starting point is 01:32:48 That's right. Okay. And then you can also study, you know, if I, it's not the case that every instance of BGP or every routing algebra is going to converge to a unique solution. Sometimes, so BGP in general can have this property that you end up oscillating between multiple solutions and, you know, A prefers these paths, you use those paths for a while, but then B's unhappy, so then you switch to the other paths and then A is unhappy. So that's bad. You would like that not to be the case. And you can study generally, you know,

Starting point is 01:33:19 what conditions on my algebra do I have to have to ensure convergence? Got it. Anyways, these are kind of old ideas, but we started to play with these in my lab. And I think, you know, the grander vision is we'd like to take the sort of original vision of SDN that you can write a sort of top-level program, but we'd like to have the distributed implementation based on BGP, which is widely supported and such. And so this could be sort of the IR of that kind of system. Cool.
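The oscillation mentioned earlier, where A and B each end up unhappy in turn, can be reproduced with a tiny simulation. This is a sketch under stated assumptions: two hypothetical routers `a` and `b` that each prefer to reach destination `d` through the other, and a schedule in which both routers update at the same moment. This configuration is often called DISAGREE in the stable-paths literature; the function names and the synchronous schedule are illustrative choices.

```python
# Ranked permitted paths per node, most preferred first. The destination is "d".
PREFS = {
    "a": [("a", "b", "d"), ("a", "d")],
    "b": [("b", "a", "d"), ("b", "d")],
}


def step(current):
    """Every node simultaneously picks its best path given its neighbor's current path."""
    chosen = {}
    for node, ranked in PREFS.items():
        for path in ranked:
            tail = path[1:]
            # A path is usable if it goes straight to d, or if the next hop
            # is currently using exactly the rest of the path.
            if tail == ("d",) or current.get(tail[0]) == tail:
                chosen[node] = path
                break
    return chosen


def run(start, rounds):
    """Return the sequence of states visited, starting from `start`."""
    states = [start]
    for _ in range(rounds):
        states.append(step(states[-1]))
    return states


both_direct = {"a": ("a", "d"), "b": ("b", "d")}
states = run(both_direct, 4)
for s in states:
    print(s)

# One of the two stable outcomes: a goes through b, and b goes direct.
stable = {"a": ("a", "b", "d"), "b": ("b", "d")}
print(step(stable) == stable)  # True
```

Starting from both routers using their direct paths, each sees the other's direct route and switches to go through it, which invalidates both new choices, so both fall back and the cycle repeats. The same preferences also admit stable assignments, such as `stable` above, which is why whether the network settles depends on the order of updates and why conditions on the algebra that rule this behavior out are worth studying.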

Starting point is 01:33:50 So like a kind of richer compiler. Can you imagine that work eventually reflecting back into the kind of things that we're doing here? Potentially. I think we, as I mentioned, we were sort of somewhat modest in our original goals for Butane's policy abstraction. And we have, you know, things we know we could do that are fancier. And so I think to take that next step, it may be that understanding, you know, that the system as a whole is working well could be done in terms of routing algebras. Cool. All right, well, maybe that's a good place to stop.

Starting point is 01:34:23 Thank you so much. Thank you. It's been a lot of fun. You'll find a complete transcript of the episode, along with show notes and links, at signalsandthreads.com.
