Signals and Threads - What Is an Operating System? with Anil Madhavapeddy
Episode Date: November 3, 2021

Anil Madhavapeddy is an academic, author, engineer, entrepreneur, and OCaml aficionado. In this episode, Anil and Ron consider the evolving role of operating systems, security on the internet, and the... pending arrival (at last!) of OCaml 5.0. They also discuss using Raspberry Pis to fight climate change; the programming inspiration found in British pubs and on Moroccan beaches; and the time Anil went to a party, got drunk, and woke up with a job working on the Mars Polar Lander.

You can find the transcript for this episode on our website.

Some links to topics that came up in the discussion:
- Ron, Anil, and Jason Hickey’s book, “Real World OCaml”
- Anil’s personal website and Google Scholar page
- The MirageOS library operating system
- Cambridge University’s OCaml Labs
- NASA’s Mars Polar Lander
- The Xen Project, home to the hypervisor
- The Tezos proof-of-stake blockchain
- The Coq Proof Assistant system
Transcript
Welcome to Signals and Threads,
in-depth conversations about every layer of the tech stack, from Jane Street.
I'm Ron Minsky.
It is my pleasure to introduce Anil Madhavapeddy.
Anil and I have worked together for many years in lots of different contexts.
We wrote a book together.
Anil and I and Jason Hickey together wrote a book called Real World OCaml. We have spent lots of years talking about and scheming about OCaml and the
future of the language and collaborating together in many different ways, including working together
to found a lab at Cambridge University that focused on OCaml. And Anil is also a systems
researcher in his own right, an academic who's done a lot of interesting work, and also an industrial programmer who's built real systems with enormous scale
and reach.
We're going to talk about a lot of different parts of the work that Anil has done over
the years.
To start with, though, I want to focus on one particular project that you're pretty
well known for, which is Mirage.
Can you give us a capsule summary of what Mirage is?
Sure I can, and it's great to be here, Ron. The story of Mirage starts at the turn of the century.
In the early 2000s, pretty much every bit of software that ran on the internet was written
in C. And back then, we had internet worms that were just destroying and tearing through
services because there were lots of problems like buffer overflows and memory errors and
reasons why the
unreliability of all the systems code that had been written in the past was becoming really
obvious and the internet was really insecure. So there I was as a fresh graduate student in
Cambridge, and I decided that after years of doing systems programming in C, I would just have a go
and see what it was like to rewrite some common internet protocols using a modern high-level language.
And so I looked around and I looked at Java, which was obviously the big language back then.
I looked at Perl, which was heavily used for scripting purposes. But in the end, I decided
I want something that was the most Unix-like language I could find. And I ended up using
OCaml. It had fast native code compilation that just ran in Unix, could be debugged very easily.
It had a very thin layer to the operating system. I spent a great couple of years figuring out how to write really safe
applications in OCaml. So I started by rewriting the domain name service, which is how we resolve
names, human readable names like google.com to IP addresses. And I rewrote the secure shell
protocol, which is how most computers just talk to each other over remote connections. And I
rewrote all of these in pure OCaml. And I showed as part of my PhD research that you could make these not only as
high performance as the C versions, which really wasn't that well known then because there was a
perception that these high level languages would be quite slow. But then I also showed that you
could start doing some high level reasoning about them as well. You could use model checking and
early verification techniques to prove high-level properties. And this was all really good fun. I wrote loads of OCaml code, and then I published all these papers,
and then I asked myself a simple question. So I've written all of this code to rewrite network
protocols and have safe applications, but then the compiler just seemed to stop. So after all
of these beautiful abstractions and compilation processes, I got a binary at the end, and this
binary just talked to this operating system. And I might have written 100,000 lines of OCaml, but this operating system had 25 million
lines of C code, the Linux kernel. So why, after all of my hard work and perfecting this beautiful
network protocol, do I have to drag along 25 million lines of code? What value is that adding
to me when I've done so much in my high-level language? And this is where Mirage OS comes in. So Mirage OS is a system written in pure OCaml,
where not only can common network protocols and file systems and high-level things like web servers and web stacks all be expressed in OCaml, but the compiler just refuses to stop. We then
provide different abstractions to plug in the actual operating system as well.
And so the compiler, instead of stopping and generating a binary that you then run inside
Linux or Windows, will continue to specialize the application that it is compiling, and it will emit
a full operating system that can just boot by itself. The compiler has specialized your high
level application into an operating system that can only do one thing, the thing that it was written to do. And it does this not just by looking at the source code. It also looks
at your configuration files, which are also written in OCaml. It evaluates all of those in
combination with your business logic. And then it compiles a whole thing in combination with
operating system components written in OCaml, like TCP IP stacks and low-level file systems
and network drivers and those kinds of things. And it's what's known as a unikernel. A unikernel is a highly specialized binary output.
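To give a flavor of what "configuration files, which are also written in OCaml" means in practice, here is roughly what a minimal Mirage configuration has looked like in past releases. The names foreign, console, and default_console come from an older Mirage configuration DSL and may differ in the exact release you use; this is a sketch, not code from the episode:

```ocaml
(* config.ml -- evaluated by the mirage command-line tool, not by your application. *)
open Mirage

(* Declare the unikernel's entry point: a functor Unikernel.Hello that
   expects a console device and produces a job. *)
let main = foreign "Unikernel.Hello" (console @-> job)

(* Register the unikernel, wiring in the default console implementation. *)
let () = register "hello" [ main $ default_console ]
```

Running something like `mirage configure -t unix` (flags vary by release) then links the Unix implementations of each device, while a Xen-style target links the pure-OCaml ones and produces a bootable unikernel image.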
So MirageOS started off as an experiment in my PhD 15 years ago. I've been joined by an
incredible community, initially by Thomas Gazagnaire and David Scott, and now by a large MirageOS core
team. And we have hundreds of protocols and file systems and pieces that can all fit together and be combined into very bespoke artisanal infrastructure. So you can design a
kernel that does exactly what you want it to do. And you don't have to drag along other people's
code unless you want to. Maybe an overly pithy summary of this is your operating system as a
library. Well, so this is an operating system, but what is an operating system? In a normal
operating system, you run a bunch of processes, known as userland, where each process is a failure domain; when something goes wrong, or a process needs resources from the outside world, it goes to the kernel, which manages all of the resources in your system. It manages the hardware and it acts as middleware to give software safe, isolated, and high-performance access to the underlying hardware. So unikernels use a different approach to structuring operating
systems, one known as library operating systems. And this is one where instead of the kernel acting
as a big wrapper around all of your code, it simply is provided
as a set of libraries. So it's no different from any other library that you link to, such as
OpenSSL or some kind of graphics library, for example. The kernel is just another one of those
things. But what you sacrifice is multi-user modes, because if one application is accessing
some system libraries, it needs exclusive access to the hardware. It's quite hard to provide competing or untrusted access to different parts of your hardware stack. So library operating
systems work really well if you're trying to build a specialized application that is maximally using
the hardware at hand. If you just want to have a desktop with lots of applications running, then
you should just use conventional operating systems. It's only if you can benefit from the specialization
that you want to switch into this different mode of operating system construction.
What I like about Mirage OS as an idea is it's so weird. It's hard to know whether it's a research
project or a stunt. And it's also, I think, part of what I think of as a larger story of the
multi-decade long failure of the original idea of an operating system. Back in the day, we had this
idea that what we were really going to do is build multi-user operating systems.
Computers were really expensive, and we needed to share them,
share one big computer among a bunch of people.
And so we built systems like Multics,
and then systems that took inspiration from them,
like Unix and lots of other systems along the way.
And we built all of these abstractions
that were designed to make it easy to share hardware
among multiple people and to do it safely.
And then in the last couple of decades, we have completely and utterly given up on that project.
And things like virtualization and containers are all examples where we're like, no, no, no, that's not what an operating system is for.
Operating systems are for piling up your complicated stack
of all the different components you need to throw together to build an application. And you want to kind of add them up and freeze them in place so you can reproducibly build up this
weird agglomeration of stuff that you've thrown together. The original purpose of actually having
multiple users share the same operating system has basically vanished from the scene. And once you've made all those changes, the idea that instead of all of the traditional
abstractions that we needed when we were separating out different users, maybe we could do something
radically different. That's where I see something like Mirage OS showing up.
That's right. It's an interesting perspective to think that operating systems have been a failure
because what's really happened in the last 20 or 30 years is that we have invisibly added layers that provide the right level of abstractions needed
for that point in time. So for example, in the late nineties, I would spend ages building a
beautifully configured Windows machine because I knew exactly all the registry keys and all the
magic that went into it. But in the early 2000s, I worked on the Xen hypervisor, and the Xen hypervisor started off with a very simple thesis: it is possible to run multiple instances of an operating system that was never designed to share hardware on the same physical machine simultaneously, make sure each one is completely isolated from the other operating systems running on the machine, and also do so with minimal performance overhead.
There was a serious balancing act there.
And so what we did with the Xen hypervisor was don't touch anything in the user space
because you don't want to have people rewriting all of their applications, their Oracle
databases or their SQL servers or whatever they're running. So we scooped out the guts of the kernel.
And normally the guts of the kernel in Linux is what manages the low-level hardware, the memory
management subsystem, the interrupt controller, and the things that map hardware to operating
systems. And with this simple modification, we adopted a technique called paravirtualization. And what paravirtualization did was it just fooled the kernel into thinking it was running on real hardware, but we shimmed in a little layer called a hypervisor, the Xen hypervisor, which then did all the real mapping
to real hardware. It turned out this was extraordinarily effective because we could
take entire physical operating system stacks of tens of millions of lines of code all combined and run them simultaneously in a single physical machine
and make sure that they were all utilized to their maximum potential. So if you had a bunch of
machines all being used 10% of the time, we could shove these into one place. Now this worked out
so well because the notion of a user wasn't someone who's logging into a Windows machine,
but it became the person who's booting up an operating system. And then suddenly, the Xen hypervisor became its
own operating system, and cloud computing and all of these kind of things took off by the mid-2000s.
But they just provided a different interface. And when Mirage OS came along, it was kind of
the leftover portions of the Xen experiment. Xen also, interestingly, started off as a stunt. It was a bet in the Castle Pub in Cambridge that Keir Fraser couldn't hack Linux over a weekend. And then Monday came along and we had the first version of Xen, and then a big team of us continued working on it.
I then spent the next few years at a startup company called XenSource, building all of the support to make it production quality so we could sell the Xen hypervisor as a product, so that we
had Windows drivers and Linux drivers.
And those years were filled full of compatibility woes.
So you have to look at every single edge case and make sure it works perfectly.
And then life just got frustrating.
You just get bored of making other people's code work well in your virtualization layer.
So we had to have some way to test Xen. And so Mirage OS, the first version of it, came along because we built a minimal operating system that didn't have all of the Windows baggage and all of the Linux baggage.
And all it did was exercise the lowest levels of the Xen functionality, the device drivers,
the memory subsystem, and so on. I needed to have slightly more complicated tests.
So with Thomas Gazagnaire, we just linked in the OCaml runtime because we just needed to write
some high-level logic. And then that was running inside the Xen hypervisor as a minimal operating system.
So it was a few hundred kilobytes in size at most.
And then we're sending Ethernet packets.
So wouldn't it be nice if you could just hook up an OCaml library to send TCP frames instead
of low-level Ethernet?
So then I started writing a TCP IP stack in pure OCaml.
And then, you know, once you have TCP, it's a pretty small step to go
write an HTTP stack in OCaml. And then that happened. So MirageOS became this kind of organic
growth of starting from low level interfaces, figuring out what the system abstractions that
we need are, and then filling in the blanks with libraries. So it did start as a stunt. I think all
good systems projects start with a stunt because you're trying to test an experimental hypothesis.
You're trying to show that if we modify the world to be the way we want it to be
with our hypothesis, that it's worth doing. And you need that stunt to show that all of the effort
and all the hard work that goes into productizing something is actually worthwhile. So then the
hypervisor was a stunt just to show that you could just boot three Linuxes on one machine.
And then it, to this day, remains one of the industry's most popular hypervisors. And MirageOS also started as a stunt just to show you could build a credible sequence
of OCaml applications and protocols and compose them together and build something useful. MirageOS
today has tens of millions of daily active users. It's embedded in all kinds of systems that use
the libraries and the protocols in lots of different ways.
And it's invisibly servicing lots and lots of cloud infrastructure.
Yeah, I think it's hard to overstate how impactful the Xen work has been.
It's the foundation on which the entire modern internet is built, right?
The virtualization is absolutely at the core of what an enormous number of companies have done, and an enormous number of different systems have been built on top of it.
There's been a bunch of ways that MirageOS has gotten into
big and important pieces of infrastructure.
One thing I wonder about is,
are you happy with the set of abstractions
that we've started to build up around this?
In some ways, I feel like the stunt-like nature of all of this
shows a little
bit in the happenstance of what we got. A lot of the things that we've ended up building are things
that you could kind of shim in, right? We started off building a big multi-user set of operating
systems and we're like, oh, actually, the abstractions aren't good enough for supporting
multiple users truly isolated from each other. So we started doing this, in some sense, very strange thing where we said,
we know what the right abstraction is: hardware.
Like whatever the physical hardware happens to provide at the bottom layer,
that's the thing that will allow us to take our operating systems
and just port them cheaply to new places.
So let's pick hardware as the new abstraction.
And I find it hard to believe on some level that either of these are
really good choices. If you were to actually start from scratch in a way that's not just like a stunt,
but like a multi-decade long commitment to rebuild the entire world, do you have a feel for what
abstractions you'd actually pick? That's a great question. So Mirage is now 15 years old and we are
never happy with our abstractions.
I don't think there's been a single day where the core team has sat down and said,
we have the perfect set of interfaces that will survive for the next few years.
And it's worth stepping back a little bit to explain why OCaml was the right choice for Mirage OS
and why it empowers this continuous evolution of our interfaces.
In OCaml, you have the notion of modules. And this is one of the
defining features of OCaml beyond being a functional programming language. And what modules
do is that they let you define an interface. And this interface is a series of types which can then
have functions that operate over those types. And that collection is known as a module signature.
And whenever in Mirage OS we are defining some abstract hardware or even a
high-level thing, we define a module signature for this thing. And all that does is sketch out
what goes in and what goes out and how you create things of this module type. But then in OCaml,
you also have this notion of module implementations, modules themselves. And if they satisfy that module signature, then you can apply them in a type-safe way. And you can compose lots and lots of different
module types with lots and lots of different implementations. In Mirage, we have a sequence
of module types which represent the full set of our possible hardware and application level and
protocol level signatures. But then we also have hundreds and hundreds of concrete libraries which
satisfy some of those module signatures. So for example, if I have a networking module signature that just
says you can open a connection and you could read and write from it, we call this a flow in Mirage
OS, then there are several possible implementations of this flow interface. One of them is just a
normal Linux socket stack, which will compile only on Linux. And another one is a full OCaml-based implementation of TCP/IP, which exports the same socket interface, but instead of delegating
the requirement to actually send the network traffic to the kernel, it actually implements
it in pure OCaml. And so in Mirage OS, whenever we're not happy with the lack of some safe code, we go write an implementation. Whenever we're unhappy
with the evolution of some hardware interfaces or virtualization interfaces, we go rewrite our
module signatures. And all we have to do is to adjust our implementations so that they match
the new module signatures. We can do this in an incremental and evolutionary way. And so over the
years, we've learned a ton of stuff. We've seen an evolution of hardware, both in terms of performance and straight line capabilities. We've seen it change in terms of
the security model. We started with just page tables for memory. Now we have all kinds of
trusted encrypted memory enclaves and we have nested virtualization. It's become an incredibly
sophisticated interface there. And then we also have the dimensionality of distributed systems, which is just another
way of programming and abstracting across the failure domain.
So OCaml lets us split up our implementations and our signatures into two discrete halves
and then try to evolve continuously.
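To make the module-signature idea concrete, here is a compressed sketch in the spirit of Mirage's flow interface. The real Mirage signature involves Lwt, error types, and buffer types; this hypothetical FLOW and its Unix-socket implementation are only meant to illustrate the signature/implementation split described above:

```ocaml
(* A hypothetical, much-simplified flow signature: something you can
   read from, write to, and close. *)
module type FLOW = sig
  type t
  val read  : t -> bytes -> int   (* returns the number of bytes read *)
  val write : t -> bytes -> unit
  val close : t -> unit
end

(* One implementation backed by ordinary Unix file descriptors
   (compiles only where the Unix library is available). *)
module Unix_flow : FLOW = struct
  type t = Unix.file_descr
  let read fd buf = Unix.read fd buf 0 (Bytes.length buf)
  let write fd buf = ignore (Unix.write fd buf 0 (Bytes.length buf))
  let close fd = Unix.close fd
end

(* Application code is written as a functor over FLOW, so it can be
   applied, type-safely, to whichever implementation the target
   supports: the Unix sockets above, or a pure-OCaml TCP/IP stack
   when building a unikernel. *)
module Echo (F : FLOW) = struct
  let serve conn =
    let buf = Bytes.create 4096 in
    let n = F.read conn buf in
    F.write conn (Bytes.sub buf 0 n);
    F.close conn
end
```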
And that's why the Mirage project is called Mirage, because our idea was that the Mirage
project would disappear and just become the default way that people programmed systems because our signatures would just become part of
the standard community and part of the standard way that people build things. And we've been
seeing that over the last few years. And one, I think, subtle advantage of Mirage, which is not,
I think, totally obvious to someone who encounters it as an operating systems project,
is you can take a program that was built for Mirage and you can run it with an ordinary operating system.
Your point about one of the ways that you can get network services is to just use the
standard network services on the operating system of your choice.
And the other way is to have a pure OCaml implementation that goes all the way down
and run that inside of a hypervisor, or maybe run it on an actual bare metal server, right?
So there's an enormous amount of flexibility in terms of how you take these things and deploy them.
This may be not obvious if you just think about it as an operating system.
In some sense, it's both more than that and kind of less in the sense that, you know,
as you said, there's a way in which the more you look at it, the more you wonder like what
actually is here.
In some sense, the whole architecture disappears into the background.
That's right. Well, to give you a concrete example of this, right now,
we're really worried about climate change. So we thought we would build a website that is purely solar powered. And one observation about websites, for example, the OCaml Labs website, is that most
people probably only look at the website when it's daytime, right? There's not much machine
access to the website. So we thought, well, what if we had a bunch of Raspberry Pis around the world
that were just solar powered? And so the process of writing this kind of thing
is, first of all, just start writing it in Unix, like a normal OCaml Unix application. And we built
the web server with my colleague, Patrick Ferris. And then at this point, we start measuring the
energy usage. And the energy usage is high because it's running Linux in the Raspberry Pi. And then
it's just taking up more budget than our solar is letting us provide. So then we wrap it in a more constrained Mirage OS interface. So one that
doesn't give you the full access to Linux and all the syscalls and only requires a small file system.
And so this is just an evolution over our existing Linux code. And then suddenly it becomes
compatible with all of the direct unikernel interfaces. And then you can replace the Raspberry
Pi with an ESP32, one of those tiny little 32-bit microcontrollers, and your
energy budget drops dramatically. But obviously your capabilities drop. But I had the luxury of
developing the Raspberry Pi environment, which is a full Linux environment. And then when I decide,
well, okay, my high-level logic is right, I can bisect it and then get rid of the lower half of
the operating system. It's all just done through iterative, normal, pure OCaml development.
It's worth noting as well that anyone can build their own custom kernel.
If you've never done any kernel hacking,
you can still use MirageOS programming pure OCaml
and have a custom kernel that you can boot.
It is really quite dramatic if you think that there's some mystique in kernel programming,
because there isn't.
It's just another very, very large program that is hard to debug.
So I think I have a pretty good sense
of what's to like about this approach.
One advantage is that you get all of the flexibility
that you get out of a powerful programming language
for building rich abstractions. In a kind of kernel environment, you are restricted in various ways
to building abstractions that are, in some sense,
safe via the hardware support that
you have for separating kernel code and non-kernel code. And there's a bunch of constraints about how
you can build that kind of system. Here, you get to use the abstractions very freely. You can build
just what you want. And you can have a compilation process that just doesn't link in the stuff that
you're not using. So you get things that are truly minimal and, as a result, more secure.
So that all seems really exciting.
I have an enormous amount of sympathy for the idea that part of the way that you make
your world better is by extending the programming language.
I think this is a luxury that Jane Street has had over the years.
And I think that in some sense, everyone, whether they know it or not, is enormously
dependent on the fundamental tools they use, including the programming language.
And people mostly think of themselves as being in the position of victim with respect to
their programming language of choice.
They mostly use it and don't have a lot of control over how it works.
But being in a place where you can be in real conversation with the community of developers
that defines the language lets you, when you find really important ways of changing that
ecosystem, actually being able to push that forward, that's a very powerful thing.
It is. And OCaml, in my mind, is a generational language. One of the properties I want from
systems I build is that they last the test of time. So it's so frustrating that a system I
built in the early 2000s, if you put it on the internet today, it would be hacked in seconds.
It would not survive for any length of time. So how do we even begin the discipline of building systems that can last for,
forget a decade, just even a year without having some kind of security holes or some kind of
terrible, terrible flaw? Now, there is one argument saying that you should build living
systems that are perpetually refreshed, but also we should have the hope of building eternal systems
that have beautiful mathematical properties and still perform useful utilitarian functions in the world.
So there's one big downside I feel like I see in all of this, which you haven't talked about yet, which is it requires you to write all of your code in OCaml.
And, you know, I really like OCaml.
You really like OCaml.
It's in some sense not a downside. But if you're trying to build software that's broadly useful and usable and can build a
big ecosystem around it, restricting down to one particular programming language can be awkward.
I mean, just to say the obvious, I would find it somewhat awkward if there's some operating system
I wanted to use and I had to use like whatever their favorite language was and I couldn't write
in my favorite language. How do you think about this trade-off? Totally. Well, first of all,
we must
use multiple languages. It's not really OCaml that is the lure for this notion of generational
computing. It's the fact that there's at the heart of it, a simple semantic that could be expressed
in a machine-specifiable form. And although we have the OCaml syntax and everything at the heart
of it, there's no formal specification about OCaml, but it's obvious that one is emerging and
one can be written in the next certainly five to 10 years.
And this means that once you have a large body of code that has semantics, it has meaning, it's possible to transform it into other languages and other future semantics.
And that kind of self-description is a really, really important part of the reason why I chose OCaml.
It's still possible to compile code I wrote in the early 2000s using the modern OCaml compiler. So I've compiled code I wrote
20 years ago. In fact, it was OCaml's 25th birthday just a few months ago, and I tested out the first
program I could find. It was in my CVS repository, and it compiles fine. But when you want to use
another language, then we just go through the foreign function interface, and it's just like
that process abstraction I talked about. All you have to do is spin up another process, which is another runtime, and you have to
talk to it.
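As a rough sketch of that "spin up another process and talk to it" pattern in OCaml (the echo command here is just a stand-in for a helper written in some other language; it is not anything from the Mirage code base):

```ocaml
(* Minimal sketch: talk to another runtime by spawning it as a separate
   process and exchanging data over a pipe, rather than linking it in. *)
let () =
  let ic = Unix.open_process_in "echo hello-from-another-runtime" in
  let reply = input_line ic in
  ignore (Unix.close_process_in ic);
  print_endline reply
```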
And the industry has made tremendous progress in understanding how multi-language interoperability
should work, specifically through WebAssembly, for example, at the moment.
We have a substrate where modern browsers can run quite portable code.
But more important than the bytecode is the emerging understanding of what it means
to make function calls across languages. And all we have to do is take advantage of whatever those advances
are, and we can link multiple libraries for multiple languages together. So again, it's a
mirage, right? By using other people's advances, Mirage can benefit because all we need are libraries
to build these operating systems, nothing else. Everyone loves libraries. Everyone has them.
That's the only thing we need.
And standards for how they can talk to each other.
One of the things that I think is really important
about programming language design
is that building a good programming language is as much about what you leave out as about what you put in.
And having a set of abstractions
that smoothly work together,
language features that really click,
where it's really easy to use
other people's code, no matter which subset of the language features they tried to use,
and they'll still all hook together. It's hard to build a language that encourages that kind
of simplicity, that embodies that kind of simplicity. And if what you need is now languages
that need to kind of be fully interoperable with each other, there's a degree to which each language has to fully embrace the complexity of the other languages.
And it can get awkward fast.
I wonder if some of the simplicity that Mirage offers would get harder to maintain
in a context where you're trying to have lots and lots of different languages interacting with each other.
It definitely does because you're trying to get end-to-end guarantees.
So one of the big users of Mirage Unikernels is the Tezos proof-of-stake blockchain.
And Tezos is a complicated distributed system with lots of nodes and validators and security
keys flying around. So to build that as a unikernel involves a lot of OCaml code. It's a large OCaml code base, but there's also Rust code. There's been really interesting work on hooking together
the Rust type system, which is based around a borrowing model, so that there's a lifetime model for how long values persist,
and the OCaml model, which is based around garbage collection, where values are collected dynamically.
But this works because typically the Rust code is at the lowest levels of the system.
It's kind of at the runtime part of the system.
So as long as you have a clean layering where you're starting from a C runtime, then you're
moving into the Rust code, which is very unopinionated from a garbage collection perspective, but very opinionated
from a lifetime perspective, and then calling into the OCaml code, things work out pretty well.
We've made tremendous progress in building some really complicated unikernels from a very,
very complicated distributed system, but you have to just make sure you look at your entire language
stack and your dependency stack ahead of time, make sure you understand how they interoperate
at a high level, and then dive into turning it into the unikernel. So it's definitely not a magic wand that you can
just wave and expect the build systems to just work. Another example that we use Mirage for is
in Docker, which is a container management system. And if you've ever used Docker for Mac or Docker
for Windows, then every byte of every container that you're using in your desktop is going through
a Mirage OS translation layer. Because whenever you mount a file system on the Mac, for example,
something has to translate the semantics of your Mac file system, which is APFS or HFS,
into a Linux container, which is a similar looking file system, but actually completely
different under the hood. And so what we did was a very special Mirage build. Dave Scott, David Sheets, and Jeremy Yallop figured out that if you treat one end of a Mirage
compilation target as Linux and the other end as macOS, we can build translation proxies simply by
serializing network packets into the OCaml stack and then deserializing it on the other end and
turning it into socket calls. So now the Mac transparently reconstructs traffic coming out of a container and then emits them on your Mac desktop as normal
Mac networking calls. So a lot of the tricky difficulties of network bridging and firewalls
and all of that stuff just go away. So when you run a Linux container on the Mac, it goes through
Mirage OS and it looks just like a Mac application. When we deployed that in Docker, I think our support calls went down by about 99%. So anytime this software was
deployed in the enterprise, everyone's got some crazy firewall and antivirus software and things
that break some integration of a virtualization stack with your system. Today, Docker for Mac,
you just double click on it, you install it on Mac or Windows, and it's like a background daemon
that just runs in the system with minimal interruption. And that's the user experience we were going for. But it's only
possible because, again, we understood how to interface Go with OCaml, but made sure we did
it in exactly the right order. Then once you deploy it, it's incredibly robust in production.
But you just have to take the time to make sure you understand the lifetime of Go values,
the lifetime of OCaml values, and make sure they can interoperate correctly.
And this is another example of the flexibility of Mirage, right? It's not just an all-at-once operating system that needs to know everything before you run it on bare metal. Like here you are
integrating it as a very carefully designed shim between two operating systems running on the same
machine. That's right. So along the way, KC Sivaramakrishnan joined OCaml Labs to work on multicore parallelism. Hannes Mehnert from Robur and David Kaloper were on a beach in Morocco and they wrote us a TLS stack. And then they did this incredible
stunt where they decided they love Mirage and they'd never talked to me or any of the Mirage
team. And on this beach in Marrakesh, they wrote a complete SSL stack in the wake of the Heartbleed
attack. And then they put up what we called a Bitcoin pinata. And this Bitcoin pinata was in about 2015 or so, I think. They hid 10 Bitcoins inside a unikernel, put it on the internet,
and they left the private keys inside the unikernel. And they said to the internet,
if anyone can break into this unikernel and take those keys and trade those Bitcoin,
we can't deny the fact that this thing has been hacked and you can keep the money.
So back then, I think Bitcoin was worth not very much. But then during the course of the experiment, there was hundreds of thousands of
attacks against the system. And it got on Hacker News and all the social media networks. People
kept crashing the system with denial-of-service attacks. But then like a real pinata, it just bounced back and rebooted in 20 milliseconds, because that's how long a unikernel takes to reboot.
And it was back up again and no one managed to take the Bitcoin. In the end, I think we donated
it to charity because it was growing a bit much.
But it just goes to show how you can assemble all these things.
You can get a community who can then do what they want to do with it and then contribute
back to the whole.
So today, if you use a TLS stack in OCaml or indeed an HTTP stack, you're probably using
one of the Mirage libraries.
There's many, many alternatives, but for a long time, the Mirage libraries became the de facto community stacks that people used.
Right. And I would assume that Mirage in its various forms, maybe Mirage plus Xen
together, are responsible for most of the deployments of OCaml code onto people's
actual machines. How many machines do you think software that you've worked on has now
been installed on? It's a hard question to answer because we're deployed in products.
So there was an OCaml XenStore, which is the management daemon behind Xen, which I believe Amazon used for many years.
So that would cover quite a lot of machines in the cloud.
I can't say exactly how many, but a lot.
And then Docker for Mac and Docker for Windows, I think, was the second most popular developer tool behind Visual Studio Code.
So it's deployed on tens of millions of desktops, for sure.
But then, of course, in the community, you have people like Facebook who have written their front end for their Messenger application in a variant of OCaml known as ReasonML and compiled that to JavaScript.
So that's also, to some extent, deployed, but not deployed in the same way.
That's a good point.
That might be more desktops than all of the Docker desktops combined.
In fact, it kind of has to be.
It does. That would probably address a few billion desktops. But it's a website, right? It's not an application running on the other side.
But our plans right now are even bigger. I'm working on some climate change projects where we need to deploy millions of sensors around the world.
And of course, we're using Mirage to deal with the complicated logic of CO2 sensing and chemical sensing, and deploying it on RISC-V hardware that's quite embedded. So the Mirage journey is just continuing, but down different paths and different use cases. We have in Germany the Robur team deploying
all kinds of different unikernels for the German government. I think they have a contract to
build secure VPN tunnels and lightweight overlay networks. And all of these are unikernels that
are being deployed. So who knows how far it's going to go inside critical infrastructure on the internet in the coming years.
So a thing I've always found striking about your background is you've dug deeply into a bunch of
different areas. You've done a lot of different open source work over the years in various different forms. You've done lots of impactful academic research, and you've been involved in
a bunch of pretty major industrial projects. Can you tell us a bit about how you got
into this whole line of work in the first place? How did you get into computers and
into systems research? Where did this journey start? Well, I'm actually not a computer scientist.
I began my training as an engineer and I actually planned to get into electrical engineering. I was
fascinated by power systems and cars and planes and so on. But then when I was studying in London, I got working on a computer game, an online MUD,
where you could program this game. And it was programmed in a really interesting language
called LPC, which is kind of a pseudo-functional object-oriented language from the late 90s.
And I went to a party. It was known as a MUD meet. And I got drunk and I woke up the next day and I'd
been offered an internship at NASA to work on the Mars Polar Lander. And this was in California, an exotic land far away from grey and dreary London.
So I ended up that summer working on the various bits of infrastructure for helping the Mars Polar
Lander land. And when it finally landed, this was the first time that we had the technology to
live stream the photographs that were coming out of Mars. I was kind of set up, I would say, as the person who set up all the infrastructure for
supporting one in three people on the internet to access a website all at once, because the
world's attention was focused on this landing in 1999. So I rapidly learned how computers worked
and stuff and operating systems and things. And I set up all of these Solaris boxes.
And the first thing that happened was those boxes got hacked. So I put them up on the internet and obviously hackers love mars.nasa.gov
as a domain to control. And so they took them over and I then looked around for more secure
alternatives. And I found this operating system called OpenBSD. And what OpenBSD is, it's an
all-in-one operating system designed with reliability and correctness in mind. And it
used a variety of
security techniques. I wiped all of these expensive Solaris boxes, installed OpenBSD,
and then managed to get the system running stably again. And then OpenBSD was open source. So I
found a few bugs because when you're deploying something as large as that, you can't not find
some bugs, right? And it turns out that I could just send in some patches and they got interested
and they accepted my patches. And this is like some massive dopamine rush because when someone takes your code and incorporates it into this operating
system used by loads of other people, it's an incredible feeling. So I got more and more into
that development. And I ended up going to an OpenBSD hackathon. And these are regular semi-annual
events. And back then it was in Calgary in Canada, because US export restrictions prevented cryptographic code written in the US from being exported.
So I got to travel and go to Canada.
And then talking to Damien Miller, who's one of the core maintainers of SSH, it set me
on the path to thinking, well, how can you start rewriting systems in a more secure fashion?
And then I went back to Cambridge because the Mars Polar Lander crashed straight into
Mars at very high speed.
So all of the infrastructure we set up never actually got used.
Well, it got into CNN
and lots of people looked at our sad faces.
People got to watch the crash due to your hard work.
People got to watch the crash.
We had to wait for like two days
until we decided it had crashed.
So people stopped watching after about five minutes,
but we waited two days.
And then I had to find a new job
because I was so depressed
that all of our hard work had hit Mars at high speed.
And so I decided to go back to Cambridge and do a PhD.
And then I really started my training as a computer scientist.
So during the PhD, I did lots and lots of different projects.
But I started working in the Zen hypervisor.
I started using OCaml in functional programming more seriously in order to build the stacks
that I described earlier.
And then it became this wonderful journey where all of the code I've ever written has
pretty much been open source.
A lot of it's terrible, but it's been included in lots and lots of products.
It's really easy to move between industry and academia
and government jobs
because you're kind of taking your secret weapons with you
wherever you go.
So now it's not like I'm obsessed with OCaml.
It's just the most efficient thing for me to use
to solve any given problem
because I've just deployed it in so many contexts
that if I'm doing anything for building my website
or doing a bit of data processing,
it's just what I reach for.
It's a really fun thing to work with, even after all these years.
And you've talked some about why you think OCaml is a good fit for Mirage and what you're trying to do there.
But OCaml is not a tool that systems programmers reach for early.
How did you end up coming across it in the first place? Well, in Cambridge, OCaml is now taught to first-year students because,
first of all, it's kind of a reset button because most students would come with a background of
JavaScript or Python and they'd have partial knowledge. So we wanted to find something
that's a little bit obscure, but certainly not massively in the mainstream. Secondly,
it's the easiest way to teach the foundations of computer science. So the basics of data structures
and recursion and representations and all the beautiful
logics and proofs that follow from that.
So at Cambridge, there's a long tradition of using ML style languages from standard
ML to OCaml.
So I couldn't help but be exposed to it because of the university environment.
Secondly, it was also the most practical way to do systems programming in the early 2000s.
So there weren't really any other alternatives back then.
You could go for Java, which is very heavyweight. You could go for Perl, which was write-once. It still is to some extent.
Python and Ruby were still very much in their fledgling phases. There weren't many other
compiled languages. So today we have this wonderful spring of programming languages,
but we didn't back then. But languages have momentum as well. So this is a generational
concept to keep going back to. It's not like we're just avoiding other languages, but when you build up such a large code base of OCaml code, it just
gets easier and easier to build and advance it every single day. So it's almost at the tipping
point now where it's easier to extend OCaml with Rust-style features than it is to rewrite all of
our code in Rust, for example, or any other language that comes along. It's easier to go
do a machine proof using the Coq proof assistant
and extract OCaml than it is to do anything else. And so it's this reduction of friction that just
builds up over the years. I understand what you're saying, but I feel like what you're saying is also
on some level objectively false. Meaning you're saying like, well, you know, back in the nineties,
what systems programming languages were there other than OCaml? And I'm like, there was C.
And in fact, that's what everybody used, right? It is not the case that system programmers in general
in the 90s looked around and were like, oh yeah, we're definitely going to write all our systems
in OCaml. No, that's right. If I could go back in time, I would evangelize OCaml not now, but in the
late 90s, because I feel like I missed a window of innovation there. No one had heard of OCaml back
then. And it was just this incredibly productive tool to write Unix-like code. It was just better than writing in C. And this is me
emerging out of writing lots of C code for many, many years. And indeed, writing lots of PHP code
for websites and webmail stacks and so on. But OCaml went through a period of stagnation.
Because like any open source project, if it's not invested in, if it doesn't have a large body of
programmers, then it's really hard to sustain it over the years. So around 10 years into OCaml's life, which is roughly when I
was using it in about 2005, the rate of progress really stalled. And so at this point, we kind of
missed a window where we could have heavily evangelized this to more systems programmers
that didn't have the tools and the right development environment to make it easily possible.
So while we used it heavily at XenSource, it never got picked up by other developers within XenSource because of that lack of tooling.
So we talked some about your background in open source. Some of the work that you've done,
and in fact, that you and I have collaborated on over the years, has been about developing
the open source community around OCaml and helping in part, certainly not just us,
but helping in part to kind of combat some of that stagnation. And part of that was the creation of OCaml Labs. Can you tell us a little more about
where OCaml Labs came from? I can. So when we finished at XenSource, it got acquired by Citrix, and I stayed for a few years happily hacking on Xen within Citrix. Then I went back to academia, and I knew that I had this burning desire to build MirageOS, because everything was set. I had all the code from the previous startups. I had the problem. I had five years of funding. I had this
wonderful research fellowship to work on, but it was just me. And I knew that if I wanted to make
this as big as I wanted it to be, I needed help. And it was help on multiple fronts.
The first thing was that the OCaml development team was incredible. I remember having dinner
with Xavier Leroy in about 2009, and he just said that they would maintain OCaml forever,
but they were struggling with all of the bug reports coming in
and the fact that they didn't have any dedicated staff working on it.
But he said, you know, anyone can work on it,
but why isn't anyone doing it?
I got talking to you, Ron.
And we said, well, why don't we find someone that will help us do this?
And it was really hard to find anyone
who would actually work in the core compiler,
look at bug reports,
and build out tooling because these were all the things that we needed. In the end,
it came to a hard decision. If you can't find anyone else, then perhaps I should do it myself.
And the reason I was really motivated to do this myself was because I wanted this from MirageOS.
So anything I did to improve OCaml would directly leverage and improve MirageOS,
the project I'm really passionate about. So we funded OCaml Labs in Cambridge. And one of the
beautiful things about Cambridge University is that individual staff retain their intellectual
property. It's not owned by the university. And so this meant that working in open source became
really easy because anyone we hired at the university could just write code and there
wasn't any need for any legal agreements or anything with the university. We just released
it. So I'm really, really proud that what we started with, a seed in Cambridge, has now become a diaspora of people all around the world working
in different geographies and different environments, but continue to communicate and share their code
through the open source ecosystem. And I think Cambridge as an institution deserves an enormous
amount of credit for all of this, because this thing was messy and complicated and does not fit in an ordinary way into a kind of
simple notion of academic research. A lot of the work that needed to be done was work about
coordinating open source ecosystems and maintainership work. It's not the kind of
stuff that gets you tenure. Most institutions aren't willing to take it on. And Cambridge was,
and I think it was important to have an academic institution that was willing to do it because OCaml is, in many ways, a deeply academic language.
Its roots and much of the expertise just realistically resides in academic institutions.
There's an enormous amount of connection to various different kinds of real and legitimate research work. We saw lots of exciting things coming out
of Cambridge on that kind of research side that were secondary to this, and all of this other
real infrastructure that was created. We looked around and tried to find various
homes for OCaml Labs, and Cambridge was the place that was willing to do it. It was enormously important that we found an institution that was really willing to partner with us
effectively in doing this kind of work. Another thing that strikes me about the story
you're telling is the degree to which OCaml Labs acted as a kind of effective form of glue. Like a
lot of the work you're talking about, which is important advances in the state of the art for
OCaml, they're not all things that were done at OCaml Labs, right? Merlin was created by some
INRIA undergrads, if I remember correctly, but they were later working with and supported by
OCaml Labs. OCamlformat was just done as an internal Facebook project, and then Jane Street
adopted it and made a bunch of further changes. But it was OCaml Labs that provided the glue to
kind of take it and turn it into a maintained and general purpose piece of software and figure out how to kind of share between the various different contributors.
Dune's another example.
Dune was created at Jane Street for Jane Street's kind of narrow purposes.
And now there's been a really deep collaboration between engineers at Jane Street, including Jérémie Dimino, who wrote the first version of it and runs the team that manages it at
Jane Street and collaborates very closely with OCaml Labs.
And so both the kind of industrial side of that work and the open source side of that
work are well handled and handled by different parts of what is essentially one big team
that's working on
multiple aspects of the problem. That's right. The fundamental value that Cambridge brings is
training, mentoring, and graduation. So graduation is a really important part of Cambridge, where you
leave and you go do something else. And the same is true for INRIA and the universities in France,
where the Merlin developers came from. And I'm particularly proud of the number of people that
have learned and moved on from Cambridge to
other jobs in the ecosystem and succeeded. So Stephen Dolan and Leo White, both of whom have been on this podcast, started off their degrees in Cambridge, did their PhDs there, and have moved into Jane Street, and many other graduates have done similar as well. And it's crucial for the
longevity of a community to have this kind of easy flow of people across jobs because obviously
people's lives change, they can't just all stay working in a university. And Cambridge was extraordinarily
flexible in figuring out how to get people in. So David Allsopp, for example, who is one of the most prolific contributors to core OCaml, is also a countertenor singer in his spare time.
But when I say spare time, it's actually his career. So I had to convince him to come be a
developer here because he was working on OCaml in his spare time while also maintaining his singing career. He successfully juggled both
of those and became an incredible contributor and an incredible singer. But explaining to
Cambridge HR exactly why I was hiring a singer to work in my research group was a challenge,
but they didn't say no. And he's still at OCaml Labs and he's still one of the prime maintainers
many years on. One of the big and long running projects that OCaml Labs has taken on and really driven
is the work towards having a multi-core garbage collector for OCaml and a multi-core capable
runtime.
This is a long-running sore point about OCaml.
You mentioned one of the limitations in Mirage is that OCaml is not multi-core capable, in the sense that you can't run multiple OCaml threads that share the same heap.
This has been a thing that people have talked about for a very long time, and there's been some amount of work on and some discussion about how to get there for many years.
One question I have is, why has it taken so long?
Why has this been such a big and long-running project to add multicore to the language?
A really important part of research is understanding that 90% of what we do is fail. And whenever we started adding multi-core
parallelism to Camel, we were taking an existing ecosystem, an existing semantic for the language,
and just trying to extend it with the ability to run two things at the same time instead of one
thing. And the number of assumptions that break when you do two things at the same time instead of just one thing is incredible. So our first naive attempt was in 2013. We presented our
confident plan for exactly how multicore would go into OCaml, and it got okayed by Damien Doligez and Xavier Leroy. Then a couple of years on, we realized just how many edge cases
there were and the need for a better conceptual core for what it means to
be multi-core. So we went to a Caml Consortium meeting, which was where the industrial users of OCaml a few years ago would present their needs and requirements. And we presented our work to that team and they said, well, look, you can't add this to OCaml without having a memory model.
So without a memory model, which says, this is what happens when two threads
simultaneously access a single OCaml value. Without that definition, it's really hard to ascribe any meaning to multicore OCaml, because what does the program do whenever this situation
happens? So we then had to go off for a year and figure out new theorems. And we came up with
something called LDRF, local data race freedom, which was published at PLDI, a top-tier conference.
But it also crucially resulted, in addition to this nice new theorem, in a clean, well-defined semantics for multi-core parallelism in OCaml. So then we went back to the core development team and we said, hey, here is this clean memory model semantics. They went, yeah, great. Where's the rest
of it? But remember, there's only about two or three of us working on this while juggling many
other things. So we then went off and frantically started writing the garbage collector and making
sure that we could finish off the job. The garbage collector is more difficult than a normal single
threaded one because it has to deal with multiple cores simultaneously wanting to trigger garbage
collections. And you have to make sure that, irrespective of when the garbage collection is happening, the program still maintains type safety. So nothing can ever observably be violated by a garbage collection happening.
And we ended up with two separate schemes for garbage collection, and we couldn't decide
between them. We then had to write a full paper about this. We had to make sure that
we evaluated both sides. And we also had to do this against a backdrop where we could not tolerate
more than a few percent of a
performance hit for old OCaml code. So if you were building a new language, you could
just go ahead and build it and you could build the perfect parallel algorithm because you
have no compatibility to worry about. But meanwhile, we had the entire Coq proof assistant community that said, we're not going to use multicore for a few years, but if we compile our existing code with multicore OCaml, it shouldn't get any slower. And back then we had maybe a 10 or 20% performance hit. So a significant slowdown even for code that wasn't using multicore at all. And after a few years of work,
we got that down to a few percent. So it was almost indistinguishable from noise, because of all the various techniques that we put into the garbage collector and the compilation model to ensure that happened. This was, again, real research. So it got published
in ICFP. We then had to figure out how to present this to the core development team,
get consensus, and then move it forward. I think we have been working on multicore
incrementally since OCaml 4.02. So OCaml 4.02 was where we had the first branch of OCaml for multicore. We're now in OCaml 4.13, which has just been branched. And I think in every version since 4.08, we've put in
a significant chunk of work in order to get towards multicore parallelism. Most of these things are
invisible to the OCaml users. So you at Jane Street have been using different parts of the
multicore compiler that we have upstreamed into lots and lots of different versions of OCaml.
And we've done so in such a way that it totally respects
backwards compatibility.
Because if you don't get it just right,
then we'll end up with a split world
where the multicore OCaml compiler is a new language,
and it won't work with older existing OCaml code.
And that would be a disaster.
So the reason we're so careful in threading the needle
is that whenever OCaml 5.0 lands, it will compile almost every bit of existing code from the last 25 years with a minimal performance hit. It will then allow
you to add multi-core parallelism through this domains interface. And it has one of the best and cleanest memory models of any language. So our research paper on bounding data races in space and time showed that C++ and Java, the two kind of gold standards for their memory models, have disastrous issues, is the best way to put it. So that's the opening of our paper. And we show that with essentially no performance hit on x86, a 0.4% hit on ARM, and a 2% hit on PowerPC, we could make it all work.
So that's a pretty big result. It took a lot of
theoretical computer science, a lot of experimental evaluation, and a lot of implementation. All of
these had to happen simultaneously. It wouldn't have been possible without KC Sivaramakrishnan,
who's worked with me on this project for the last six years. And we've gotten two top-tier
papers out of it. So it's not been a great ratio of coding to papers, but the end result is something
we're very, very proud of. So the story you're telling highlights a lot of the ways in which
OCaml is legitimately an academic language, and that part of the way of moving things forward
and of convincing people to accept a new feature is actually going through the trouble of writing
serious academic papers to really outline the design and explain what the novel contributions
are. And there are some novel contributions. So from a kind of more ordinary workaday systems programmer perspective,
how should someone who is used to the parallelism story in Java think about the advances in OCaml?
How, from a pragmatic point of view, is the coming OCaml multi-core runtime going to be better?
It's only going to be better because it will not have any surprises.
So whenever you use multi-core parallelism in Java,
you have to know a lot of things.
You have to know about the memory model in Java.
You have to understand the atomics
and the various interfaces they expose.
There's different levels of things exposed
in different versions of the JVM.
In OCaml, potentially just because of the young age of multicore in OCaml, we think we just have a cleaner model that avoids a lot of the pitfalls that Java ran into.
Now, one of the interesting properties about programming languages
is that it's very hard to take back a semantic.
So if someone has written some code in it,
there's just a vast number of complaints if that changes,
because it can fail at runtime.
So just by waiting for this long and observing how all the different languages have built their systems, and then doing the research to thread that needle to find the least surprising memory model across all of the hardware deployed today, that's what we have in OCaml. So a Java programmer should find it the most boring experience to do multi-core parallelism in OCaml. They'll just use high-level libraries like Domainslib that give them all of the usual parallel programming libraries, and it'll just work. No surprises. Fast. Do you have like a pithy example of a
pitfall in multi-core Java that doesn't exist in multi-core OCaml? So there's something called a
data race. And when you have a data race, this means that two threads of parallel execution are
accessing the same memory at the same time. And at this point, the program has to decide
what the semantics are. So in C++, for example, when you have a data race, it results in undefined
behavior for the rest of the program. The program can do anything. Conventionally, demons could fly
out of your nose is the example of just what the compiler can do. In Java, you can have data races that are bounded in space but not in time. So the fact that you changed a value can mean that later on in the execution, because of the workings of the JVM, you get some kind of undefined behavior. So it's very hard to debug, because it's happening temporally across executions of multiple threads. In OCaml, we guarantee that the program is sequentially consistent in between data races. It's hard to explain any more without showing you fragments of code,
but conceptually, if there's a data race in OCaml code,
it will not spread in either space or time.
So in C++, if there's a data race,
it'll spread through the rest of the code base.
In Java, if there's a data race,
it'll spread through potentially multiple executions
of that bit of code in the future.
In OCaml, none of those things happen. The data race happens, some consequence exists in that particular part of the code, but it doesn't spread to the rest of the program. So if you're
debugging it, you can spot your data race because it happens in a very constrained part of the
application. And that modularity is obviously essential for any kind of semantic reasoning about the program, right? Because you can't be looking in your logging library for undefined behavior when you're working on a trading strategy or
something else. It's got to be in your face at the point where it happens. Yeah, it seems to me like the core
thing you're talking about is buggy code is easier to reason about. It's enormously important because
almost all code is buggy, like parts of every code base have bugs and problems. And this is why the classic
undefined behavior stance of traditional C and C++ compilers is so maddening because there's a
kind of amplification of error where you make some mistake where you step outside of the standard and
suddenly, you know, anything can happen. I've actually been seeing this happening with my son
who has a summer internship where he's off hacking out a bunch of C code.
And when you make a mistake in C code, it can be really hard to nail it down because the compiler can make all sorts of assumptions and push the mistakes into places where you totally wouldn't expect it.
It sounds like the same thing happens in the context of data races in C and C++ and to some degree in Java. And reducing that just makes
it more predictable and makes debugging easier. So I feel pretty convinced by this story.
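[A minimal sketch of the kind of data race being discussed, assuming OCaml 5's standard Domain and Atomic modules; the iteration counts and function names here are illustrative, not taken from the episode.]

(* Racy: two domains bump a plain ref with no synchronization. The final
   count may come out lower than 2_000_000, but the damage is bounded:
   the program stays type-safe and the rest of it behaves normally. *)
let racy_counter () =
  let counter = ref 0 in
  let bump () = for _ = 1 to 1_000_000 do incr counter done in
  let d1 = Domain.spawn bump and d2 = Domain.spawn bump in
  Domain.join d1; Domain.join d2;
  !counter

(* Data-race-free: the same loop with an atomic counter always yields
   exactly 2_000_000. *)
let atomic_counter () =
  let counter = Atomic.make 0 in
  let bump () = for _ = 1 to 1_000_000 do Atomic.incr counter done in
  let d1 = Domain.spawn bump and d2 = Domain.spawn bump in
  Domain.join d1; Domain.join d2;
  Atomic.get counter

let () =
  Printf.printf "racy: %d, atomic: %d\n" (racy_counter ()) (atomic_counter ())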
It's quite pleasant working in Multicore OCaml when it comes to debugging things because of
this property. So are you brave enough to venture a date by which a mere mortal who installs the
latest version of OCaml will be able to run two threads in parallel that access the same heap?
Well, I can't give you a date,
but I will give you...
Well, I can give you a date.
You can do that today.
So what I did a couple of weeks ago
was to merge the multi-core OCaml working tree that we use,
which is a set of patches against the latest stable OCaml,
into the mainline opam repository.
So this means with one line, you can switch from OCaml 4.12.0 to OCaml 4.12.0 plus domains.
And all the work that the Multicore OCaml team has been doing has been focused around
ecosystem compatibility.
You can just start with your existing projects and you can then start adding in domain support.
And if you're really, really experimental,
we have a future-looking branch which also adds something called an effect system
on top of this patch set.
This effect system is the ability to interpret
certain external events that happen
and just deal with them through
what are known as effect handlers.
So for example, if I'm writing to a blocking network socket,
instead of having to
then use async/await or Lwt or monadic-style concurrency, our effect system just lets another
part of the OCaml program deal with the blocking IO and then resume the thread of execution whenever
it's ready to happen again. So this is highly experimental, but it results in some of the most
pleasant and straight-line OCaml code I've ever written. It reminds me of writing code in the early 2000s when we just used Pthreads and Unix for everything.
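[A rough sketch of the direct-style concurrency that effect handlers enable, using OCaml 5's Effect module; this toy scheduler and the Yield effect are illustrative, not the actual runtime or Eio implementation.]

open Effect
open Effect.Deep

(* A made-up effect: a fiber asks to be suspended so another can run. *)
type _ Effect.t += Yield : unit Effect.t

let yield () = perform Yield

(* A toy round-robin scheduler: run a list of thunks, interleaving them
   whenever they yield, with no monadic bind anywhere in user code. *)
let run fibers =
  let queue = Queue.create () in
  let enqueue k = Queue.push k queue in
  let dequeue () = if not (Queue.is_empty queue) then (Queue.pop queue) () in
  let spawn f =
    match_with f ()
      { retc = (fun () -> dequeue ());
        exnc = raise;
        effc = (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Yield ->
              Some (fun (k : (a, _) continuation) ->
                (* Park the suspended fiber and give another one a turn. *)
                enqueue (fun () -> continue k ());
                dequeue ())
          | _ -> None) }
  in
  List.iter (fun f -> enqueue (fun () -> spawn f)) fibers;
  dequeue ()

let () =
  run
    [ (fun () -> print_endline "a1"; yield (); print_endline "a2");
      (fun () -> print_endline "b1"; yield (); print_endline "b2") ]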
All of these different variants and levels of the OCaml compiler are now available in opam. So depending on how close to mainline the features you want to test are, all of the trees are available for you to try out.
And the next thing we're doing is that we're working on OCaml 5.0. And this is hopefully going to be the release after 4.13, which contains the domains-only patch set. It will expose just
two extra modules that provide you with the ability to launch multiple threads of execution.
After six years of work, it's two modules. But those two modules obviously have enormous power
because you can then use them to spin up multiple threads without having to fork multiple processes and do lots of complicated serialization.
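[A minimal sketch of what that looks like, assuming OCaml 5's Domain module; the workload is made up for illustration.]

(* Spawn two domains that run in parallel on separate cores, then join
   them for their results: no fork, no serialization of data between
   processes. *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

let () =
  let d1 = Domain.spawn (fun () -> fib 38)
  and d2 = Domain.spawn (fun () -> fib 38) in
  Printf.printf "fib 38 computed twice in parallel: %d and %d\n"
    (Domain.join d1) (Domain.join d2)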
And then our plan gets more experimental.
5.0 is solely focused on features that have been approved to go into core OCaml because they've gone through extensive peer review.
Then for 5.1, our plan is to propose the runtime parts of this effect system. This lets us not only express
parallelism, which is what you get in OCaml 5.0, but concurrency directly in the language. So the
ability to interleave multiple threads of control in a very natural way. This is original research
that we just published in PLDI this year on how we made the runtime part of the effect system as
flexible as possible. And again, without breaking any compatibility
with your existing tools.
So it uses GDB
and all of the familiar debugging tools you're used to.
And then later on at 5.2,
we're going to expose that effect system
into the core OCaml language
using something known as effect handlers
and typed effect handlers.
We're doing that in close collaboration
with Jane Street engineers as well.
So this roadmap is multiple years of work,
but the first step, OCaml 5.0,
we'll get into your hands as soon as we can.
But all the trees are in open source
and the way to speed it up is by giving it a try,
trying your applications against it
and giving us bug reports.
So that's the heart of open source
and how you get a concrete date.
Help us to help you.
Question well dodged.
By the way, just to highlight a little point you said there, you mentioned how the domains only version of it is meant to provide the basic parallelism.
And then on top of that, you want to add some notion of concurrency.
In some sense, once you add parallelism, there's some amount of now concurrent execution.
But I guess this reminds me of the old Solaris style.
You have some number of kernel-provided, truly parallel threads, and then you have some kind of micro-thread notion that operates inside of there that's lighter weight. And that's the split that's really being talked about here: you have something like one domain that you'd run per, say, physical CPU that you have,
and then you might have tens or hundreds or tens of thousands of little micro threads that are running inside each domain, and importantly, migratable. So you can take one of these and
pick them up and move them to a different core. So that's, I think, an important part of that model.
It's a really important point. Instead of calling them micro threads, we call them fibers. So these
are really lightweight data structures.
You can have millions of these in your heap.
Resuming them on a different core
is just a matter of writing some OCaml code.
The really nice thing with effect handlers
is that your schedulers,
the things that normally the operating system
would decide to do for you,
like thread scheduling,
are written in OCaml as well.
And so this means that you can write
application-specific logic
for things that conventionally
the kernel would take care of for you.
And the kernel doesn't really know how to do things optimally.
It knows how to do things to cause the least harm.
And so by this kind of domain specialization, your applications in OCaml can get really, really fast.
Now, this should be familiar to you, right?
Because this is the future of Mirage OS. The goal of the effects system is to internalize about a decade's worth of learnings about how to build portability libraries, how to build abstractions and device
drivers. And now we're having the time of our lives rebuilding all of these things in direct
style code using the effects system. So we have a new effects stack called EIO, which is pure direct-style code. Its performance is competitive with Rust and Go and so on. I think it's faster than Go by quite a long way, and it's competitive with Rust. And it uses all of the new features in operating systems: io_uring on Linux, Grand Central Dispatch on macOS and iOS, and the IOCP subsystem on Windows. And all of these things happen invisibly inside the IO
subsystem written in OCaml. But as a programmer, you just write normal straight line OCaml code
and the effects system takes care of all of that for you.
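[A rough sketch of that direct style, assuming the Eio library's Eio_main.run and Fiber APIs, whose exact shape may differ between releases.]

(* Two fibers run concurrently in plain, straight-line code; the Eio
   backend picks io_uring, Grand Central Dispatch, or IOCP underneath. *)
let () =
  Eio_main.run @@ fun env ->
  let stdout = Eio.Stdenv.stdout env in
  Eio.Fiber.both
    (fun () -> Eio.Flow.copy_string "hello from fiber one\n" stdout)
    (fun () -> Eio.Flow.copy_string "hello from fiber two\n" stdout)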
So it's a very, very exciting frontier
for what's coming in OCaml in the future.
And it makes MirageOS code even more Mirage-y
because it's just normal OCaml code that you write
and all of this stuff is being handled
for you in the background
through various effect handlers.
Well, I think that's a fantastic place to stop
as you kind of tie a little bow around connecting Mirage and the most recent work you've been doing in OCaml.
Anil, thank you so much for joining me. This has been a real pleasure.
Thanks, Ron. Fun as always.
All right. Cheers.
You'll find a complete transcript of the episode, along with links to some of the things that we discussed, including Mirage and some of Anil's other research, at signalsandthreads.com.
Thanks for joining us, and see you next time.