Signals and Threads - Python, OCaml, and Machine Learning with Laurent Mazare

Episode Date: October 7, 2020

A conversation with Laurent Mazare about how your choice of programming language interacts with the kind of work you do, and in particular about the tradeoffs between Python and OCaml when doing machine learning and data analysis. Ron and Laurent discuss the tradeoffs between working in a text editor and a Jupyter Notebook, the importance of visualization and interactivity, how tools and practices vary between language ecosystems, and how language features like borrow-checking in Rust and ref-counting in Swift and Python can make machine learning easier. You can find the transcript for this episode along with links to things we discussed on our website.

Transcript
Starting point is 00:00:00 Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky. Today we're going to have a conversation about the use of Python and OCaml at Jane Street with Laurent Mazare. Jane Street is pretty widely known for the fact that we use OCaml, which is a statically typed functional programming language similar to languages like Scala and F# and Swift and Haskell. But the story is more complicated than that. We don't only use OCaml. We use some other languages too. And one that we use that's pretty important is Python.
Starting point is 00:00:38 And we mostly use Python for data analysis and machine learning tasks, but it also extends somewhat beyond those domains. So the topic of the conversation this morning is about how Python fits in with OCaml and how it fits into the broader infrastructure at Jane Street. One of the reasons I'm excited to have this conversation with Laurent is both that he has done a lot of work
Starting point is 00:01:00 inside of Jane Street, both working with Python and on making the Python tooling itself better, but also he has broader experience working with different languages in different contexts, and so has some perspective on this from more than just his time at Jane Street. So Laurent, to start us off, can you tell us a little bit more about your experience outside of Jane Street and how that led you to working here? Yeah, so my first work experience outside of academia was at a small company called LexiFi,
Starting point is 00:01:27 where we were using OCaml to model financial contracts. So in that context, OCaml was used mostly as a domain-specific language to represent complex financial payoffs. I spent a couple years working on the overall infrastructure that we were selling to banks and asset managers. And after that, I started working in 2010 at Goldman Sachs in London as an equity strategist. There, I was using the in-house programming language called Slang, which one can think of as a variant of Python.
Starting point is 00:02:00 It's mostly a scripting language. It was fairly nice to actually quickly see the output of what you were doing. But of course, efficiency was a bit of a concern. I spent a couple of years working at Goldman. And in 2013, I joined Jane Street as a software developer. I worked mostly on trading systems at Jane Street. So back to using OCaml for mostly everything. There I enjoyed the functional programming aspects and the strong type system
Starting point is 00:02:29 when it comes to designing production-critical stuff. After four years and something at Jane Street, I actually left to go work for DeepMind, mostly because I was passionate about machine learning at that point and I wanted to work at the leading edge of the domain. At DeepMind, I used mostly Python to do machine learning with, of course, TensorFlow. And I also got to use Swift a bit as a replacement potentially for Python.
Starting point is 00:02:58 I spent a year working there, and then I came back working at Jane Street, where I focus more on research bits nowadays. So I would say that I spend half of my time working in OCaml and half of my time working in Python. But these two halves are very different. Another part of your background is you've also done some interesting work on the open source side. Can you say a little bit about that? So I tried to be active in the open source community and through that I had the opportunity to work on a couple packages. The most notable
Starting point is 00:03:30 one is probably OCaml Torch, which provides OCaml bindings for the PyTorch framework. PyTorch is this amazing thing developed by Facebook that lets you write deep learning algorithms in Python and leverage your GPU and the power of auto-differentiation. Via these OCaml bindings, you can do that from OCaml too. So you add type safety to the mix. I also worked on Rust bindings for it. So it's kind of the same thing. In this context, you have bindings in Rust. So rather than writing Python code, you end up writing Rust.
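(For illustration, here is a minimal, hedged sketch of what ocaml-torch usage can look like; the exact API may differ between versions of the library.)

```ocaml
(* A hedged sketch of ocaml-torch: build a tensor, do some
   element-wise arithmetic on it, and print the result. *)
open Torch

let () =
  (* A 2x3 tensor of standard normal samples. *)
  let x = Tensor.randn [ 2; 3 ] in
  (* [f 1.0] lifts a float to a scalar tensor; [+] and [*] broadcast. *)
  let y = Tensor.((x * x) + f 1.0) in
  Tensor.print y
```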
Starting point is 00:04:07 I also worked a bit on Clippy, which is a Rust static analyzer. There you try to analyze Rust code and find legitimate errors. And there are plenty of other OCaml packages that I worked on, such as bindings for TensorFlow and
Starting point is 00:04:26 I tried writing a data frame library for OCaml and a couple other things. All of them can be found on GitHub. So this all reflects, you spent a bunch of time working in a bunch of different places using different languages and ecosystems. So you mentioned along the way
Starting point is 00:04:41 that here about half of your time is spent working in Python and half spent working in OCaml. And those two halves are very different. Maybe you can say a little bit more. How are those two different? So yeah, the first half of my time is spent using Python, and that's mostly for research purposes. In that context, I will use Python in a notebook. So for people that wouldn't know what a notebook is, it's a kind of web UI where you have some cells where you input some lines of your programming language,
Starting point is 00:05:11 most of the time Python, and you can evaluate each cell in order. The development mode is very interactive in this context. So you look at the actual output, tweak your code a bit, then evaluate it again. Then tweak your code again a bit because you notice that you've made some mistakes, and start over again and again. So it's some kind of interactive development. What is fairly nice is that you have a very quick feedback loop between editing the code and seeing the actual outcome.
Starting point is 00:05:42 And of course, it plugs very nicely with the very large Python ecosystem of plotting libraries. So you can actually run your algorithm on some data and then plot the output, check that it matches your intuition. If not, try to debug, tweak your code and start over. So that would be mostly what I use Python for. And on the other side, there is OCaml and there it's far more of development
Starting point is 00:06:05 work and production-critical development, I would say, where you tend to build systems that are fairly resilient. What I'm often amazed by is that you write some OCaml job, you kind of spend a week on it, deploy it, and then you come back to it a few years later, and maybe even you've left and come back along the way, and you notice that your job is actually still there and still working, which is pretty amazing. Of course, on the Python side, it's a bit more rare that things will keep on working for a very long period of time. Say more about why that is. What do you think it is about OCaml that makes it easier for writing these kinds of more robust tasks, for catching bugs, all of that? Yeah, so I think there are two aspects actually to it.
Starting point is 00:06:48 One is about OCaml and the other one is about the general engineering practice that we have around OCaml. So it turns out that when you write OCaml code, you usually write proper tests for your things. You think a lot about all the error cases. Again, part of it is because the type system will enforce that. And each function will tell you, oh, actually, I might return a result, but I might also return an error. And you have to think about what is this error and how you
Starting point is 00:07:17 want to handle it. And also your goal is to build something that is a bit resilient. So you also are in a state of mind where you think a lot about corner cases. Whereas in the Python world, it's more about runtime exceptions. So you don't spend much time thinking about what the function actually outputs. And most of the time you use a library for which you don't really know what the function outputs. It turns out that you kind of guess it, try it, and most of the time it works. And you go on with that. You just check kind of interactively that it's what you would expect.
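(To make that contrast concrete, here is a small hypothetical OCaml function, with names invented for illustration. The signature advertises both outcomes, and the compiler makes the caller handle each one.)

```ocaml
(* Hypothetical example: [average] returns a [result], so the caller
   must deal with both the success and the error case explicitly. *)
let average = function
  | [] -> Error "average: empty list"
  | xs -> Ok (List.fold_left ( +. ) 0.0 xs /. float_of_int (List.length xs))

let () =
  match average [ 1.0; 2.0; 3.0 ] with
  | Ok avg -> Printf.printf "average = %f\n" avg
  | Error msg -> prerr_endline msg
```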
Starting point is 00:07:50 I'm really struck by your description here. There's a mix of differences between the two languages. Some of them are fundamental things about the language. Some of them are things about the ecosystem around the language. And some of them are issues about the practices that Jane Street as an organization has around the different languages. So maybe we could just dig into those somewhat separately. Let's first talk about the ecosystem. What is it about the Python ecosystem and the
Starting point is 00:08:15 OCaml ecosystem that make them different? Python, I think, nowadays is kind of the de facto standard for everything that is about machine learning and data analysis. So if you want to be able to plot your results again in a notebook, you have plenty of libraries that do that, plenty of tutorials online that will help you find the right way to do it. And if you run into any kind of issue, again, you can just Google things, you will get some results super easily. There is also a wide variety of machine learning libraries available. All the major modern deep learning frameworks have Python frontends and are kind of released in a Python-first manner. So the API that developers focus on is the Python one. So I don't think that there
Starting point is 00:09:03 originally was any very good reason for Python to be more successful than other languages that are a bit similar in that domain. But it turns out that the more you have an ecosystem, the more it attracts people. It snowballs at some point.
Starting point is 00:09:17 On the other side, when it comes to OCaml, there are a bunch of libraries. I think that in general, the libraries tend to be of a bit better quality, but you have far fewer of them. One way to get around this is that you can bind to other languages. So that's what we did for TensorFlow or PyTorch.
Starting point is 00:09:38 But you can even bind to Python. So we actually have some matplotlib bindings. When you want to plot something in OCaml, this will call Python, which itself will use the very well-known matplotlib library to render the plots. And I think it's really notable that the first thing you talked about when you talked about the ecosystem advantages of Python is you talked about visualization. And I think the visualization part is incredibly important. And I think it's fundamental to why notebooks are such a valuable way of working.
Starting point is 00:10:07 The basic mode in doing research is exploratory. You're going off and trying something and looking at the results and trying something else and looking at the results again. And having a quick workflow that lets you do that and lets you embed the visualizations in your workflow in a straightforward way, I think that's one of the reasons why notebooks are such a compelling way to work. And it's very different from the traditional workflow in which software engineers work, which involves plain text and text editors, and just doesn't have the same visualization component. Indeed. So I would say that these two modes are a bit similar in a way that on both sides, you want to do fast iteration. In one case, you will have this fast iteration in the notebook. So you will iterate quickly over the data,
Starting point is 00:10:49 get some plot out, and get to actually visualize the output of your algorithm, check a bit some corner cases, and iterate again. And in the OCaml world, you tend to do the same thing about fast iteration, but it's about type safety. So you save your file and your build system will tell you, oh, this type doesn't match, this type doesn't match, this type doesn't match.
Starting point is 00:11:10 In the end, you would expect your OCaml code to pretty much work once you've managed to get things to compile. Whereas on the Python side, you don't have these type aspects. So you're kind of relying more on experiments to tell you if things are working properly. I wonder if there's more of this experimental back and forth in the OCaml world than maybe you're giving it credit for. A place where I've encountered this a lot is in the work that we've done on building what are called expect tests. So expect tests, which is a thing that's not all that much known outside of the Jane Street context, although actually it was originally stolen from an external idea.
Starting point is 00:11:45 The basic idea actually came from the testing tools that the Mercurial version control system uses. But anyway, the basic idea of expect test is you have this kind of unified way of writing code and tests, which in some ways feels a lot like a notebook. You write some piece of code, and then you write a special annotation,
Starting point is 00:12:00 which means capture the output of the code up till here and insert it into the text of the code. And then you write some more code and have another place where you capture output. And then when the test runs, it regenerates all of the output and if there are any changes, it says, oh, the output has changed, you want to accept the new version, and then you can hit a key in your editor and load those new changes in. And we use that a lot for all sorts of different kinds of testing, but also for various kinds of exploratory programming.
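(A minimal sketch of what such a test looks like with Jane Street's ppx_expect; the captured output lives inline in the source, next to the code that produced it.)

```ocaml
(* If the printed output ever changes, the test runner shows a diff
   and offers to rewrite the [%expect] block with the new output. *)
let%expect_test "squares" =
  List.iter (fun x -> Printf.printf "%d " (x * x)) [ 1; 2; 3; 4 ];
  [%expect {| 1 4 9 16 |}]
```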
Starting point is 00:12:26 So, for example, a thing that I've done using this is web scraping. There's some bit of web data you want to grab and analyze and transform and pull some information out of. And that's like very exploratory, right? You're often not trying to get it right in a kind of universal way for all time. You're trying to process some particular kind of document. And so you write some transformations and you apply them and you look at the output and see if it makes sense. That said, everything is plain text. One of the big advantages you get out of the notebook-style workflow is the ability to be in a web page and actually display graphical stuff. That's right.
Starting point is 00:13:10 I kind of agree that expect tests give you some kind of interactive experiments. There, I think you focus more on corner cases and hard bits for your algorithm, whereas most of the time, when experimenting in a Python notebook, you're not writing some code that is there for a very, very long time. So you actually want it to work on your data rather than on any potential data. So even if expect tests are a bit interactive, I think that it's still a bit of a different experience. So I'm curious to what degree you think this is an issue of the programming language or an issue of the tools and ecosystem around the language. If for a minute we imagined a world where the tooling for using OCaml in a notebook was all really well worked out, you had all of the nice things you're used to
Starting point is 00:13:57 from the editor integration experience we have with a language like OCaml. So like you can navigate to a piece of code and look up the type of a given expression or navigate to the definition of a given value. All of those things, all of those IDE-like features worked really well. And at the same time, you had all the benefits of using Python in a notebook in terms of the interactivity and the ability to visualize things. And you had all of the machine learning and data analysis libraries at your disposal. At that point, is there still a reason that you'd want to use Python? Is there something about the language itself that makes it a better tool for some of this kind of exploratory research work?
Starting point is 00:14:33 Yeah, it's a very interesting question. It's of course a bit hard to answer, definitely, because you don't know what it would look like if you had everything possible on the OCaml side. Still, I feel a bit that the dynamic aspects of Python have some advantages that you would not have on the OCaml side. A typical thing that you will do in Python is you write your small algorithm. It actually works for integers and you're happy with it, but it turns out that later you actually want to use floats. The fact that in OCaml, to make the type system happy, you would have to modify all the operations
Starting point is 00:15:12 will be kind of annoying to you. Whereas in Python, you would expect things to work most of the time. You will still have this problem that if there are some errors, you might not detect them. But on most reasonable inputs, the checks that you've done should bring you some confidence that it's working reasonably well. And all these dynamic aspects are fairly nice.
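(A small illustration of the friction he describes: OCaml separates integer and float arithmetic, so switching a function from ints to floats means touching every operator and literal.)

```ocaml
(* Integer version: integer operator ( + ) and literal 0. *)
let sum_int xs = List.fold_left ( + ) 0 xs

(* Float version: the same logic, but ( + ) becomes ( +. )
   and 0 becomes 0.0. *)
let sum_float xs = List.fold_left ( +. ) 0.0 xs
```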
Starting point is 00:15:34 Python lets you try pretty much any kind of algorithm on any data, and relying on duck typing, as long as your object implements the proper methods, things will work out without you actually knowing what's taking place under the hood. I guess one of the big upsides of working in a language with a static type system is that types provide a lot of guarantees. Guarantees that you can't get from any single run of the program.
Starting point is 00:16:00 You can't get by any amount of testing because the guarantees of type systems are essentially universal. They apply to every possible execution of your program. So that's an incredibly powerful thing, but it comes at a cost. Getting those guarantees requires you to effectively fit your program in to the structure required by the type system. Like type systems are good at capturing some kinds of rules and not good at capturing others. So there's essentially a kind of negotiation step where you figure out how to play nicely with a type system. And in a language like OCaml,
Starting point is 00:16:32 that actually worked out pretty nicely. The type system is quite lightweight. It doesn't require lots of type annotations. And it's flexible, so you don't have to bend your program too far out of shape to make it fit in and to get the benefits of types. But the nice thing about dynamically typed languages is that you get to skip that negotiation step entirely. If your program works in the particular context on the data that you happen to be running
Starting point is 00:16:53 on, then you don't have to think about how it works in a more general context. And when you're doing exploratory research, writing little bits of code that you get information from, and then maybe never run again, I guess the trade-offs are really different. At that point, being able to operate without types has a lot of appeal. Yeah, so the type system in OCaml will indeed give you a proof that your code is correct and that the code is correct for all possible inputs. But in Python, the default is that you don't get any proof that your code will run correctly; you will only get exceptions at runtime. Turns out that the fragment of inputs that you care about might be very small.
Starting point is 00:17:33 So the time that you would spend looking for corner cases on the OCaml side and fixing them is actually not necessary on the Python side because you yourself know that you won't have that kind of input. Whereas, of course, if you have a type checker and a compiler to ensure type safety, the compiler cannot assume anything about that. So you will have to teach it, oh, actually, I don't really care about this variant nor about this variant and so on. I feel like we should be a little less grand in the sense that it's not that the type system gives you proof of the overall correctness of the program, but it does capture certain elements of correctness and nails them down pretty well. And there's certain kinds of programs where that gets you alarmingly far, where large classes of
Starting point is 00:18:14 bugs totally go away under the scrutiny of the type system. The thing that's also maybe worth mentioning is that type systems like OCaml's actually do relatively little to capture bugs in numerical code. One of the hard things about numerical programs is that when you get the algorithm wrong, things just look a little worse, in fuzzier ways. If you didn't do your linear algebra decomposition just right, then it's not that you get a totally crazy answer. You get like an answer that's less optimal or it doesn't converge as fast or it doesn't
Starting point is 00:18:40 converge at all. And separating out those bugs, finding those bugs, is actually quite hard. And I suspect if OCaml had a type system that was really good at finding that kind of bug, people would be really eager to use it in data analysis contexts. But in practice, the hardest bugs in numerical analysis often just kind of slip by without the type system helping you much at all. Yeah, so indeed the bread and butter of Python is numerical algorithms. And in that case, pretty much everything is a float or an array of floats or a matrix of floats. So having a type system that checks that the types are correct might not buy you that much. As you mentioned, the problems are more likely to be that the float value is not the correct one, rather than that the float is not a float,
Starting point is 00:19:25 but it's actually a string. So in that context, the OCaml type system definitely helps less. You mentioned that you might want some specific type system to catch that kind of error, but that turns out to be quite tricky. Even being able to decide if a float is going to be a NaN or is going to be an actual float
Starting point is 00:19:45 is not a straightforward problem. And yeah, that's the kind of thing that you will actually catch by experiment, by looking at the data that you're taking as input and trying your algorithm. But if you have to handle every possible input, you probably don't want to check for NaNs at every point in your function. It would be too much of a pain. Right. And I feel like machine learning is split between two kinds of work, some of which is all about fast iteration and some of which is all about painfully slow iteration, which
Starting point is 00:20:16 is to say some of the work you do is taking some big model that you want to train and shipping it off to a farm of computers and GPUs and such to do a bunch of heavy lifting and then get the results back at the end. And that seems like a case where you would really, really like to not discover, after like three days of chomping on your numbers, that some trivial bug at the end causes the whole thing to fail and you realize you were just wrong. And so that feels like a case where you'd actually rather have more checking up front if you could get it. Is that like a thing that comes up? And is that something where you think OCaml can be useful in a machine learning context? Yeah, I think that's the kind of place where
Starting point is 00:20:54 OCaml could be nice. Turns out that what you tend to do in Python is you run your model first with very small input sizes so that it runs quickly. You run the whole thing and you check that it's able to write the model file in the end or to write some intermediary snapshot files. But you can still miss some branching in your code or whatnot. And so you don't see that at the end your model will explode and you might just not get anything out of multiple hours slash days of compute. And in that case, you're pretty sad. So having a type system that tells you, oh, actually, here you said that the file name was supposed to be a string.
Starting point is 00:21:36 It turns out that you've given me an int somehow. It's pretty good to be able to detect that kind of failure. Then it raises a bit the question of what the ideal system is. Maybe the actual model should be in Python because it's a numerical bit and you want to iterate quickly on building this model. But once it comes to either production usage or training for real, which is kind of a production thing, then you want an actual infrastructure around it
Starting point is 00:22:05 that is really resilient to lots of things: some computer going down in your compute farm, the file system having issues, and of course silly mistakes that you might have had in your code. Are you suggesting that in this world, you would keep the modeling in Python the whole way through, but just build infrastructure around it in OCaml? Because taking the thing you wrote in Python and rewriting it again in OCaml doesn't sound like an enormous amount of fun. Yeah, and you might have discrepancies between the two things. I think it's a bit clearer even when it comes to productionizing things.
Starting point is 00:22:38 So you write your model in Python, and even if you train it in Python, at some point you're probably going to want to use it on a real life thing. For Jane Street it will mean that you might have a trading system that is actually using your thing and this trading system is likely to be written in OCaml because it has to be resilient to lots of different things. And still you will want to interface with your model at some point. So you have multiple ways around that. You can either make OCaml call Python and use that to decide on the model output, take the value back and process things accordingly.
Starting point is 00:23:16 But you could also just try to replicate everything. That way, things might be faster. You might have more type guarantees. But of course, you have two different implementations. So then you have a new problem, which is how do you ensure that the two implementations stay in line over time? It's not something great to have, but we've done it for a couple of things. And one way we have around that is when you produce a model,
Starting point is 00:23:39 the model outputs itself, that is, the configuration of weights that you've finally reached, as well as some sample values. And on the OCaml side, when you deploy the actual model, you will run with the configured weights and the sample values and check that you obtain the exact same results. So some of what you were talking about there had to do with the tooling that we have for calling back and forth between Python and OCaml. And that's actually an area that you've done a bunch of work on. So can you say a little bit more about what we've done to make interoperability between the two languages better?
Starting point is 00:24:12 Yeah, for that we rely a lot on a library which is called PyML, which is open source and which is very neat. This library is mostly designed so that you can call Python from OCaml. So again, a typical example is you want a plotting library in OCaml, so you want to use matplotlib, which has its own issues, but it has tons of features and it works fairly well. So you just call the Python runtime from your OCaml runtime. But it turns out that you can also use it the other way
Starting point is 00:24:45 around. And it's the kind of use case that we have the most of at Jane Street. When you write things in your Python notebook, you want to be able to access the various Jane Street services, to get the market data, to get actually all the data that you can think of in your Python notebook, and even to trigger some actions or to publish some values to our different systems. So for that, it's more calling OCaml from Python. And again, with PyML, what you can do is compile your OCaml code
Starting point is 00:25:19 into a shared library. And this shared library, you will load it from the Python runtime. You will call the starting point of the OCaml code, which will bring up the OCaml runtime. And the OCaml runtime will register lots of functions for the Python runtime to be able to use. This makes it possible to actually write Python wrappers
Starting point is 00:25:40 around OCaml functions in a way where you almost don't have to write a single line of Python. So for people that don't use Python that much, that's very nice. And also when you want to build functions that are used a lot by lots of different people on inputs that you might not have thought of, you're actually pretty happy to do that in the OCaml world rather than to do it in the Python world. And the key reason it's important to do this
Starting point is 00:26:07 without writing any Python is that there are lots of OCaml developers who are developing libraries that would be useful to Python programmers, but who themselves basically don't use Python at all. And so you want to keep down the amount of work that's required for non-Python developers to export things and make them available to the Python world.
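(As a hedged sketch of what that export can look like with the pyml library: the module name "ocaml_utils" and the function here are hypothetical, invented for illustration, but the calls follow the pyml API as I understand it.)

```ocaml
(* Expose an OCaml function to Python by wrapping it as a Python
   callable and installing it in a module, all from the OCaml side. *)
let () =
  Py.initialize ();
  (* "ocaml_utils" is a hypothetical module name for this sketch. *)
  let m = Py.Import.add_module "ocaml_utils" in
  let add = function
    | [| a; b |] -> Py.Int.of_int (Py.Int.to_int a + Py.Int.to_int b)
    | _ -> invalid_arg "add: expected exactly two arguments"
  in
  Py.Module.set m "add" (Py.Callable.of_function add);
  (* Python code running in the same process can now call it. *)
  ignore (Py.Run.simple_string "import ocaml_utils; print(ocaml_utils.add(1, 2))")
```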
Starting point is 00:26:23 Because if people don't run the Python command line to test things or to write small wrappers on a daily basis, they might not remember how to do it. It's a bit of a pain. Whereas if you do everything on the OCaml side, people in the development area are very familiar with doing that. Great. It's maybe worth noting that this is the kind of thing that we've done in multiple different places.
Starting point is 00:26:46 What you're describing is the ordinary thing that you have in any kind of higher-order language, whether that's a functional programming language or an object-oriented language, where you're essentially shipping around pointers to bits of code. And it's pretty common to have things which, when you stop and think about them, they're just calling back and forth multiple times across these different levels. And then what's interesting about it in a Python context is we're now calling back and forth between two different language runtimes, not just between two different functions in the same language. And we do exactly this for our OCaml Emacs integration. With Emacs, we wanted to write lots of extensions, but we really didn't want to write
Starting point is 00:27:25 them in the native language that Emacs uses for this called Elisp, because people found it harder to test, harder to teach, and harder to engineer. And so we wrote this layer called Ecaml. And now pretty much all of our Emacs extensions are written in OCaml, but you still need to interoperate with the basic Emacs functionality and libraries. So in that sense, it's very similar to the Python story. And you have exactly the same story where you have programs that are constantly bouncing back and forth between OCaml and between Emacs Lisp in that case. And hilariously, we're now doing the same thing with Vim. So you have the same kind of technology getting built over and over.
Starting point is 00:28:03 Yeah, it's pretty amazing when you build that kind of system and you notice the first time that it works. It seems pretty weird that somehow you're able to have layers of Python calling OCaml, calling Python, calling OCaml. And somehow, even if you do that hundreds of times, it ends up just working. So there are some nasty aspects to it. While doing the integration of some OCaml function, we actually ran into a pretty funny bug because of that, which was as follows. So OCaml has a garbage collector. So the OCaml runtime from time to time stops the world and collects what is not alive. Here by alive, we mean the values that can be reached by the runtime.
Starting point is 00:28:44 So it's all the values that are actually useful to the user still at that point, and all the rest is the garbage that can be collected and removed. Whereas in the Python world, you use reference counting. So on each object, you keep a counter of how many times this is used. It has the nice benefit that you can release things as soon as possible, but it has a slightly sad aspect, which is if you have a cycle, then memory is lost. So you still need a garbage collector that you run from time to time to detect cycles and to actually remove them. And the bug that we actually encountered was because you had a cycle between the Python and the OCaml runtime. So you had an OCaml object that was itself pointing at a Python value,
Starting point is 00:29:29 and the Python value was pointing back to the OCaml thing. And that cycle cannot be detected by either of the garbage collectors. Each of them will tell you, oh, actually, that thing is being used. I don't want to remove it. But overall, you could remove the whole thing. And it turns out that this was wasting tens of megabytes because of that. So we noticed it pretty quickly and fixed
Starting point is 00:29:50 the thing. But there are some small drawbacks that you can have because of this. This is exactly the classic problem you get when you have two garbage collectors interacting. And we have exactly the same bug in the Ecaml story. So I think it is a fundamental problem that's hard to fix. So I want to kind of go back to this question of how do we think about Python and OCaml and to ask you, when you're going off to engage in some programming task, how do you think about the choice between whether you want to use Python and whether you want to use OCaml? Myself, I would think that when things are there to last, and it's some code that you want to still be working in multiple weeks or months, then OCaml is very neat.
Starting point is 00:30:32 And also there are specific domains where OCaml shines when it's about manipulating symbolic values. Of course, trying to do that in Python would probably not be that much of a good idea. One good example of manipulating symbolic values is writing a syntax extension. In that case, OCaml's type system does a really good job of helping catch errors in code that's going to inspect your program and generate new code depending on what it finds, and making sure that you don't forget to handle all the different kinds of syntax that show up in the language.
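(A toy illustration of that kind of program-manipulating program: a tiny expression type and a constant-folding pass over it. The type and pass are invented for illustration; the point is that the compiler flags every match that forgets a syntactic case.)

```ocaml
(* A toy expression language. Adding a new constructor later makes
   the compiler warn about every match that doesn't cover it. *)
type expr =
  | Int of int
  | Add of expr * expr
  | Mul of expr * expr

(* Constant folding: a small program that manipulates programs. *)
let rec fold = function
  | Int n -> Int n
  | Add (a, b) ->
    (match fold a, fold b with
     | Int x, Int y -> Int (x + y)
     | a, b -> Add (a, b))
  | Mul (a, b) ->
    (match fold a, fold b with
     | Int x, Int y -> Int (x * y)
     | a, b -> Mul (a, b))
```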
Starting point is 00:31:03 Although I see that the kind of example you gave about writing programs that manipulate programs is both a very good example in the sense that it highlights something OCaml is really good at, but also I feel like in some ways a bad example because I think it understates how often this kind of programming comes up. To try and frame it in a somewhat more general context, I feel like the kind of places where OCaml works really well is where you have combinatorial structure that involves just differentiating between lots of different cases, right? If like there's different ways that your data might be laid out and you want something that helps you do the case analysis and make sure that you capture all the cases, OCaml is incredibly good at that. Like OCaml's pattern matching
Starting point is 00:31:40 facility gives you a way of just exhaustively saying, what are all the different ways this data might be shaped, and making sure that you cover them all. And that for sure shows up and shows up a lot when you're doing compiler-style work or generally things that kind of feel more like program manipulation. But it actually shows up a lot in various kinds of systems programming tasks. Like I've spent a lot of time working on the insides of trading systems, for example. And there's actually lots of places where you want to think in exactly this kind of combinatorial way and where things like OCaml do a really good job of catching those bugs. Yeah, yeah. It's definitely the case. So you mentioned trading systems and indeed it's quite challenging. Building trading systems in Python is not the best idea.
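(The same exhaustiveness helps in systems code. A hypothetical sketch: if a constructor like Partially_filled were added to this type later, the compiler would point at every match that needs updating.)

```ocaml
(* A hypothetical order lifecycle, invented for illustration. *)
type order_state =
  | Pending
  | Filled of { qty : int }
  | Cancelled of { reason : string }

let describe = function
  | Pending -> "waiting on the exchange"
  | Filled { qty } -> Printf.sprintf "filled for %d" qty
  | Cancelled { reason } -> "cancelled: " ^ reason
```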
Starting point is 00:32:22 You were talking before about how we've built tools to make it possible to call into OCaml from Python. And the kind of examples you pointed to are cases where you want to consume data where the primary wrapper of that data is some bit of OCaml code or to publish data, again, through some path that's in OCaml. But I think it's also useful
Starting point is 00:32:38 to be able to invoke computations, pure computations that exist on the OCaml side. I think for two reasons. One is because it's way more efficient, as you've been highlighting, but also because writing the code inside of our ecosystem, writing the code in OCaml means you could write the program once, write the computation once, and then share it on both sides. Being able to have all of the core computational stuff available on the OCaml side, and then being able to expose it in Python is a nice way to allow you to stop repeating yourself
Starting point is 00:33:05 Yeah, efficiency is definitely something that we care about and that will be far better on the OCaml side. And also if some code is indeed to be called by lots of different people, and again, these people might have ways to call it that you would not have thought about.
Starting point is 00:33:24 In Python, it will be pretty challenging because your test suite will have to cover pretty much every single corner case. And still, people might at some point rely on some specificities of your code, wrap around that, which ends up being quite a mess. You don't want to write in Python some code that is going to be queried in lots of different usage scenarios. Whereas in OCaml, you're kind of forced by the type system to think about all these different scenarios and you want to get that correct. And it's not worth the investment for a one-off, but if it's some code that is going to be shared across tons of people, it's probably worth more of the investment. When people think about the difference between dynamically typed languages and statically typed languages, they often think of, well, in statically typed languages, you have the type system to help you, and in dynamically typed languages, you have to write
Starting point is 00:34:16 a lot of tests, which is, I think, not quite the right way to think about it. I think in a statically typed language, you still need tests, but they have almost a kind of snap in place property, which is if you write your program and can show that it works on a few key cases and catch a few of the corners, then the overall behavior tends to snap into place. And you get into almost like wider coverage of your tests by having the type system make the behavior of the program in some sense more rigid so that you can just get by with a much lighter testing story. So you still need to do tests, but you don't have to be nearly as exhaustive in the testing as you do in a language where tests are
Starting point is 00:34:55 the only things that are nailing the behavior of your program in place. So earlier on when we were talking about the trade-offs between OCaml and Python, you talked about how, well, on the OCaml side, we have all this rich testing infrastructure, and on the Python side, there's much less of that and less practice around that. So first of all, I imagine some professional Python developer listening to this conversation and saying, what is wrong with you people? Obviously, when you write Python code, you should have good testing infrastructure. I'm wondering if there's stuff that we should be doing and maybe are in the process of doing to improve the tooling story that we have internally around Python and to make more of a culture of testing the things that we do
Starting point is 00:35:34 write. Yeah, we're certainly learning a lot on the Python side. On the OCaml side, things have been polished over the years and are very, very efficient. On the Python side, we are adding testing, a bit of type annotation, automatic generation of documentation, and so on. And you have to pretty much redo all the things that have been done for OCaml at Jane Street. So we are going through that and trying to focus first on the most important bits. But we're certainly not there. And I can imagine that our best practices are pretty far away from people that are doing lots of Python. I also think that when it comes to actually using Python, not on its own to build your system,
Starting point is 00:36:18 but more as some kind of glue around some OCaml components, the testing story probably has to be less involved on the Python side. You still want your OCaml components to be well tested. You still want your intermediary layer that converts Python calls into OCaml calls to be, of course, properly tested. But you have far less of a possibility of bugs on the Python side because what we ship on the Python side is fairly limited. After that, the user is going to write tens of lines, perhaps a bit more, around it, but hopefully not so much that there would be tons
Starting point is 00:36:59 of possibilities for bugs. Though we're quite commonly surprised about how easy it is to sneak some bugs in a very, very short amount of code. That is the thing about which programmers will never stop being surprised. Yeah, it's pretty impressive. And it's also impressive when you start to rely on Python libraries. So of course, we rely a lot on something called NumPy to represent multidimensional arrays, and also on a data frame library called Pandas that lets you represent matrices where each column has its own type, like you can have a column of strings, a column of times, a column of floats. And this library is
Starting point is 00:37:37 super powerful. It works very nicely. It lets you analyze your data very quickly, but it has very weird corner cases. And you might just notice that the code is actually not doing what you would expect, and learn it the hard way because the only thing you notice in the end is, oh, the error that I get while running my simulation is not what I would have expected. It's pretty bad. So now I have to dig down in my code, annotate things quite a lot, and try to understand why the input and output of this Pandas library is not what I would expect, to finally Google that and discover that it's a well-known gotcha around the library. And that's only for the cases that we know about, where the
Starting point is 00:38:15 error was large enough for us to discover, of course. You have cases where you won't even know about it. This reminds me a little bit of what data analysis is like in Excel. Excel is something like the equivalent of the Python notebook. The program that you write in there is a mix of the little expression language that goes into the cells of a spreadsheet and VBA that you write for manipulating and doing various things that don't look like simple computations that go along with it. And again, it has all the benefits that we were talking about, which is it gives you some ability to visualize the data. This cell-structured way of looking at computations is actually in some ways incredibly powerful. You look at someone who's good at a spreadsheet, they're able to very quickly do all sorts of rich computations in a way where
Starting point is 00:38:59 they can see all the intermediate pieces. And the transformation of the numbers actually becomes much easier to follow because all the intermediate computation is just kind of laid out in a big grid. And also it has a bunch of baked-in, totally terrifying, weird behaviors. I'm reminded of this recent news where some organization in the academic genomics community changed the names they used for various objects that show up in spreadsheets, because I think some of them were being interpreted as dates, which caused all sorts of crazy things to happen in the middle of spreadsheets. Another example that we worried about a bunch when it came out is a stock called True. And there was a bunch of worries of, like, what would happen, because Excel does a
Starting point is 00:39:43 bunch of stuff where, again, related to the fact that it's dynamically typed, it tries to infer what kind of data you have from what it is that you typed in. And this is incredibly helpful in lots of contexts and also absolutely terrifying in other contexts. And because the work that's being done is numerical, when it fails, again, it fails in this kind of soft way a lot of the time, where, like, the numbers aren't quite right and you may just not notice and your result might be a little less optimal
Starting point is 00:40:09 than it should be or give you not quite the real answer. That's the thing that we worry about and are trying to be pretty careful about and defensive about when we write spreadsheets. And there's stuff in the outside world where well-known results
Starting point is 00:40:20 had to be retracted because of subtle bugs in spreadsheets. Myself, I tend to make a lot of fun of Excel because I live in Linux land. I don't like logging on to a Windows computer. I almost don't use Excel at all. But still, when you have a question for someone working on trading at Jane Street and you see them manipulating Excel, it's pretty impressive, because people tend to know all the shortcuts and are able very, very quickly to actually get their data out of Bloomberg, plot it in Excel, basically check that the values are in line with what they would expect
Starting point is 00:40:58 and just send it to you. Whereas if you were to have written OCaml code for that, you would probably have spent multiple hours on it. So yeah, Excel definitely has its advantages. It also has a great reactive model, like having these cells where you modify a cell and everything gets recomputed. And that's fairly nice. Python is definitely kind of in the middle. You're able to do more computations there. It's a bit more efficient and it can handle larger data sets, but you lose a bit of the,
Starting point is 00:41:31 oh, I can basically eyeball all the data at once. So of course, in Python, at some point, what you're manipulating is too big. So you only look at statistics and the extrema, that kind of thing. So it's far less good than what you get with Excel. Also, the Python notebooks lack a nice property that Excel spreadsheets have, where in Excel spreadsheets, every cell is a computation which makes references to other
Starting point is 00:41:56 cells. And Excel keeps track of this graph of computations and how they depend on each other. And that's useful both for performance reasons, right? There's this kind of incremental update. If I change something, it will refresh the computation and just do the minimum that it needs to do to refresh things. So it has a nice built-in incremental computing model, but it's also good for correctness reasons. It can actually refresh all the things that need to be refreshed whenever anything changes.
Starting point is 00:42:21 So you know that your spreadsheet is in a consistent state and not so with a Python notebook. In a Python notebook, you have something similar. You have like little chunks of code, which are almost like the equations, and then you have various visualizations that are interspersed between them, and these chunks of code depend on each other,
Starting point is 00:42:36 Excel tends to be more functional than what Python would be. You have far less of an ocean of state that can be mutated. It's just this graph of computation, and that's actually fairly neat. The big problem with notebooks is what you mentioned.
Starting point is 00:43:03 You can run each cell, and each cell may depend on the current state. And it's very easy to create a notebook state and not remember how you actually reached that state. You might have executed a cell multiple times. You might have executed the third cell before executing the second one. And you would have to redo everything again
Starting point is 00:43:23 if you wanted to be able to go back to the exact same state. So for efficiency reasons, you don't want to keep this big graph and rerun everything. But it's kind of a problem because when you reach some conclusion from your notebook, if you're not able to restart from scratch and reach the same conclusion easily, it's actually annoying. Is there any work in the direction of trying to solve this problem with notebooks, to make it so you could have notebooks that had more natural incremental models, so that you didn't have to choose between efficiency on one side and correctness on the other?
Starting point is 00:43:57 Pick one. Is there any work to try and make that better? Yeah, I think there is a bit of work, but nothing that has the level of adoption of the main Jupyter slash IPython thing. Among the things that can be mentioned, there is an alternative to Jupyter, which is called Polynote, which is made by Netflix, I think. It has the same kind of drawbacks, except that it's far more explicit about the notion of state. So at least you would see the variables that your thing defines. And when you go back to a cell,
Starting point is 00:44:27 it tries to put you back in the state that you had when editing this cell, so only taking into account what was there previously. But of course, because you want to do that efficiently, you don't really handle aliasing correctly. So if you're doing deep mutation inside an object, I don't think that this is tracked properly. Just kind of the first layer of the object is as it was at that point in time. I see. So it tries to bring you back, but it's not a sound method. It's not guaranteed to always work. Yeah, because it
Starting point is 00:44:56 would probably explode, right. Besides that, there is a nice thing, but it's very experimental at this stage, in the Julia world, which is called Pluto. So it will notice all the dependencies. And whenever you edit a cell, it will recompute all the cells that are depending on the result. So this works, I think, only for Julia at this stage, but it's pretty neat. It might be the case that you want to rerun a cell and not rerun things that would be too long to run. So it's a bit of a user interface problem at some point. Yeah, although I guess you could just be explicit about it. You could say, I rerun the cell,
Starting point is 00:45:31 and now if I could keep track and visually notify the user that the following cells are in some sense out of date, at least if you understood, and could expose to the user, which parts of the computation were reliable and which parts were not reliable, that would already move you a big step forward. You mentioned at the beginning that you had spent some time when you were at DeepMind working on Swift as an alternative language for doing machine learning work.
Starting point is 00:45:52 And the way I understand it, part of the story there, I think, is about auto-differentiation. Is that, I guess, maybe you could just say a little bit more about what the ideas are behind using Swift as a language for doing this kind of machine learning work? So yeah, the idea there is indeed differentiable programming. So what you tend to do a lot in modern machine learning is gradient descent. So you have a function that you want to optimize. It has some parameters and you want to find optimal values for these parameters. You have a notion of loss and you want to find the parameters that minimize this loss. And one way to do that is by gradient descent.
Starting point is 00:46:33 So you would just use the Newton method of computing the derivative, except that in this case, the derivative is with respect to lots of different variables. And you follow the slope down until you think you have reached the minimum. So of course, what you discover is a local minimum, right? And for people who don't spend a lot of time doing machine learning, the way I always think about this is: you have some function you want to evaluate, think of it as like a series of hills or whatever, and, you know, gradient descent is just the thing that a rolling ball does, it just goes in the steepest direction. Instead of being at the top of a hill, you'll find yourself at something that looks like a bowl in its shape. And that's exactly a local minimum.
Starting point is 00:47:09 And having the ability to compute the derivative, particularly in a very high dimensional context, where instead of being, you know, a two-dimensional picture like the one I have in my head, you have 200 dimensions or 2000 dimensions or more. Having that derivative there is a very powerful thing. If you were to compute the gradient numerically, that would work with a couple dimensions. But if you have a model that has millions of parameters, you don't want to compute the gradient numerically. So having a symbolic gradient is far better.
Starting point is 00:47:38 And that's where differential programming is actually very helpful. So you can either use a library like TensorFlow or PyTorch. This library will build a graph of the computation for you and do the derivation, the symbolic derivation of this computation graph. But you can imagine pushing that one step further and just doing that at the compiler level. So at this point where you define a function that
Starting point is 00:48:03 takes as input lots of float values and returns a float, you can imagine that automatically you generate the gradient for this function. Because as a compiler, you know that this function is combining the addition with some subtraction and lots of other functions for which you might have already computed the gradient. To maybe make this more concrete for people, the notion of symbolic differentiation is basically the kind of differentiation you ran into in calculus class. You know, there were a bunch of rules that you could apply. You write down some expression, and there are rules of like, you know, the derivative of x squared is 2x, and you know, rules about
Starting point is 00:48:37 multiplication and addition and composition. And it turns out this is actually a very old insight about programming languages; so-called automatic differentiation is 30 years old. It showed up in lots of languages in the past, including ancient implementations in Fortran and Lisp. It gives you a way of just saying, oh, let's just take this idea and generalize it and apply it to programs, not just programs that represent simple expressions, but programs that do iteration and recursion and all sorts of crazy stuff. Yep. And of course, there are lots of challenges along the way
Starting point is 00:49:06 because it's easy to differentiate an addition, a multiplication, and the composition of multiple functions. But you have functions that are far harder to differentiate. You mentioned, oh, I call my function recursively, or I might have an if branch, things that are kind of discrete rather than continuous. And then you have to decide about what is the proper differentiation for that. So it turns out that modern machine learning-based models
Starting point is 00:49:32 combine functions that are all reasonably easy to differentiate. So if you just write a model in a programming language that supports automatic differentiation, you will actually not need the library to compute the gradient for you. The compiler will return you the gradient because you only used operations that were easy to differentiate. And being able to compose that for pretty much all the functions in your program, that might let you optimize things that you didn't think you would be able to optimize.
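(As a worked illustration of the idea: gradient descent on a one-dimensional function, with the gradient written out by hand exactly where an automatic-differentiation system would generate it from the definition of f.)

```ocaml
(* Minimize f(x) = (x - 3)^2. Its derivative f'(x) = 2 (x - 3) is
   written by hand here; autodiff would derive it automatically. *)
let grad x = 2.0 *. (x -. 3.0)

(* Repeatedly step against the gradient, scaled by a learning rate. *)
let rec descend ~lr ~steps x =
  if steps = 0 then x
  else descend ~lr ~steps:(steps - 1) (x -. (lr *. grad x))

let () =
  (* Starting from x = 10, this converges close to the minimum at 3. *)
  Printf.printf "minimum near %f\n" (descend ~lr:0.1 ~steps:100 10.0)
```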
Starting point is 00:50:05 So the work you were talking about is trying to take this idea of automatic differentiation and apply it to Swift. Can you say a little bit more about what kind of language Swift is and why people looked at Swift as the language for doing this? Yeah, so Swift, it's a programming language that originated at Apple as a replacement for Objective-C. And the person who created it, Chris Lattner, later went to work at Google. There he focused on what could be done with modern programming language theory in the machine learning world. And it turns out he tried a few languages.
Starting point is 00:50:40 He tried, what can I do with Swift? What can I do with Haskell? What can I do with Rust? What can I do with Haskell? What can I do with Rust? What can I do with Julia? And it ended up being Swift that was selected to build a prototype, which was probably a good choice because he was leading the project and was very familiar with it. So Swift is a compiled language. It has some very nice functional aspects. So you have the equivalent of type classes, except that it's called protocols there. And overall it feels like you have some types. It feels like a modern programming language.
Starting point is 00:51:14 The sense I get is that it's a mashup between something like Objective-C and something in the Haskell, OCaml, F-sharp kind of world, right? I think the Objective-C stuff was kind of just needed fundamentally to be compatible with the whole iOS world. Yeah, I think that's indeed the case. It wanted to be attractive to Objective-C users and to interoperate with Objective-C systems. But besides that, it has lots of modern features
Starting point is 00:51:42 and feels quite close to OCaml when it comes to the type system. And yes, Swift is compiled; it compiles down via LLVM. And there, the project that people were looking at, and are still looking at, at Google is: let's take some Swift code and, rather than compiling it for a CPU, compile it for different hardware. Let's compile it for a GPU. So we know that in this Swift code that I've written,
Starting point is 00:52:16 I have some big matrix multiplications, additions, and so on. And I don't want to actually target a CPU for that. I want to target a GPU, or even these fancy TPUs you can rent from Google online. So you want the compiler to extract the part of the code that will run on your CPU, extract the part of the code that can run on a GPU or on a TPU, and do the actual data transfer from one part of the system, like main memory, to the GPU memory or to the TPU memory. So that's actually a challenging bit. And the other aspect is having automatic differentiation. So there, you want, again, to be able to compute the gradients.
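The data movement Laurent describes is exactly what frameworks make explicit today. Here's a rough Python/PyTorch sketch of the manual version such a compiler would be automating (device choice and sizes are arbitrary):

```python
import torch

# Pick the GPU if one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The .to(device) calls are the explicit main-memory-to-GPU transfers.
a = torch.randn(1024, 1024).to(device)
b = torch.randn(1024, 1024).to(device)
c = a @ b          # the big matrix multiplication runs on the device
result = c.cpu()   # copy the result back to main memory
```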
Starting point is 00:52:57 And of course, you also want that to be handled on the GPU or on the TPU or whatever your hardware is. So it's up to the compiler to decide. You would just annotate a function and say: for this function, produce the backward pass. Roughly, its gradient.
Starting point is 00:53:15 And the compiler will happily create at compile time the gradient function, provided that all the functions you use underneath and all the constructs you rely on are compatible with that. Is there a reason that Swift is a better language to try this in than Python? Python is, again, very dynamic and very stateful.
Starting point is 00:53:36 Swift is closer, as we said before, to OCaml and to something functional. So you have less of a notion of state. You tend to have more pure functions, and computing derivatives, of course, makes more sense in a world of pure functions than it does once you start having state. Also, the compiler has far more information
Starting point is 00:53:59 about what the function does. It doesn't mean that it's not possible to do in Python. Actually, a framework like PyTorch lets you annotate Python functions with just-in-time information that tells the PyTorch framework to try to compile the function, and compile its backward pass, from the actual Python representation of the syntax tree. One advantage that Swift also has, not with respect to Python but with respect to OCaml, is that Swift actually uses reference counting rather than a GC. And that turns out to be pretty important, because when you're training a big machine learning model, one of the challenges is to free the memory as soon as possible.
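The annotation Laurent mentions here is PyTorch's TorchScript JIT. A minimal sketch, with an invented function for illustration:

```python
import torch

# @torch.jit.script compiles the function from its Python syntax tree;
# TorchScript can then optimize it and derive its backward pass.
@torch.jit.script
def scaled_tanh(x: torch.Tensor, alpha: float) -> torch.Tensor:
    return alpha * torch.tanh(x)

x = torch.randn(3, requires_grad=True)
scaled_tanh(x, 2.0).sum().backward()  # gradients flow through the scripted code
```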
Starting point is 00:54:41 Your model is allocating huge matrices, or tensors in general, on the GPU. They can take multiple megabytes or even gigabytes. And your GPU only has, I don't know, 24, perhaps 48 gigabytes of memory if you have a very, very fancy GPU. So being able to release the memory as soon as possible
Starting point is 00:55:03 is very much worthwhile. And that also explains the success of Python when it comes to machine learning: the fact that it does reference counting helps with that. That also explains why Swift is actually a bit better suited than OCaml for that kind of work at the moment. The same goes for Haskell: I heard that Haskell was getting linear types, and linear types are another way to reclaim resources as soon as you can. Yeah, that's right.
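A rough sketch of why reference counting matters here, in Python with PyTorch; the model, batch, and optimizer are placeholders:

```python
# Assumes model, batch, and optimizer are ordinary PyTorch objects.
def training_step(model, batch, optimizer):
    output = model(batch)          # allocates big tensors on the GPU
    loss = output.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # CPython's reference counting frees `output`, `loss`, and the saved
    # activations as soon as the last reference drops, here when the
    # function returns, so the GPU memory comes back promptly. A tracing
    # GC would reclaim them only whenever it next happened to run.
```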
Starting point is 00:55:29 In fact, we're looking at doing very similar things in OCaml. We've actually, shockingly enough, hired someone where a big part of their goal is driving the work on algebraic effects. Not to go into too much detail, but algebraic effects are another kind of type-system-level feature which, among other things, will make it possible to build things on top of them that look more or less like management of resources. And in many ways,
Starting point is 00:55:52 lets you capture some of the same resource management that languages like Rust have. And I think this goes back to the basic fact that garbage collection is great for managing memory and managing the one big resource of the shared memory that your whole program uses. But garbage collection is a terrible way of managing other kinds of resources.
Starting point is 00:56:10 A classic example that you don't need to think about machine learning to understand is file handles. Sometimes people will set up their programming language so that files get closed when the garbage collector gets rid of the file object. But that's terrible, because it turns out you have some shockingly low limit on the number of open file descriptors you can have, and your program will just crash because suddenly you can't open files anymore,
Starting point is 00:56:30 because your garbage collector didn't feel like it had to work that hard to collect memory, and therefore it wouldn't collect this other, completely different resource of file handles. So for things like file handles, or GPU memory, or various other kinds of external resources that aren't just another chunk of memory on the heap, you really want something else. Yeah, I feel that it would be amazing to have that kind of possibility in the OCaml world at some point.
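Ron's file-handle example, as a small Python sketch (paths invented):

```python
# Relying on the collector to close files is fragile: nothing forces it
# to run before the process hits its open-file-descriptor limit.
def leaky():
    handles = [open(f"/tmp/scratch_{i}.txt", "w") for i in range(100_000)]
    # On most systems this dies with "Too many open files" long before
    # any collector would have gotten around to closing them.

def safe():
    for i in range(100_000):
        with open(f"/tmp/scratch_{i}.txt", "w") as f:  # closed deterministically
            f.write("data")
```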
Starting point is 00:56:55 And like every scarce resource, indeed, you don't really want the GC to be accountable for it, and you want it to be reclaimed as eagerly as possible. So I mentioned that I worked on PyTorch bindings for both Rust and OCaml. And of course, the Rust one is very nice because you get all this borrow-checker magic that will ensure that your tensors are released as soon as they can be. Whereas on the OCaml side, in most of the code that I write using PyTorch,
Starting point is 00:57:37 I force the GC to trigger on every training loop, because I want the resources to be collected. So being able to say that some variables should be tracked in a more precise way is something that would be fairly neat. It's obviously ongoing work, but I'm pretty excited about having OCaml be a system where by default you have a garbage collector, but where you can, in the focused places where you want precise resource management, have it checked and enforced at the level of the type system. And my hope is that that will end up being a system which gives you a lot of the things that are most attractive about Rust, but is overall more ergonomic, because you don't have to do the extra work of thinking about explicit tracking of the memory except in the cases where
Starting point is 00:58:13 you really need to do it. But obviously this is future stuff, things we're hoping to get to; it's all vaporware at this point. Having a world where the default is that you have reference counting or a GC, but where for some resources you have proper, precise tracking of the memory, would seem like the ideal thing to me. Yeah, and if I remember correctly, Rust actually started the other way around. The earliest versions of Rust, I think, did have a garbage collector baked into the core system. Over time, it got removed and garbage collection became a library that you could add on. One thing to say about all of this is that this is, in some sense, cutting-edge stuff, and there
Starting point is 00:58:51 is a big design space and I think we as a community of people working on programming languages are just starting to explore it. Yeah, and there are some ways around it in the language, but having it properly supported inside OCaml would be neat. Well, thank you for joining me.
Starting point is 00:59:10 That was a really fun and unexpectedly wide-ranging conversation. Thank you for having me. You can find links to some of the things that we talked about, including some of
Starting point is 00:59:20 Laurent's open-source work, as well as a full transcript of the episode, along with a glossary, at signalsandthreads.com. Thanks for joining us, and see you next week.
