ACM ByteCast - Leslie Lamport - Episode 16 (Special Episode in Partnership with the Hanselminutes Podcast)

Episode Date: May 27, 2021

In this episode of ACM ByteCast, our special guest host Scott Hanselman (of The Hanselminutes Podcast) welcomes 2013 ACM A.M. Turing Award laureate Leslie Lamport of Microsoft Research, best known for his seminal work in distributed and concurrent systems, and as the initial developer of the document preparation system LaTeX and the author of its first manual. Among his many honors and recognitions, Lamport is a Fellow of ACM and has received the IEEE Emanuel R. Piore Award, the Dijkstra Prize, and the IEEE John von Neumann Medal. Leslie shares his journey into computing, which started out as something he only did in his spare time as a mathematician. Scott and Leslie discuss the differences and similarities between computer science and software engineering, the math involved in Leslie's high-level temporal logic of actions (TLA), which can be used to specify and check distributed algorithms such as solutions to the famous Byzantine Generals Problem, and the algorithms Leslie himself has created. He also reflects on how the building of distributed systems has changed since the 60s and 70s.

Subscribe to the Hanselminutes Podcast: https://www.hanselminutes.com/

Links: Time-Clocks Paper; Bakery Algorithm; Mutual Exclusion Algorithm

Transcript
Starting point is 00:00:00 Hi, this is Scott Hanselman, and on behalf of the Association for Computing Machinery, this is ByteCast, and we're doing this as a special relationship between ByteCast and the Hanselminutes podcast that I host every week. So you'll find this episode on both of those podcasts. If you're listening to this on Hanselminutes, I would encourage you to check out some of the great podcasts over at ByteCast from the ACM at acm.org. Today, we're talking with Dr. Leslie Lamport. How are you, sir? Just fine, thank you.
Starting point is 00:00:33 It turns out you and I work for the same company. I got to look you up on our internal directory. I work on C# and .NET over in the developer division. And I'm over in building 99 in research, except I actually work in California. So I actually work remotely in Portland, Oregon. My entire time at Microsoft has been remote. Have you been remote the entire time as well? Well, there used to be a lab,
Starting point is 00:00:59 Silicon Valley lab of Microsoft research, and that was closed. So now I'm at home, mostly. Are you finding that to work out pretty well? I'm personally finding it quite awkward, and I'm not really enjoying the video call lifestyle. Well, I miss having a laboratory that I can be around and talk to people and socialize with colleagues. Since I tended to work a lot by myself, it doesn't make that big a difference in terms of the actual work that I'm doing.
Starting point is 00:01:37 But it is certainly a social change in my life. One of the things that I'm focused on in my job at Microsoft is on making new developers feel welcome. As I was doing research on your work and thinking about and reflecting on my own, I realized that I've got about 30 years creating and writing software. If it's okay to say you're coming up on somewhere around 60 years of creating and thinking about computer science problems and making software. I don't know, sometime in the late 70s or early 80s that I said, gee, I seem to be a computer scientist. I'm not sure what I was before that, although I did start out just programming in my spare time or in part-time jobs while I was in school. I went and read your first paper that you wrote as part of Bronx High School in 1957 at the Mathematics Bulletin. So it sounds like you were a mathematician until you woke up one day and you were a computer scientist. Was that a bright line or did it just
Starting point is 00:03:01 happen? Well, when I got my PhD, which was in 1972, I wound up with the choice between continuing what I was doing, which is what I guess I would now call research at Massachusetts Computer Associates, or going off and teaching mathematics. And for somewhat random reasons, I decided to keep doing what I was doing with computers. is a clear fork in the road that is telling them to either turn left and go into computer science or turn right and go into software engineering and the practice of software. Do you think that that is a clear fork in the road for people entering school? Well, it's not that clear a fork in the sense that a lot of people in computer science seem to lot of academics, seem to go off and work in industry. It's less common for someone in industry to go and become an academic, but that happens as well. So it's not an irrevocable decision, but certainly it is a decision whether you're going to be working at a university or a research laboratory or as a software builder.
Starting point is 00:04:30 But you do think that there are different things, though. Like I remember learning computer science and then parts of it, parts of the theory and the compiler theory and the operating systems theory, I left behind as I started thinking about how agile groups work together and how projects and how humans and the squishy parts of computers work together. And I'm trying to understand that relationship between the kinds of work that happens at the mathematical level and then at the computer science level. And then when the bits actually happen to do their thing in the processor. Well, my view, and it's a view from a distance, since I'm not very close to academia, is that the universities are making that split in their education. And that as a result, people who go off building computer systems never really use the computer science that they should, even if they learn that computer science
Starting point is 00:05:35 in a classroom, because they may have learned the computer science but not really had to apply it in practice. Yeah, I wonder sometimes where that line should be and whether someone can be an effective engineer with a incomplete perspective on computer science. And sometimes the analogy I use is, does one need to be a mechanic to be a successful user of a service like Uber? If I'm driving a car, how much of a mechanic do I need to be? Do I need to be able to disassemble the engine? Do I need to talk to you about the theory behind the internal combustion engine? I'm just trying to drive to the store.
Starting point is 00:06:18 I don't think that's a very good analogy. Well, how's about this? If I'm going to be an electrical engineer, how much mathematics do I need? Do I really need this calculus stuff that I've learned in school? I think you do need it if you're an electrical engineer. You know, math is a very basic part of engineering. And not being able to think mathematically, you know. I don't think you'd make a very good engineer. On the other hand, you have software engineers who basically think that mathematics is irrelevant to what they do. But I don't think one can be a very good software engineer without
Starting point is 00:07:06 the ability to think mathematically. When you say mathematically, though, there is a spectrum of math. And the way that it's taught in most countries is as you kind of go from algebra and geometry and trigonometry and then up through calculus and then into advanced math, at some point, people, lay people stop. And that's as far as they went. And not everyone gets a PhD in math or gets even into graduate level math. When you say think mathematically, what are we talking about? Like basic algebraic proofs or Boolean algebra? Let me put it this way. I use math. I mean, to say that I use math is a separation as if that now I'm doing something, so, oh, I'll start using my math. It's like math is part of me. And everything that I do
Starting point is 00:07:56 and have done in computer science is done with having math as part of me. The- Math as part of me. Math as part of you. Yeah. Now, you see, if you're talking about... But all the math that I use, by the way, almost all of it, I learned in high school. The actual subject matter that I use. I mean, I have a PhD, but in terms of the actual mathematical stuff that I use, I think I learned most of it in high school. But you've probably learned it after you took something called a discrete math class or some basic math class in the university.
Starting point is 00:08:39 So math, it's not the subject matter, the body of knowledge in mathematics that is what's important. It's the ability to view the world in terms of abstractions. And that's what mathematics is all about. It is abstracting the real world into some kind of mathematics. Or I wouldn't even say mathematics. I mean, we all abstract the world into some mental concepts. Not having mathematics sort of built into you as a way of creating mental concepts limits your thinking ability. Okay, so seeing if I can paraphrase a little bit and understand, I hear a lot of people right now trying to teach young folks, even children as young as 10 or even younger, how to code. But one of the things that I've personally advocated for is less focusing on how to code
Starting point is 00:09:51 and more focusing on how to think about systems, how to think about layers of abstraction, encapsulation, and how the world works as a system, and how to debug problems where there is a layer that is hiding something from you or layers that go many layers deep. And then I tell them that the coding part will come later. The syntax of the text is really not coding. Yeah. The concept that should be taught is that the universe has a state that evolves with time. And the idea that you throw a ball, well, the state of the universe is changing in such a way that the position of that ball is changing with time as it goes from
Starting point is 00:10:57 my hand to your hands. That's a really great way to think about computer systems. But what makes computer systems different is that we can view them as instead of a continuum of states, as a sequence of discrete state changes, where you execute one step of a program, then the next step, then the next step, having that as a fundamental concept to build your understanding of a program, what a program is, gives you that mental model that is useful in so many contexts of computing. When the final executable code gets run, when everything happens and finally turns down into machine code and the software is run. It's usually an overwhelming amount of detail. There's a huge amount going on. But in the things that you've worked on, the systems you've worked on in the
Starting point is 00:11:55 formal specification language of TLA and TLA plus, it's really quite precise. Do you think that modern computer languages are too verbose and too detailed and exquisite in their expression of what's supposed to happen? Well, if you're chopping down a tree, considering what a complicated thing there is, there are thousands of leaves on this tree, each one's following its own trajectory as you chop the tree and the tree sways and the vibration goes up to it. How do you keep track of all this enormous complexity in the world? Well, if you're a lumberjack, you have this wonderfully simple abstraction. The tree is sanding there, I chop it,
Starting point is 00:12:38 and then the tree is on the forest floor. Bingo, one step. Well, you have to be able to view what's going on in those billions of gates inside of your computer as some higher level abstraction. And the code, you need to be able to view it in terms of higher level abstraction. You can't keep those millions of things that are actually going on inside of your code in your head higher level abstraction. You can't keep those millions of things that are actually going on inside your code in your head at one time. So that's what computer languages do. They hide a lot of that.
Starting point is 00:13:13 Well, all computer languages, starting from machine code, hide all that complexity that's going on in the circuitry. And then higher level languages hide a lot of the complexity that's going on in the machine language. And I think as programming languages evolve, people get better at, I think, probably better at building compilers more than building languages. Languages get more abstract and easier to use. But we're a long way from being able to write a web browser and 100 lines of code. And the web browsers that we have, we certainly don't know if they're correct. I mean, I can drop a web browser control on a form,
Starting point is 00:14:00 but I'm sitting on top of thousands and thousands of layers of abstractions, all assuming that everything's being handled correctly. I think in the example of the tree, we can pretty much count on the universe and math and God and the creator to pretty much handle all of the state machines happening inside of the tree. So the tree is either downed or the tree is up, but we're placing a lot of trust in the underlying abstraction layers. And what do you do when something 35 layers down has a problem?
Starting point is 00:14:28 There's no easy answer to that because, as I say, our software has grown by evolution, not by intelligent design. That is a very good analogy. It's not an analogy. It's a statement. Pardon me. other things and nobody knows exactly what they do in those cases. When you need complete reliability, basically where people's lives depend on things, you pretty much have to build everything with intelligent design, which means you have to start not very much above the actual hardware level and verified by some reliable method that each piece does what it's supposed to.
Starting point is 00:15:31 Each layer does what it's supposed to. And that's not the way most code is written. Indeed. Fortunately, most programmers don't write life critical code. So the TLA, temporal logic of actions, logic that you developed in TLA plus and the things that have been used against different services is all based on math and logic and being able to express the intention and the correctness of something, but it sits on top of an imperfect thing. So you've got a near perfect concept of that's based on math, trying to describe what we know to be an imperfect thing and then find its correctness properties and find the
Starting point is 00:16:17 issues in the system's design. Does it assume that there's always going to be some bugs we can't find or that it finds the bugs and then we work around them. I'm trying to understand the relationship between the perfection of math and the obvious imperfection of computers in general. TLA is not a magic bullet that's going to solve all your problems. What happens is that when you decide you're going to build a system, you start figuring out what it does. If they're careful, and I think any good engineer understands that, you don't just sit down in front of a machine and start writing code as the first thing you do if you're going to build some system. You have to decide what it does, and you have to decide at a very high level how it's going to do that. In the way most systems are developed, people think as hard as they can and try to see all the problems in their design, but they'll miss one. And not discover it, you know, some corner case that they didn't think of. And if they're lucky, they'll catch it in testing.
Starting point is 00:17:32 And then they've got tens or hundreds of thousands of lines of code. And they've got to fix that problem in this really complicated mess of a thing. What TLA does is it forces you to write that higher level conception in a completely precise manner and being able to test it and find bugs at that level. And so when you're down at the code, you're just worried about coding errors. You're not worried about design errors. That's what TLA is about. Why do you think that even now in 2021, in the 2020s, that we are still, as software engineers, doing a fairly ad hoc approach of things that might go wrong. We make a list of all of the different wrong states and all the different, oh, this could go wrong and
Starting point is 00:18:30 I'll have to catch that. And this could go wrong and I'll have to catch that. While TLA is more of an express and specific, this is what will go right language. When I started working on fault tolerance back in the late 70s, what I learned is that the right way to do things for fault tolerance is to think about what has to go right, not what can go wrong. And I don't think I've ever written that down anywhere or any place. And so I was surprised and delighted when somebody said, an engineer, it was an Amazon engineer who said, thing about TLA is that it makes you think about what has to go right rather than what can go wrong. It just seems to force you to think the right way in that respect.
Starting point is 00:19:35 Sometimes what can go wrong is unclear, it's confusing, and you get false signals from your system telling you that things have gone wrong. And in the 80s, you wrote about the Byzantine generals problem. What should someone know about that today? I suppose the question is, what should who know about it? That's a great clarity there. The average programmer writing his web server doesn't need to know anything about it. People who are building distributed systems or fault tolerance systems of any kind need to know anything about it. People who are building distributed systems or fault tolerant systems of any kind need to be aware of it. Whether they need to know how to tolerate Byzantine faults, that's an engineering decision, but they should certainly be aware of the
Starting point is 00:20:20 possibility and need to make an engineering judgment on whether that is likely enough to occur to mean that that has to be considered. I'm curious if you remember that the two generals problem was a thing that was a thought experiment and that was thought about and created in the 1800s and the Byzantine generals problem, was that postulated by you and some other mathematicians? Like, where did the name come from? Because that term Byzantine has not just means the empire, but it also means something that is overly intricate. And sometimes it's used in the context of something being a fairly labyrinthine and Baroque solution. How did that term,
Starting point is 00:21:00 the Byzantine generalsist problem, come up? Very simple. When we were doing the work, I realized that this was important and people should know about it. And I learned actually from Dijkstra and his dining philosophers problem that a cute story makes something popular. And so I created the story of the generals trying to come to some agreement and decided that there had to be some nationality of generals because there was actually what I think what you might be calling the two generals problem I knew was the Chinese generals problem. And I decided I would just take a country that nobody would get offended by. And I chose Albania because Albania was basically a black hole in those days. It was a communist regime that even closed itself off not only from the West, but also from the Soviet Union. And so I figured there wouldn't be any Albanians around to object. So I originally called it the
Starting point is 00:22:12 Albanian generals problem. Jack Goldberg, who was my boss at the time, said that we really shouldn't use that because there are Albanians around and you shouldn't offend them. And so I've just somehow, then the name Byzantine came to my mind, and it was obviously the perfect name. Indeed. I mean, the term Byzantine being associated with something being complicated and confusing and intricate. No, not confusing or complicated as underhanded or what's the term? I'm blocking on the word for... There was a sense of secrecy and spycraft associated with it as well. So today, the concept is that a component can be inconsistent and unreliable, and it could present different systems to different observers.
Starting point is 00:23:00 And when you were thinking about this, which really applies today in our very, very complicated distributed systems, were there a lot of distributed systems of sufficient complexity, or was this still a largely theoretical exercise? In one sense, it was purely theoretical. Well, no, it's not purely theoretical. Well, let me give you the history. I developed an algorithm that my time clocks paper of how to implement an arbitrary distributed system by basically having the components of the system collaborate to simulate, to execute a single state machine in which every component could issue commands, the algorithm would guarantee that everybody would execute the commands in the same order
Starting point is 00:23:54 so they'd have the same state machine. And then I knew that any system could be described by a state machine. Whatever you wanted to do could be described by a state machine, which was perfectly obvious to me but apparently not obvious to many other people. So I realized that this was a general approach for building distributed systems, but there was no notion of fault tolerance in it. And so I decided that I should find an algorithm that used that same approach, implementing a state machine, even with processes that were faulty. There was no idea of what a fault's doing processors. So I
Starting point is 00:24:33 just seemed natural to just assume that a faulty processor could do anything. And I came up with an algorithm that essentially was the algorithm in the Byzantine generals' papers that used digital signatures. When I got to SRI, I found that people there were building a distributed fault-tolerant system for flying airplanes. And they had realized that they needed to handle arbitrary failures if they wanted the kind of high reliability that needed for software that the computer had to fly the plane otherwise because the pilot wasn't able, of abstracting the problem as the consensus problem. Namely, instead of thinking about implementing a state machine, thinking about the problem of agreeing on what the next command of the state machine should be. The algorithm I had used in my algorithm basically did, in fact, implement a consensus algorithm,
Starting point is 00:25:54 implemented consensus on each command. So it had solved the consensus problem before it was sort of pulled out as a concept by itself. So as you see that my thinking about arbitrary failures, both predated and was inspired by practical considerations. Speaking of the practical, how different in your mind are the distributed systems that we create in a cloud like Azure or AWS now versus the distributed systems of the 60s and 70s? A lot of your papers talked about inter-process communication and communication on different processes on different machines. Has a lot changed in 40 or 50 years, or is it just simply a matter of we've just turned the dial up to 11 and it's just a matter of scale? I suppose it depends on which work you're talking about, but the kinds of algorithms that I was considering, for example, in that first Time Clocks paper, a cloud system is
Starting point is 00:26:56 not fundamentally different. It's just instead of I was thinking in terms of a half dozen computers, people are thinking in terms of a thousand. And so, you need different algorithms, but, that was very prescient. That was very thoughtful of the elders whose shoulders that we stand upon that they thought about that. Do you ever look back and go, huh, that is a surprise. I'm glad that that worked out so nicely. Or is it different? And then you just feel constantly affirmed in the early work was correct. It's not something I think about very much. It's not a question that I've ever asked myself, at least in that form. I mean, there's one thing that's funny in the sense is that the malicious generals was a metaphor, if you like, saying that we're just not sure of what kind of failures could actually occur. And I actually thought that, well, digital signatures that I used,
Starting point is 00:28:07 I actually regarded them more as a metaphor because I believe, and I think I'd still believe it, although nobody has ever done it, I believe that building the digital signatures for dealing with random faults should be much simpler than building digital signatures to handle malicious people trying to forge signatures. There was nothing difficult about building digital signatures. And in those days, the computers were a lot slower and the digital signature algorithms were very, very expensive. And so people regarded the digital signature, the algorithms that involved digital signatures as being impractical. engineering solutions that would guarantee that the probability of a signature being forged because of a hardware failure could be made small enough that you didn't have to worry about it, and you could then get an algorithm based on digital signatures would become not very expensive.
Starting point is 00:29:27 But nobody ever went that way. Despite the advantage of the digital signature algorithm, in particular, let's see, it's right, they require fewer processors. Yeah, I believe that's the case. It's been a long time since I've thought of this stuff. Now what's happened is we're dealing with failures that really are malicious because there are bad guys out there trying to mess up our systems. So what started as a metaphor of malicious generals has turned into reality. Yeah, I think that the initial internet, the ARPANET, didn't really consider truly malicious actors, did it? No, and I believe that the ARPANET or early internet was crashed when some node became malicious and was giving out inconsistent routing information, bad routing information.
Starting point is 00:30:26 I think what had happened is it sort of told everybody else that the best rate, the best path to everything was through it. And so all messages got sent to that node, which ignored them. As we get towards the end of our conversation here, my last question for you is, what work are you working on now that gets you excited to wake up in the morning and move forward? What makes you happy right now? At the moment, I've just, and this is for a few, you know, the past few weeks, I discovered a, what seems to be a new distributed mutual exclusion algorithm that's not particularly useful, but I think is really nice, but that has two properties that I think make it interesting.
Starting point is 00:31:14 I thought it had two properties that made it interesting. One is I thought it was actually an implementation of the bakery algorithm. And the other is that I thought that the algorithm from my Time-Clocks paper is an implementation of this distributed mutual exclusion algorithm. Doing this work, I've realized that the algorithm is not a refinement or an implementation of the bakery algorithm, but a refinement of some algorithm, some more fundamental, quote, algorithm that the bakery algorithm is also a refinement of. And that I still believe that the algorithm from the Time-Clocks paper is an implementation or refinement of this distributed mutual exclusion
Starting point is 00:32:07 algorithm. So I think that's really neat because I and I think a lot of people had the feeling that those two algorithms are related in some way, since the bakery algorithm uses, each process maintains numbers that can grow without bound. And those numbers are used to order the sequence in which processes enter their critical section. And the distributed state machine algorithm uses logical clocks, which are numbers that increase without bound and are used, from the date of those algorithms that come across, you know, a precise relation between them, I think is pretty cool. That is pretty cool. Yeah, that was in 1974 when the first paper was written on the bakery algorithm. I'm curious, as I'll sneak one last question in as we exit here, how do you come up with these? Are you walking around your, how do you come up with these?
Starting point is 00:33:25 Are you walking around your neighborhood? Are you sitting quietly with your fingers steepled in front of you? Are you staring into the long distance that come to you when you're in the shower? Very few things just sort of pop up spontaneously into my brain. Some things may, if I'm thinking about a particular problem, for example, just the other day, I came up with a generalization of Peterson's algorithm to multiple processes. There is something that is a generalization to it, but this is a different one. And that came about because, you know, I was thinking about mutual exclusion, but this didn't come out of a vacuum. I'm writing a web tutorial on PlusCal, which is an algorithm language that's based on TLA. looking for an example and someone suggested an example, well, actually the algorithm from the
Starting point is 00:34:28 application from my Times clock paper of using that general state machine approach to implement a mutual exclusion algorithm. Then I looked at and I thought, you know, maybe that would be a nice example for the tutorial, but Cytogen could be made simpler. And when I started simplifying it and eliminating as much as I could from it, I came up with this distributed mutual exclusion algorithm. Sounds like it's an organic process. And I think that the myth of the Eureka moment is just that myth. Well, no. I mean, Archimedes didn't think of, according to the story, didn't suddenly come up, gee, how could I weigh a piece of gold without destroying the gold? The king allegedly, supposedly said, how can I tell whether this crown is really made of pure gold? So, that eureka moment didn't come out of thin air. It came from Archimedes having a problem to solve. And almost everything I've done, I can sort of point to something and say, what, not necessarily what inspired the solution, but what got me thinking about the problem.
Starting point is 00:35:42 And then there's sometimes there's been a eureka moment where it really took a good idea. And actually, in my work, it seems that more often than not, somebody, you know, there was some particular problem. And I realized that this particular problem was a special case of a more general problem that nobody had ever thought of before. And that my contribution was that more general problem. Well, we very much appreciate you and your work and the things that you've offered us as both software engineers and computer scientists. Thank you so much for chatting with us today. Well, thank you. It's been fun. This has been a dual episode of ACM ByteCast and Hansel Minutes.
Starting point is 00:36:33 We'll see you again soon, and I would encourage you to check out both podcasts for great interviews with great and interesting people. Thank you very much.
