Algorithms + Data Structures = Programs - Episode 192: Systems Programming & More with Kevlin Henney
Episode Date: July 26, 2024
In this episode, Bryce chats with Kevlin Henney about systems programming and more.
Link to Episode 192 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)
Twitter: ADSP: The Podcast, Conor Hoekstra, Bryce Adelstein Lelbach

About the Guest
Kevlin Henney is an independent consultant, speaker, writer and trainer. His software development interests are in programming, practice and people. He has been a columnist for various magazines and websites. He is the co-author of A Pattern Language for Distributed Computing and On Patterns and Pattern Languages, two volumes in the Pattern-Oriented Software Architecture series, and editor of 97 Things Every Programmer Should Know and co-editor of 97 Things Every Java Programmer Should Know.

Show Notes
Date Recorded: 2024-07-11
Date Released: 2024-07-26
Kevlin Henney ACCU 2024 Talk

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
Like I can dream in Xeon Phi assembly.
Welcome to ADSP: The Podcast, episode 192, recorded on July 11th, 2024. My name is Conor, and today, with my co-host Bryce, we chat with Kevlin Henney in part three of our five-part series about systems programming and more.
Here's a fundamental question.
Sorry, the dog was throwing up this morning, which is why I was tasked with keeping her in my eyesight.
Ah, right.
I was just assuming, given that we're all on Teams here, for those of you listening in podcast land, and I always thought it was compulsory: when you're on Teams, or Zoom, or whatever, there has to be an animal. And I just thought, yes, okay, Bryce decided it's my turn, here is the meeting dog.
Usually I would do that for a few minutes. But in this particular case, the thing is, she's so quiet. If I didn't sit here on the bed with her, she would wander off and potentially throw up again.
And then I wouldn't know about it.
And then I would get in trouble with mom.
So you got to stay here with me.
I know you would like to go elsewhere, but sorry. Anyways, the fundamental question.
Fundamental question.
Does software get better with time?
Or sorry, do we get better at software over time?
Are things better now than they were and will they continue to get better?
I think that's actually more than one question, interestingly, although you're offering different perspectives on it. I think collectively, as individuals and as developers and development organizations, we do get better. I think there are a number of things that are taken for granted now, things that people just do, that they never realized were a struggle, that they never realized were a thing to worry about. In fact, digging around, I did a talk at the ACCU conference in April, basically on data abstraction. I wanted to focus on the history of that, but also to bring it up to date.
One of the things I looked at was Barbara Liskov's work. I've been a fan of Barbara Liskov's work for many years, in terms of language design, the CLU language, and her original work on data abstraction.
But I'm nerdy like that.
So I already knew this stuff,
but I saw a couple of her talks, and she said that when she was given the Turing Award, I think in 2008, one of the things a critic online said was: why is she getting an award for this? Everybody knows this stuff. And they know it because she did it. Prior to 1973, 1974, data abstraction was not a thing. And although there was an idea of object orientation around, her work on abstract data types had the greatest influence, particularly on statically typed languages, on the stuff that Bjarne later did, and on some of the evolution of it.
And everybody just regards it as like, yeah, yeah, of course.
It's always been like this.
It's like, no, it hasn't.
Somebody had to do this.
And so people are able to build on a lot of features in languages, in libraries, in ecosystems, but also in the way we disseminate knowledge.
I think, you know, okay, this is slightly egotistical from the developer's point of view. So, collective egotism is a thing.
A lot of things have happened as a result of developers wanting to communicate.
The internet, email. I mean, literally, it is developers trying to talk to other developers.
How do we communicate? Usenet, and therefore, ultimately, all forms of sharing of data.
Open source, the idea: oh, guess what, I've got a piece of code; I wonder if there's a standard way we could share that so others can take advantage of it. How can we share our knowledge? Well, tell you what, let's put this in a place that we can all get to. It was FTP servers, and then it was the World Wide Web. It's just that we're very good at solving some of these problems. Our ability to disseminate knowledge is actually really good.
So I think in the 2020s, we are definitely building on a lot of benefits that somebody from former eras would go like, wow, you know, you guys have really got it together.
But when they look closely, they might also be disappointed: yeah, you're a bit better, but not as much better as we thought you could be, given all of this. So I think that's the thing.
I think that we have a benefit, but also we are not in a position where we're making the best of
it. And then what's the statistic? I don't know. It varies. But half of all developers have been
developing for less than a year. It's something like that, you know. So the influx of people is huge.
And then trying to communicate this, if we were slightly older and stuffier, and that was the whole of the demographic, then actually we might have a different story.
But the point is that we've got a lot of people who are coming into development, and they don't have the backstory.
They don't have the history. They don't see all of this or understand why such a thing is a big deal. They're learning it for the first time, and so most developers are on the earlier part of their learning curve. And I think that might characterize our profession,
perhaps in a different way to many other professions
where there's much more of a steady state.
I think we're operating much more in a power law.
So that's what may prevent us from taking full advantage
of all the benefits that we actually have created.
But all of that, all of this progress, all of this greater interconnectedness, it comes with downsides too. Take one: security. In this new, bold world where we have all this open source code and everything's connected, when everything's on the internet, the internet itself opened up these attack surfaces to almost all digital technology. And things like... hey baby, it's okay, somebody's at the door, presumably... things like open source and package managers have opened up supply chain attacks. That's a whole new thing that we have to think about that we didn't have to deal with before.
Yeah. And I think that's an important thing, because... I'm going to have to... this may be my fridge; they were supposed to deliver a new fridge yesterday. And there's a whole story about why we have to get another new fridge, because the fridge that we had was a new fridge, but
I'll be right back. Talk amongst yourselves. So I'm going to pick up the point that Bryce
made there. We've seen it on the people front as well. You know, there's a social implication here as well. When you start
increasing connection, there's a whole lot of benefits. There's a lot of really good network
effects, but it's also an amplifier for other things and it creates new possibilities. And security is one
of them. And the language or the focus on security in modern development has shifted hugely, even in
just the last five years. And that's not going to change, but that comes as a consequence
of connecting everything together. But the assumptions that we have are really interesting.
So yeah, do you remember I mentioned I worked in the electricity industry? And that was in the
1990s. And one of the things that we were working on was SCADA systems: supervisory control and data acquisition systems that sit in substations and monitor stuff.
Every electricity network has these things.
OK, so you've got all of that.
And we were developing one of these.
And we were literally communicating between substations and kind of headquarters
using wet pieces of string.
It was kind of like modem dial-up speed.
Wet pieces of string.
Wet pieces of string.
Modem dial-up speed, okay?
So therefore, for us, having a compact representation
for our wire-level protocols was really important
because we worried about bandwidth in a way that people now simply wouldn't understand. Right. And so we worried about that; it was one of our main things: okay, it's got to be really compact. And then the question of security came up, and I remember pretty much the decision was: well, we probably don't need to worry much about security, because nobody would be stupid enough to put their fundamental infrastructure onto the internet where it could be attacked. We were already aware of worms and other issues, and it was just like, yeah, but you wouldn't put the electricity grid online, would you? No, of course not. You'd make sure of that. Yeah. Now, that's charming from the modern perspective. But also,
it does betray the assumptions that we had because our driver was, well, the minute we start using
any security, then that's going to involve encryption, and that's extra stuff. If we were to use secure sockets, then actually we lose a whole load of bandwidth by doing that.
So, having made everything really, really compact, we'd immediately lose that and there's a whole
load of issues that we'd have with that that were performance related. Now, we took those decisions consciously.
And hindsight suggests, you know what? It turns out people do put this stuff online. The Stuxnet worm was an example of the fact that, yeah, you can disable another country's
nuclear power stations via the internet. Oops. Consequences of connectivity: that connectivity went even further than people anticipated, and there are all of these issues. But it does mean everybody now has to worry about security in a way that they were only paying lip service to even five or ten years ago. It's one of those things where everybody's concerned: your language has to be safe; top to bottom, the whole stack, we have to know what's going on. Because
the fact is that most of your software build is not yours. It comes from other places. Who knows
what's there? And we're seeing that that has changed the landscape of what people are wary of.
But it's also changed the way they build. And yeah, you're going to get one with the other. But I think that is an important consequence. And we're going to see that for pretty much any language.
Again, if I enter the language space today with a new language,
you can bet, you know, 20 years ago, nobody would have said,
so tell me about security and memory safety of your language.
It's not likely that would have been a big issue.
But now that's going to be one of the first five questions they're going to ask.
Hey, you've got a new programming language.
Tell me about this.
How do you do that?
So the priorities have shifted, which obviously puts some languages in a better position to take advantage of that, or to be able to say, yep, already sorted.
Other languages, oh, we have to do a bit of catching up. But that has now become such a concern that if you were creating a new language that you wanted
people to take seriously, you have to answer that question almost by the time you've read the first
half of the landing page for that language. You've got to have answered that question.
Well, it's interesting. Yeah, it's exactly that: the priorities and also the constraints have shifted. You know, if you think about it, in the 70s, even if the power company had had the money, they may not have been able to get the sort of connection between substations that the programmers would have wanted, because the technology may not have existed for that.
Whereas, computing started off in this era that was
very heavily resource constrained. Today, we are much more cost constrained. And that cost can be
either the actual dollar cost for compute. Like, okay, we have an unlimited amount of compute.
We can get as much from AWS as we'd want,
but it'll cost you.
And the cost can also be on the power side.
Like, okay, you could do a whole lot of compute,
but if you do a whole lot of compute,
you're going to drain the power on your phone real fast.
And so it's much more efficiency, rather than resource constraints, that limits us. And then also,
we have these safety, security, and reliability concerns that are much more of a priority than they used to be. And I mean, maybe all of the things that we deal with today, like the cost and the energy concerns and safety, maybe those would have been more of a priority back in the day, but they just weren't, because there were bigger problems, like: oh, we only have this much memory. We only have this much bandwidth.
So, yeah, I think that kind of speaks to something else.
We're always going to hit some kind of constraint.
And I would also argue that power is itself a resource.
If anything, it's the original resource.
And I love the fact that the mobile phone, I've got a Samsung.
It's got better battery life than my previous Samsung.
And we're now actually still not quite at the levels that my Nokia had in the mid-90s. In other words, in the sense of: how long can I go without recharging?
You know, we hit different boundaries at different times. And as you said, memory was an issue. In fact, there have been some really interesting cases where we've ended up with this curious mismatch. We saw it with the PC at one point: suddenly you've got all this memory, but it's not addressable because it's a 16-bit system, and you have to do weird things. Then we kind of got over that, and then in the 2000s we hit the issue of: well, yeah, I've got loads of memory, but it's only 32-bit addressable. And then it's a case of, well, maybe it's actually cheaper to run things across multiple machines. So you've got the Hadoop kind of approach: yeah, let's just break this computation up and scatter it across the network. And then it's a case of people saying, well, actually, it's always faster if you can put it on the same machine. We've got 64-bit addressing.
So I've worked with one team. They'd hit the 32-bit limit, even though their machine clearly had a lot more than 4 gigabytes. And so we had to solve the problem with various optimizations: compacting, really messing up the C++ data structures, to kind of go, oh, we're going to save a bit here and a bit there, so that when we multiply it up to large data, so not big data, but large data, it will still fit in memory. The other option, which one of the other systems they built used, was: we have to run it across separate processes, because that's the only other way we can take advantage of the memory. You hit 64 bits and suddenly you can address it, but now we start hitting the other limits. Each time we do something, you hit the other limits. And to go back to connectivity, we hit light speed as an issue. It's one of those things: it's 20 milliseconds between London and New York.
So if you've got a trading system and somebody says, oh, we need to have New York and London in sync to within 10 milliseconds, you need a Nobel Prize, because you are not going to break that limit. This is one of those laws you don't get an option on. You can't open up a config file and say, you know, I'm just going to change the speed of light today, I think it should be faster; well, let's reboot the system with a faster speed of light. And this light limit is genuine in the sense of the systems we're working with. I'm not just talking space travel and stuff like that, where it messes with a whole load of protocols and timeouts: you can't have a TCP/IP connection to the probes on Mars, because you'll have timed out. But even if I've got something like a geosynchronous satellite, that incurs a round trip of over 100, 150 milliseconds. And that's really slow, noticeably slow, for certain classes of system, and even for phone calls. So we're always going to build a system, and then we're going to hit the limits. And as you say, there's power. We're hitting light-speed limits on a number of things, and that actually proves to be impractical.
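For a sense of the numbers behind that light-speed budget, here is a rough back-of-the-envelope sketch; the distance figure and the fibre refractive index are illustrative assumptions, not values quoted in the episode.

```cpp
// Back-of-the-envelope: can London and New York be kept in sync to within 10 ms?
// The distance and refractive index below are rough assumptions for illustration.
#include <cstdio>

int main() {
    constexpr double distance_km   = 5570.0;                // approx. great-circle London to New York
    constexpr double c_vacuum_km_s = 299792.0;              // speed of light in vacuum
    constexpr double c_fibre_km_s  = c_vacuum_km_s / 1.47;  // typical optical fibre

    const double one_way_vacuum_ms = distance_km / c_vacuum_km_s * 1000.0;  // ~18.6 ms
    const double one_way_fibre_ms  = distance_km / c_fibre_km_s  * 1000.0;  // ~27 ms

    std::printf("one-way, vacuum: %.1f ms\n", one_way_vacuum_ms);
    std::printf("one-way, fibre:  %.1f ms\n", one_way_fibre_ms);
    // Even the theoretical vacuum figure already blows a 10 ms synchronisation
    // budget, before any routing, switching, or serialisation delay is added.
    return 0;
}
```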
Yeah, and it's not just on communication. It's also on how many transistors, how densely can I pack transistors. We're starting to reach the point with process technology where there's just not a lot of headroom for us to make smaller transistors. We've started to reach a scale where just the physical limitations, the quantum effects, are becoming... it's just like, no, you cannot build smaller transistors.
And that's the point, and therefore we have to, as it were, squeeze the toothpaste in a different direction: multi-core, etc. And I think that's the point: every era is going to hit its own limits and regard something as normal and given.
So I mentioned Hadoop earlier on, the whole idea of the MapReduce
architecture, the idea of pushing stuff out into the network, just parallelizing the task: we break it up into small tasks and do that. And I highlighted something to somebody when I was running a workshop for a company. Where was it? About 2017. And I made an observation to somebody. I said, okay, so we could break this up and look at this. And then I said, look, somebody just wrote a script to solve this problem. And they did it in memory. It was based on a blog post from 2014. They did it all in memory, whereas historically they would have used a Hadoop solution, and they'd have said, okay, we're going to use a compiled language and we're going to spread it across the network. Instead they just used a scripting solution, and because it was all in memory, it was over 200 times faster than the so-called fast, optimal solution. And I said, that's simply
down to things like the speed of light and the cost of the network. And I had one person in the
workshop say, oh, that's because our architectures are better now, we know more, going back to this idea. And I said, yeah, that's not it. Actually, you couldn't have done this solution 10 years in the past.
It was not available to you because you wouldn't have been able to address that memory.
It simply wasn't available to you.
But he said, yeah, we know that we're doing good architecture now. And I said, I want you to write a letter to yourself. Send yourself an email in 10 years' time, and describe to your future self what your current architecture is. And then your future self will laugh and say, oh, you thought you had it all worked out. That's the problem: some things are improving, but for other things we're just chasing the horizon. We're always going to be chasing the horizon; the shape of the problem space moves. And what one generation thinks... so, going back to the fact that a lot of developers are coming into the industry now: they think there's a whole lot of things that are normal. And in 10 years' time, they're going to go, well, that's not normal anymore. Or, what I assumed was normal is now weird or antiquated. And I just dug out this wonderful quote from Douglas Adams. He said: I've come up with a set of rules that describe our reactions to technologies. Anything that is in the world when you're born is normal and ordinary and is just a natural part of the way the world works. Anything that's invented between when you're 15 and 35 is new and exciting and revolutionary, and you can probably get a career in it. Anything invented after you're 35 is against the natural order of things. And the point there is that this is also a social thing, and it causes older, grumpy people to go around going, oh, in my day... just like, oh, this is unnatural, this is newfangled and it's not necessary. But being inside the technology space, I think this is also quite interesting. It probably operates on smaller timescales.
A lot of things that people regard as a given, somebody else had to fight for.
But also, at some point in the future, you're going to have to recognize it as either part of the furniture, very much a given, or actually no longer relevant.
Not because we've progressed, but because we've kind of moved up and sideways.
We're constantly moving up and sideways. If the problem space had remained fixed,
we'd always be moving up, but we're moving up and sideways. We're hitting new boundaries. We're
hitting new constraints. Every time we solve a particular problem, somebody creates software
in a different way, or we use software in a different way. Going back to the beginning of this, where we talked about connectivity: the fact that we're all connected changes the very software we're trying to create, but it also changes the problems we're experiencing. Had we stuck ourselves in the early 90s kind of mode, then actually that would all be solved and very stable. But we didn't just move up; we moved sideways as well.
You know, not necessarily for CPUs yet,
but it'll come: every modern GPU is a multi-chip module. And by that, I mean it is separate little chips that are stuck together on some interposer. Everybody remembers NUMA systems; they used to be all the rage back in the day in HPC, before GPUs. You'd have a four-socket or an eight-socket system connected with some crazy interconnect. You'd have this flat memory space in your process, but in reality, there'd be some memory that some chips could access faster than others. And that's rapidly becoming how we build each individual processor now. So much of the engineering that goes into building a modern chip goes not into the chips themselves, but into this interconnect that connects each one of those individual modules on the interposer into a single thing.
And what we take for granted today is this sort of flat view of memory, which has been an incredibly useful, incredibly potent abstraction throughout the life of programming. We've had chips where it breaks down, and just the utility of having a single view of memory has been so great that we've accepted the performance penalties of it not actually being how the hardware works under the hood, rather than take on the complexity of having to deal with it.
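Where that abstraction isn't good enough, the placement question today mostly has to be answered outside the language, for example with libnuma on Linux. A minimal sketch, assuming a Linux machine with libnuma installed; the node number and buffer size are arbitrary illustrations, not anything from the episode.

```cpp
// Sketch: asking for memory on a specific NUMA node with libnuma.
// Build with: g++ numa_sketch.cpp -lnuma
#include <numa.h>
#include <cstdio>
#include <cstdlib>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this system\n");
        return EXIT_FAILURE;
    }
    const std::size_t bytes = 64 * 1024 * 1024;  // 64 MiB, arbitrary
    // Place the buffer physically on node 0, regardless of where this thread runs.
    void* buf = numa_alloc_onnode(bytes, 0);
    if (buf == nullptr) {
        std::fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }
    std::printf("allocated %zu bytes on node 0 (system has %d node(s))\n",
                bytes, numa_max_node() + 1);
    numa_free(buf, bytes);
    return EXIT_SUCCESS;
}
```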
Ten years from now, we may live in a very different world
where people who are doing systems programming
have to think a lot more about not just allocating memory,
but, when I allocate this memory, where is this memory being allocated? You know, it could be the case that 10, 15 years from now, the way we program systems will be very different than it is today.
Yeah, I think that point about systems programming is really key, because that idea, the illusion of a flat memory model,
a consistent memory model,
is baked into something that we find in C++.
And that's why C and C++ had such a difficult time
on segmented architectures historically.
Yes, yes.
There's kind of an implicit notion there that you have this.
And exactly as you say,
what we're doing is maintaining an illusion.
We're maintaining this very powerful, very convenient abstraction. But that illusion has to be propped up on stage.
There's a whole load of stuff that happens behind it. And there's a cost for that.
But in practice, we're also looking at, I think, another idea that,
as you described, every single chip is a distributed system.
So we normally think of distributed systems,
we think we look out to the world and there are the distributed systems.
Actually, it's running right next to you.
You've got a distributed system.
And it's got this very different view of how memory is organized.
And that, I think, if you're not at the systems level,
then the illusion is okay. It's fine. But if you are at the systems level, this is the system.
And it challenges you with a lot of complexity. And it's having the languages or the paradigms
that will actually align with that and say, here's an easy way to think about it or work with this. And as you say, another decade could shift that at the lowest level.
It's definitely not standing still.
But at the higher level, people might not have to worry about it.
In other words, the upper level is actually potentially more stable in this respect.
Yeah.
You know, it's interesting.
I think we're all people who identify to some degree as systems programmers.
And it is, I think, to some degree, a nebulous and vague term.
But to me, systems programming is any form of programming in which you need, at least to some degree, to care about the underlying architecture of the platform. And that might
be operating system architecture, it might be hardware architecture. But it's a type of
programming where you need to think about the whole view of the system, not just application
logic. You need to think about what is it that I'm programming to? Yeah. How does this get organized?
When this is run, how is this organized at runtime
in terms of sequence of instructions,
the fact that pipelines matter and cache lines matter
and stuff like that?
And how is this organized in memory?
And then to start worrying about that.
And I think what's interesting from the C++ perspective is that C++, although we say it's a systems language and it allows us a lot of this access, it doesn't allow us a lot of this access. In other words, we're up at L4 caching at the moment, I think, and in 10 years' time, who knows, L57 caching, who knows what my processor is going to have. But C++ has no opinion on this. It can't talk about that, because it doesn't have an idea that that's actually what's going on and how my memory is organized. So it can't speak to that. In other words, the system that C++ is written against, so to speak, is the flat memory model that C grew up under, with processors where, really, what you saw is what you got. You didn't have to worry about all of this magic happening, this illusion being maintained. And so C++ has yet to cut through that. And it's very hard to do so.
We now talk about that as: that's the hardware people who worry about that. We no longer think of it as part of systems programming, but in one sense it is the system. And we often worry about the cache-friendliness of our data structures, and yet we can't talk about the cache in the language. We have to step outside the language to say, oh, caching matters, I profile it. But I can't talk about the caching inside the language. It's kind of removed from it.
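About as close as standard C++ gets is the pair of implementation-defined cache-line hints added in C++17. A minimal sketch, assuming a compiler and standard library that define them; these are hints about interference granularity, not a model of the cache hierarchy being described here.

```cpp
// Sketch: the little cache awareness standard C++ does expose (C++17, <new>).
#include <new>
#include <cstdio>

struct Counters {
    // Keep two frequently written counters on separate cache lines
    // to avoid false sharing between threads.
    alignas(std::hardware_destructive_interference_size) long reads  = 0;
    alignas(std::hardware_destructive_interference_size) long writes = 0;
};

int main() {
    std::printf("destructive interference size hint: %zu bytes\n",
                std::hardware_destructive_interference_size);
    std::printf("sizeof(Counters): %zu bytes\n", sizeof(Counters));
    return 0;
}
```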
So, you know, we're doing systems programming, yet without enough access to the system.
I have an interesting story on that subject, which is: I was at Berkeley Lab back when, gosh, I can't remember... Cori. Cori was the name of the supercomputer that was installed at Berkeley Lab.
And this was a Xeon Phi system.
And it was the second or third generation of Xeon Phi.
And this was the first one that was a standalone system.
So the Xeon Phi would boot up the OS and run the OS.
It wasn't an accelerator card.
And it was basically an x86 architecture, but with a lot more cores.
And this particular chip that we had in Cori, it had this HBM memory, this high bandwidth memory,
in addition to its sort of regular memory. And there were a few different modes for the HBM
memory. One mode was an explicitly programmable mode where you could explicitly allocate this
memory and use it as fast memory, because it was higher bandwidth, although maybe a little bit higher latency than other memory. And then there was another mode where it just acted as another layer of caching. And to switch between the modes, you needed to reboot the nodes. It was a
boot time option. And there's this big prize in the HPC space. It's basically the Nobel Prize for the
HPC world called the Gordon Bell Prize. And to win a Gordon Bell Prize, you have to do some runs
on a big, big supercomputer. And so, when Cori was installed, it was a top 10 supercomputer.
And so, right after installation, there were a number of teams, I think it was eight teams, that got exclusive access to this supercomputer
for a period of time to do full-scale, full system runs on this 10,000 node cluster to
try to get results for a Gordon Bell submission.
So, these teams, these were the best of the best.
These were the ninjas.
They had a ton of developers.
These were folks who had been
spending two years prior to this machine being installed,
learning this architecture.
Like I can dream in Xeon Phi assembly.
You know, I knew this chip very well
and everybody else who was working
on one of these teams knew this very well. And yet, out of those eight teams, seven of the teams used this chip
in the cache mode. Some of those teams were like, it's not worth the effort
of us having to change our code to explicitly use this separate pool of memory.
We get a good enough performance
boost just using this as cache, just using this implicitly. And only one of the teams tried
to use it explicitly, but they ran into too many issues and too many quirks, and they
ended up just using it in the cache mode too. And it was such a smart move of Intel. They were like,
you know what, we're going to try
something new. We're going to have this explicitly programmable mode. But somebody at Intel was like,
hey, wait, hang on a second. We got to have a fallback just in case this is really hard for
people to use and they don't switch to using this thing. We got to have something so that this thing
will be usable. And boy, was that a smart decision, because I'm sure almost everybody who used that chip
used it in the implicit mode.
And it just speaks to how pervasive
the flat memory space abstraction is.
And hiding caches from users,
hiding differences in memory hierarchy from users,
so much work in hardware goes into that.
And it's worth it because it's very hard to program otherwise.
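For context, the explicit mode described here was typically reached on Knights Landing through the memkind library's hbwmalloc interface rather than through the language. A minimal sketch; the DDR fallback policy is an illustrative assumption about how a team might structure it, not something from the episode.

```cpp
// Sketch: explicitly placing a hot buffer in high-bandwidth memory via memkind.
// Build with: g++ hbm_sketch.cpp -lmemkind
#include <hbwmalloc.h>   // hbw_malloc / hbw_free / hbw_check_available
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t n = 1u << 20;  // element count, arbitrary for illustration
    const bool in_hbm = (hbw_check_available() == 0);  // 0 means HBM is present

    double* buf = nullptr;
    if (in_hbm) {
        // Flat mode: put the bandwidth-bound buffer in MCDRAM explicitly.
        buf = static_cast<double*>(hbw_malloc(n * sizeof(double)));
    } else {
        // No HBM exposed (e.g. cache mode): fall back to ordinary DDR.
        buf = static_cast<double*>(std::malloc(n * sizeof(double)));
    }
    if (buf == nullptr) return EXIT_FAILURE;

    for (std::size_t i = 0; i < n; ++i) buf[i] = 0.0;  // touch the memory
    std::printf("buffer of %zu doubles placed in %s\n", n, in_hbm ? "HBM" : "DDR");

    if (in_hbm) hbw_free(buf); else std::free(buf);
    return EXIT_SUCCESS;
}
```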
Be sure to check these show notes either in your podcast app
or at ADSPthepodcast.com for links to anything we mentioned in today's episode
as well as a link to a GitHub discussion
where you can leave thoughts, comments, and questions.
Thanks for listening. We hope you enjoyed and have a great day.
Low quality, high quantity.
That is the tagline of our podcast.
It's not the tagline.
Our tagline is chaos with sprinkles of information.