Algorithms + Data Structures = Programs - Episode 193: Kevlin Henneys with Kevlin Henney

Starting point is 00:00:00 There will be failures. They will need to be reported. How are you going to deal with that? Welcome to ADSP: The Podcast, episode 193, recorded on July 11th, 2024. My name is Connor, and today with my co-host Bryce, we chat with Kevlin Henney about Kevlin Henneys. So, switching topics a little bit, but one thing we haven't talked about is, I've realized the original way that I know of Mr. Kevlin Henney is not from his talks or from his books, but from the term "Kevlin Henney", which I can tell from the look on his face is something that he is well known for, but perhaps not the thing that he would like to be remembered for. But can you explain what a Kevlin Henney is? Yeah, I think I'm going to be remembered for it because, you know, my name is now associated with failure. Literally. So, back in the day, I started taking screenshots. You know, if an application started crashing,

Starting point is 00:01:16 I'd take a screenshot of it. You know, it's just like, oh, okay, let's say. And I'd sometimes integrate it into a talk. And then mobile phones with cameras, that was a thing. And it's just like, oh, okay, I can start taking, you know, we've got software everywhere in the world. But, you know, it's not all perfect. And we see a number of failures. And sometimes it's the software you write, sometimes it's something else about the system. But from the public's point of view, it's a computer, it's failing, that's it. And I took pictures of these. I would include them in my talks. I'd include them in my workshops.

Starting point is 00:01:49 Sometimes just to make a point, guess what? Software runs the planet. Here's the thing that doesn't work. Sometimes, you know, one of my favorite ones from about 15, 20 years ago was a screenshot I took of PowerPoint crashing. And one of the things is a dialog box saying, basically, memory error in a particular DLL and pure virtual function call. And I said, what can I learn from this? I said, here is an application that offers you, well, we were talking about illusions. The whole point about software, that is the business of software.

Starting point is 00:02:27 It's creating an illusion. Okay. Whatever you're doing, you're creating an illusion. And for as long as anything works, that illusion is maintained. The minute something breaks, there's a crack. And you start to see how the thing was built, how it was constructed. It's like a magic trick going wrong. It's like a stage set falling apart.

Starting point is 00:02:48 You go, oh, that's how they do that. Oh, okay, I can see behind the scenes. And so here was PowerPoint. Okay, sure, I knew Microsoft used C++, but suddenly I'm getting information about their DLLs. In fact, if I remember correctly, it was an Internet Explorer DLL that was failing. And I thought, here I am in PowerPoint,

Starting point is 00:03:03 but an Internet Explorer DLL failing is causing PowerPoint to crash. That's interesting. And it's written in C++ and there's a pure virtual function call. Now, how can you call a pure virtual function? Well, there's a couple of cases where that might happen. Perhaps there's a path in the code

Starting point is 00:03:19 that actually leads to calling a pure virtual function. Somebody mistakenly called a pure virtual function in a constructor or destructor of the class in which it was declared. That happens every now and then. It's statically detectable. So perhaps Microsoft are not using static analysis on this or code reviews. Or alternatively, it's a memory trampling thing. Something got zeroed. And so therefore, we've got a use-after-free issue. You know, in other words, I'm speculating. I don't have the answer. But I'm already learning something about the way the system was constructed by the way that it is failing. And you can see this in a number of

Starting point is 00:03:54 public screens. So I would use these kind of instructions kind of humorously, and sometimes people would then send me examples, photos they had taken. You know, hey, look at my printer. Here's the device driver version on my printer, this kind of stuff, because my printer's crashed and they got all this information. And then we hit the era of social media. And people don't start emailing me, they are @-ing me, they are including me. And then I start basically resharing it. And that's particularly visible on Twitter. I started resharing these screens because I thought this is kind of interesting. I think it's amusing, but it's also educational.

Starting point is 00:04:31 But it's also humbling. You know, if we're software developers, we kind of make the world go round. And so, therefore, our software is in everybody's face in one way or another. Whatever level you are in the stack, there's an error message for you. And then, it was around 2016, somebody started calling it, oh, I saw a Kevlin Henney screen at the airport. And that moniker kind of stuck.

Starting point is 00:04:58 And so, yeah, it's kind of ended up being a thing. And, yeah, it has actually got to that point that, you know, my wife started a new job last year and it's just like, oh yeah, what's your husband known for? And she said, oh, well, failure screens are named after him in some circles. It's just like, oh, okay, yeah, that's kind of different. But yeah, that is actually the way that works. And it is kind of interesting that that's become popular. But, you know, I don't mind that association with my name. Normally it's a case of it's really

Starting point is 00:05:36 just software developers and a few people who know software developers. Yeah. But also it's kind of an interesting thing, that idea that failure screens are a thing, that we can learn from them. I also like to think of it as a public service. The ones that I'm going to say are a public service are when people are at, you know, train stations and stuff like that. Sometimes they will add in the relevant rail company. Yeah, yeah. And normally they will respond, oh, we're really sorry about this. Which station was this? We could try and get this fixed.

Starting point is 00:06:10 It's kind of like, yep, just happy to help, happy to be associated with this and to have people report it. And so my sort of succinct definition of what a Kevlin Henney screen is, is specifically it's a containment failure. It's not only a failure screen, but it's a failure screen that exposes a diagnostic that was intended for consumption by programmers or the people who designed the system, not by humans. When I was checking into the Hilton Hotel at the St. Louis committee meeting, Hilton was experiencing like a worldwide outage with their entire check-in system. And all of the hotels had to just – it was just the honor system.

Starting point is 00:06:56 They were just like, oh, what were your check-in dates? Like, you know, we'll just figure it out when the system goes back up. But I got this error on the app, which is a little hard to see here, but I'll read it out, where it said like, "minus 7000, colon, RCI check-in failed, open DB equals minus 7000, application and database versions are out of sync." So that is a Kevlin Henney in my book. But if it had simply said, we're sorry, we're experiencing a system error, we can't check you in right now, that I don't think I would call a Kevlin Henney, because that is, I think, the proper way for an error to be surfaced. And in particular,

Starting point is 00:07:39 if I see an error message like this, where it tells me application and database versions are out of sync, what I'm getting from that is that this has failed in some way that has not been predicted by the people who developed this system. And also, if I'm seeing this error, it's likely that it's not been reported back to whoever runs that system. Because if it was, they would probably log it and send it back to them and then just tell me, like, we're so sorry, this failed. But it's like it's some edge case where the unexpected has happened. It's that idea, it was in some way unexpected. And I'd agree with you, because what you're learning is, typically, from my perspective, regardless of how other people might use it, I definitely regard that as a Kevlin Henney in the sense that that, for me, is of interest because it reveals something about the way the system was built in a way that was clearly not intended. It's not part of the illusion.

Starting point is 00:08:34 If something says, I'm sorry, we're experiencing problems right now, that's just, I'm sorry, this was anticipated. It was planned for. This is actually business as usual in that sense, because it was anticipated. But if in failing it surprises both the reader and a developer, you know, if you said this to the developer, it's just, oh, that's surprising, we weren't expecting that. You know, it surprises everybody, and it shows us something about what was assumed and about how something was built. Yeah, and this can be integer overflow errors. You know, every now and then you get a piece of software that gives you, you know, maximum, the

Starting point is 00:09:16 max for a 32-bit number, and you sort of suddenly, well, yeah, that probably shouldn't be on the screen. And you suddenly see various bits and pieces. And one of the most common ones people experience is NaN in web pages. You know, that was a popular one. I got a few of those. We just had a general election in this country, and there was a point where, I think it was the BBC's coverage, they were talking about various percentages, because that's what you do on election night. And apparently, you know, NaN out of NaN seats have been won by. It's just like, OK, the parliamentary swing is NaN. OK, well, that's probably not what you wanted to say.
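A minimal sketch of how a display layer could avoid the "NaN out of NaN" case described above. The function name `format_share` and the fallback wording are hypothetical, invented for illustration; this is not the BBC's code or any real site's. Python raises on division by zero rather than producing NaN, so the sketch creates the NaN explicitly to mimic the web-page behaviour.

```python
import math


def format_share(won: float, total: float) -> str:
    """Format won/total as a percentage without leaking NaN to the reader."""
    # Python raises ZeroDivisionError for 0 / 0, so produce NaN explicitly
    # to stand in for what a JavaScript front end would compute.
    share = won / total if total else float("nan")
    # math.isfinite is False for NaN and for +/- infinity.
    if not math.isfinite(share):
        return "No results yet"
    return f"{share:.1%}"


if __name__ == "__main__":
    print(format_share(0, 0))      # nothing declared yet
    print(format_share(325, 650))  # half of the seats
```

The design choice is simply to decide, on purpose, what the reader sees when the number does not exist, instead of letting the internal representation reach the screen.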

Starting point is 00:09:55 And that kind of thing is the thing that is surprising. It shows that something was not anticipated. It shows you that something is broken in a way that surprises all people concerned. But it also reveals how the thing was built. It reveals something internal. The diagnostics that I like best from user-facing things are sort of a happy middle ground between the two. And I think Hulu and the Microsoft blue screen of death perhaps do this best. So in Hulu, if there's some sort of error, Hulu will usually say, like, we're sorry, there was an error, like, you know,

Starting point is 00:10:34 we couldn't play this video, you know, please refer, and it'll give you instructions, like, please try logging out or something. But it also has an error code, and it's not an error code that tells you, the user, anything. That's not necessarily what I'm looking for. If I get an app that just tells me, we're sorry, we're experiencing a system error, and then I go to report that to the people who created the app, because maybe the issue is on my end, maybe it's something that hasn't been reported yet, if I want to be helpful and report it, if all I have is, sorry, it's a system error, I can send a screenshot.

Starting point is 00:11:10 But that doesn't necessarily have the information they need. But if it's something like a Hulu error where they've got some code, one, I can give them the code. And two, I can Google the code. And in some cases, I do learn some information about that, because maybe some people online have figured out what these codes mean. And I think that that's the best model for things that have, yeah, user-facing interfaces. Like, mask, don't expose programmer stuff up to civilians, but have some reference code or something so that if somebody reports it, you can get the useful information out of it. Yeah, it offers something distinct, but it shows both.
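One way to sketch the "mask it, but give a reference code" idea in Python. Everything here is an assumption made for illustration: the function `report_failure`, the logger name, and the message wording are hypothetical, and the sketch generates a per-incident reference, whereas the Hulu codes described above appear to identify a kind of error. The diagnostic detail goes to the log; the user only sees an apology and something quotable.

```python
import logging
import uuid

logger = logging.getLogger("app.errors")  # hypothetical logger name


def report_failure(exc: BaseException) -> tuple:
    """Log full detail internally; return (reference, user-facing message)."""
    # Short random reference that ties the user's report to the log entry.
    reference = uuid.uuid4().hex[:8].upper()
    # exc_info carries the exception and its traceback into the log record,
    # so the programmer-facing detail never reaches the screen.
    logger.error("ref=%s unhandled failure", reference, exc_info=exc)
    message = (
        "Sorry, something went wrong on our side. "
        f"If you contact support, please quote reference {reference}."
    )
    return reference, message


if __name__ == "__main__":
    try:
        raise RuntimeError("application and database versions are out of sync")
    except RuntimeError as error:
        _, text = report_failure(error)
        print(text)
```

A user who reports the reference gives support a key into the logs, which is the "treasure item" role a reference code plays.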

Starting point is 00:11:50 Yeah, something's gone wrong, but basically it shows a kind of a deliberate approach. There's an intent here. We have designed for this. We have accounted for the things that cannot be accounted for. It's part of our architecture. I think a lot of developers regard that as either not, you know, sometimes it falls between the posts. Sometimes you have an organization where, you know, UI designers are kept separate, work separately from developers, and the UI designers go, our user experience is all about this, that. Errors are not part of the user

Starting point is 00:12:24 experience that we care about because that's not shiny and interesting. But actually errors are part of the user experience. And then developers are sitting there going, well, that's kind of error stuff. That's not really what I'm here to do. And maybe it fails in a way they're thinking from a programmer point of view,

Starting point is 00:12:41 but not from a user perspective. So it falls between these two posts, and you often end up with something inadequate either way. Either you end up sort of anticipating but not offering useful feedback or information somebody can do something with. In other words, you just say, we're sorry, there is a problem. Well, that's astoundingly vague. I hate those ones where, you know, sometimes you get, it's just like, an error has occurred. So, yes, but how exactly? I can't Google for this, and I also can't report it meaningfully. I can't give them the thing that they would give me, you know, here is the little

Starting point is 00:13:17 treasure item that you can give us and we will unlock this. You know, it's just like, this is the key for us. They don't give me that information. I can't do anything with it. "There is an error" is not false, but it's so vague that the truth is not useful. And then at the other end of the scale is, like, here's your stack trace. You know, here's this arbitrary message. And that is kind of like, what does that mean to me as a user? You know, I can't do anything with that either. But it also shows me you didn't care or hadn't really thought that failure was an option.

Starting point is 00:13:48 And this whole topic also is applicable at the level of libraries and our interactions with other programmers and how our libraries report errors. At the start of my career, I worked on HPX, which was a distributed system. And we had this distributed exception mechanism where if something failed in another node, that exception would get propagated. And eventually, asynchronously, you would receive the failure as the system got shut down. And even within a single node, if you had something that failed in another thread,

Starting point is 00:14:25 then it would be reported as an exception while the other threads shut down. And one problem we would have is people would send us error reports where they would just be like, I got this exception. I don't know what it means. I don't know where it came from.

Starting point is 00:14:41 And it would be very hard for them to, on their own, get a stack trace in a debugger. Because the asynchronous event that caused the failure has already occurred and is gone. And this is just the reporting of that thing. And so, we built into our exception throwing and catching mechanism something that would take a stack trace. And we also would capture the environment variables. And one of the reasons we did that is, in a distributed system, we would want to know, like, hmm, what's the

Starting point is 00:15:12 specific setup of this node where the thing failed? And you see actually a lot of programming languages and a lot of libraries have a stack trace that'll print out by default in their error reporting. Like, Python does this. And, like, I segfaulted the Python interpreter doing some

Starting point is 00:15:32 wild stuff the other day and I didn't get a stack trace, and I was like, what do I do now? I don't have a stack trace, and I don't know how to go about getting GDB running on the Python interpreter in the way I want. And LLVM does this as well. A lot of compilers will dump out a stack trace of the compiler internals. And some compilers will even say, hey, we've put into a file in /tmp all the information that you need to file a bug report, just go copy this file and send it as a bug report. And I think that if you're developing a system, anything that's going to play at scale, you really want to think about this: how are people going to report bugs to you?
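The idea of capturing a stack trace and part of the environment at the point of failure, then handing the user one file to attach, could be sketched in Python roughly as follows. This is not how HPX or any compiler does it; `write_bug_report` and the `APP_` prefix are hypothetical names chosen for the example. The environment is filtered through an allow-list of prefixes because dumping every variable could put secrets into a file that users are asked to send around.

```python
import json
import os
import platform
import sys
import tempfile
import traceback


def write_bug_report(exc: BaseException, env_prefixes=("APP_",)) -> str:
    """Write a JSON bug report for `exc` and return the path of the file."""
    report = {
        "exception": repr(exc),
        # Formatted traceback lines for the exception, captured while the
        # failure context is still available.
        "stack_trace": traceback.format_exception(
            type(exc), exc, exc.__traceback__
        ),
        "python": sys.version,
        "platform": platform.platform(),
        # Only allow-listed variables; env_prefixes must be a tuple of str.
        "environment": {
            key: value
            for key, value in os.environ.items()
            if key.startswith(env_prefixes)
        },
    }
    fd, path = tempfile.mkstemp(prefix="bug-report-", suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
    return path


if __name__ == "__main__":
    try:
        open("/no/such/file")
    except OSError as error:
        print("Please attach this file to your bug report:",
              write_bug_report(error))
```

The report is written at the moment the exception is handled, so the person filing the issue does not need a debugger to reproduce where it came from.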

Starting point is 00:16:23 Because otherwise you're going to get people opening issues which are just like, I got, you know, a file-not-found exception, what does that mean? Yeah. And yeah, you're going to end up with a lot of noise. I mean, that's the thing, is either you end up with information that's not useful because it's not really information, or you end up with just, like, a load of noise and you've got to cut through that to try and find something. And I think that, again, it goes back to this kind of issue of constraints and how sometimes the shape of development changes over time. You will, you know, there will be failures. They will need to be reported. How are you going to deal with

Starting point is 00:17:08 that? And that is something that historically people have always said, going back to, I just highlighted it as the UX people versus the developers before. But actually, I'm going to say it's all wrong. So let's take a step further back. It's just like, this is in the architecture. Sometimes a developer will say, oh, I'm going to solve it like this. And they're thinking of it as a programming problem. It's actually not a programming problem, it's architectural. It's about, what's the user experience? What about the customer relations? So it's even the customer experience at that level. And so although a developer sometimes sees, you know, you're close to the code, and so you see it as a localized problem. Oh, a bad thing might happen. Oh, I'll just, you know, either we report this or I just log it or something simple, or I throw an exception, or I just rely on the magic of undefined behavior to figure out whatever this platform is going to do.

Starting point is 00:17:55 And maybe that is going to be a segfault with a core dump, or maybe it's going to be a Windows exception or whatever, and, you know, it's somebody else's problem. Again, Douglas Adams: having an SEP field around something, it's somebody else's problem, makes it invisible. And each group thinks that. The architect says, oh, that's just error handling, that's a programmer problem. The user experience person goes, like, well, you know what, this isn't really what this is about. Even the product owner is going to start saying, oh, well, that's not really what we're in this for. That's not the value of this product. So it's kind of everybody's problem, which is why sometimes it ends up being nobody's problem, but it becomes much more visible. We see these failures. They are frustrating, but how do people act on them? How enabled do they feel? And then going back to where you say, if I've got a searchable term, then maybe I find something online.

Starting point is 00:18:45 And as a user, I can understand, oh, okay, other people have this. It's not just me. Or this is how I report it. Or they've thought out a process. Just send us this file and that's fine, you know, for the more developer-centric thing. But what you're doing is you're actually enabling people to say, oh, okay, you know, you're not ignoring me. There's nothing worse: it failed and

Starting point is 00:19:05 you're on your own, kid. Oh, wow. You're stuck in the rain. It's dark. It's miserable. And you can't get your work done because whatever has failed. Here, you're at least offering people, like, okay, you've had a problem, but don't worry, we've got this. You know, there's a thing that you can do. And, you know, we apologize for that and so on. And that's not something that happens by an individual developer having an afterthought. It's very much kind of a whole team and a whole architecture view rather than a bolt-on and a detail. Yeah, you have to design for failure and you have to design for debugging. And, you know, if you do that, like, I think the languages that stack trace by default on error, it's far, like,

Starting point is 00:19:54 it's far simpler for me to fix a problem because I'm going to say 70, 80% of the time, just from the stack trace, uh, just knowing like where in the code, which function did the bad thing happen from, is enough for me to know. And in C++, I don't get that by default. And if I open up the debugger in C++, I, by default, don't compile my code

Starting point is 00:20:21 with debugging information on it. I compile it with O3. And I have a lot of problems that only show up at O3. And by default, I open up the debugger and it doesn't give me the stack trace. I don't have a way to get that stack trace and figure out exactly where I am. I was thinking just, you were talking about Hilton Honors

Starting point is 00:20:37 and I've got a screenshot from a website. I can't remember what I was doing, but it was Hilton. And it's a Hilton Honors one. And it gives me this perfect stack trace. It's a Java stack trace and you can go through it, you can tell everything about the technology stack that they're using. Which, curiously enough, leads us straight back into the discussion of security, because you can kind of look at this, it's like a manifest of, like, oh, you're building it like that, oh, okay. And if you have

Starting point is 00:21:04 any kind of guesses, like, oh, okay, you're using this version, which I happen to know has a vulnerability, then, you know, if you're a bad actor, you can get in on that. Oh, yes. Seeing language or stack version information in a publicly exposed error from anything that's on the internet is a little bit scary. It says something about the security posture of the system. I wonder how many vulnerabilities have grown out of information that's been found from a Kevlin Henney screen.

Starting point is 00:21:40 Oh, that's a really interesting thought, isn't it? Actually. Yeah. Because sometimes you don't even need to know anything about the stack. It's just like, if you get enough information about what type of error has occurred, and you can re-trigger the error again, then you might be able to figure out how to use that to do something evil. Yeah, something malicious. Oh, I like that. I mean, of course, I don't know, but no, I'm not on record for saying that. Damn. No, but yeah, that's the point. And again, by showing the fracture lines, you know, in this broken illusion, it allows somebody

Starting point is 00:22:18 else to potentially, it's interesting from an academic perspective, it's interesting from a support perspective, but for somebody who is a bad actor, it's kind of like, OK, I know something here that I can take advantage of. And that's the caution that needs to be done. And again, it reveals, again, why this is an architectural question. It's got to be able to say, this is how we fail. This is how we fail gracefully, you know, with sufficient grace and sufficient information, but without dropping our whole hand on the table and saying, here's what I'm playing. You know, there's got to be this balance, but it has to be intentional. It doesn't happen by accident. Be sure to check the show notes, either in your podcast app or at ADSPthePodcast.com, for links to anything we mentioned in today's episode, as well as a link to a GitHub discussion where you can leave thoughts, comments and questions. Thanks for listening. We hope you enjoyed and have a great day.

Starting point is 00:23:11 Low quality, high quantity. That is the tagline of our podcast. It's not the tagline. Our tagline is chaos with sprinkles of information.
