Algorithms + Data Structures = Programs - Episode 193: Kevlin Henneys with Kevlin Henney
Episode Date: August 2, 2024
In this episode, Bryce chats with Kevlin Henney about Kevlin Henneys.
Link to Episode 193 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)
Twitter
ADSP: The Podcast
Conor Hoe...kstra
Bryce Adelstein Lelbach
About the Guest
Kevlin Henney is an independent consultant, speaker, writer and trainer. His software development interests are in programming, practice and people. He has been a columnist for various magazines and websites. He is the co-author of A Pattern Language for Distributed Computing and On Patterns and Pattern Languages, two volumes in the Pattern-Oriented Software Architecture series, and editor of 97 Things Every Programmer Should Know and co-editor of 97 Things Every Java Programmer Should Know.
Show Notes
Date Recorded: 2024-07-11
Date Released: 2024-08-02
HPX
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons - Attribution 3.0 Unported - CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
There will be failures. They will need to be reported. How are you going to deal with that?
Welcome to ADSP: The Podcast, episode 193, recorded on July 11th, 2024. My name is Conor, and today with my co-host Bryce, we chat with Kevlin Henney about Kevlin Henneys.
So, switching topics a little bit, but one thing we haven't talked about is, I've realized the original way that I know of Mr. Kevlin Henney is not from his talks or from his books, but from the term "Kevlin Henney," which I can tell from the look on his face is something that he is well known for, but perhaps not the thing that he would like to be remembered for. But can you explain what a Kevlin Henney is?
Yeah, I think I'm going to be remembered for it because, you know, my name is now associated with failure. Literally. So back in the day, I started taking screenshots. You know, if an application started crashing, I'd take a screenshot of it, and I'd sometimes integrate it into a talk. And then mobile phones with cameras, that was a thing. And it's just like, oh, okay, I can start taking pictures. You know, we've got software everywhere in the world, but it's not all perfect, and we see a number of failures. And sometimes it's the software you write, sometimes it's something else about the system. But from the public's point of view, it's a computer, it's failing, that's it. And I took pictures of these.
I would include them in my talks.
I'd include them in my workshops.
Sometimes just to make a point, guess what?
Software runs the planet.
Here's the thing that doesn't work.
One of my favorite ones, from about 15, 20 years ago, was a screenshot I took of PowerPoint crashing. And one of the things is a dialog box saying, oh yeah, basically, memory error in a particular DLL, and "pure virtual function call."
And I said, what can I learn from this? I said, here is an application that offers you... We were talking about illusions. That is the whole point about software, that is the business of software.
It's creating an illusion.
Okay.
Whatever you're doing, you're creating an illusion.
And for as long as anything works, that illusion is maintained.
The minute something breaks, there's a crack.
And you start to see how the thing was built, how it was constructed.
It's like a magic trick going wrong.
It's like a stage set falling apart.
You go, oh, that's how they do that.
Oh, okay, I can see behind the scenes.
And so here was PowerPoint.
Okay, sure, I knew Microsoft used C++,
but suddenly I'm getting information about their DLLs.
In fact, if I remember correctly,
it was an Internet Explorer DLL that was failing.
And I thought, here I am in PowerPoint,
but an Internet Explorer DLL failing
is causing PowerPoint to crash.
That's interesting.
And it's written in C++
and there's a pure virtual function call.
Now, how can you call a pure virtual function?
Well, there's a couple of cases where that might happen.
Perhaps there's a path in the code
that actually leads to calling a pure virtual function.
Somebody mistakenly called a pure virtual function in a
constructor or destructor of the class in which it was declared. That happens every now and then.
It's statically detectable. So perhaps Microsoft are not using static analysis on this or code
reviews. Or alternatively, it's a memory trampling thing. Something got zeroed. And so therefore, we've got a use-after-free issue. You know, in other words, I'm speculating. I don't have the answer. But I'm already learning something about the way the system was constructed by the way that it is failing. And you can see this in a number of public screens. So I would use these kind of humorously, and sometimes people would then send me examples, photos they had taken. You know, hey, look at my printer, here's the device driver version on my printer, this kind of stuff, because my printer's crashed and they got all this information.
And then we hit the era of social media. And people don't email me anymore, they are @-ing me, they are including me. And then I start basically resharing it. And that's particularly visible on Twitter. I started resharing these screens because I thought this is kind of interesting.
I think it's amusing, but it's also educational.
But it's also humbling.
You know, if we're software developers, we kind of make the world go round.
And so, therefore, our software is in everybody's face in one way or another.
Whatever level you are in the stack, there's an error message for you.
And then people started calling.
It was around 2016.
Somebody started calling it, oh, I saw a Kevlin Henney screen at the airport.
And that moniker kind of stuck.
And so, yeah, it's kind of ended up being a thing.
And, yeah, it has actually got to that point that, you know, my wife started a new job last year and it's just like, oh yeah, what's your husband known for? And she said, oh, well, he's got failure screens named after him in some circles. It's just like, oh, okay, yeah, that's kind of different. But yeah, that is actually the way that works. And it is kind of interesting that that's become popular. But, you know, I don't mind that association with my name. Normally it's really just software developers and a few people who know software developers. But also it's kind of an interesting thing, that idea that failure screens are a thing and we can learn from them. I also like to think of it as a public service. The ones that I'm going to say are a public service are when people are at train stations and stuff like that. Sometimes they will add in the relevant rail company, and normally they will respond, oh, we're really sorry about this. Which station was this? We could try and get this fixed. It's kind of like, yep, just happy to help, happy to be associated with this and to have people report it.
And so my sort of succinct definition of what a Kevlin Henney screen is,
is specifically it's a containment failure. It's not only a
failure screen, but it's a failure screen that exposes a diagnostic that was intended for
consumption by programmers or the people who designed the system, not by humans.
When I was checking into the Hilton Hotel at the St. Louis committee meeting, Hilton was experiencing like a worldwide outage with their entire check-in system.
And all of the hotels had to just – it was just the honor system.
They were just like, oh, what were your check-in dates?
Like, you know, we'll just figure it out when the system goes back up. But I got this error on the app, which is a little hard to see here, but I'll read it out, where it said, like, minus 7000, colon, RCI check-in failed, open DB equals minus 7000, application and database versions are out of sync. So that is a Kevlin Henney in my book. But if it had simply said, we're sorry, we're experiencing a system error, we can't check you in right now, that I don't think I would call a Kevlin Henney, because that is, I think, the proper way for an error to be surfaced. And in particular, if I see an error message like this, where it tells me application and database versions are out of sync, what I'm getting from that is that this has failed in some way that has not been predicted by the people who developed this system. And also, if I'm seeing this error, it's likely that it's not been reported back to whoever runs that system. Because if it was, they would probably log it and send it back to them and then just tell me, like, we're so sorry, this failed. It's like it's some edge case where the unexpected has happened.
It's that idea, it was in some way unexpected. And I'd agree with you, because, from my perspective, regardless of how other people might use it, I definitely regard that as a Kevlin Henney, in the sense that, for me, it's of interest because it reveals something about the way the system was built in a way that was clearly not intended. It's not part of the illusion.
If something says, I'm sorry, we're experiencing problems right now, that's just, I'm sorry, this was anticipated. It was planned for. This is actually business as usual in that sense, because it was anticipated. But if in failing it surprises both the reader and a developer, you know, if you said this to the developer, it's just, oh, that's surprising, we weren't expecting that. It surprises everybody, and it shows us something about what was assumed and about how something was built. And this can be integer overflow errors. You know, every now and then you get a piece of software that gives you the max for a 32-bit number, and you sort of say, well, yeah, that probably shouldn't be on the screen. And you suddenly see various bits and pieces. And one of the most common ones people experience is NaN in web pages. That was a popular one. I got a few of those. We just had a general election in this country, and there was a point where, I think it was the BBC's coverage, they were talking about various percentages, because that's what you do on election night, and apparently, you know, NaN out of NaN seats have been won by... It's just like, okay, the parliamentary swing is NaN. Okay, well, that's probably not what you wanted to say.
And that kind of thing is the thing that is surprising.
It shows that something was not anticipated.
It shows you that something is broken in a way that surprises all people concerned.
But it also reveals how the thing was built.
It reveals something internal.
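As an illustration of the NaN case just described, here is a minimal sketch of keeping an undefined percentage off the screen. It is written in Python rather than the JavaScript a web page would use, and `format_share` and the "n/a" placeholder are hypothetical names chosen for this example, not anything from the systems mentioned.

```python
import math


def format_share(won: float, total: float) -> str:
    """Format won/total as a percentage, falling back to a placeholder.

    Hypothetical helper: the point is that the "no data yet" case is
    anticipated and rendered deliberately instead of leaking NaN.
    """
    if total <= 0:
        return "n/a"  # nothing counted yet, so there is no meaningful share
    ratio = won / total
    if not math.isfinite(ratio):
        return "n/a"  # NaN or infinity arrived from upstream data
    return f"{ratio:.1%}"


# What leaks onto the screen when nobody checks:
print(f"{float('nan'):.1%}")

# The anticipated versions:
print(format_share(1, 4))
print(format_share(0, 0))
```

The design choice is small but intentional: the placeholder is part of the interface, so an empty result set on election night reads as "no data" rather than as a glimpse of the arithmetic underneath.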
The diagnostics that I like best from user-facing things are sort of a happy middle ground between the two.
And I think Hulu and the Microsoft blue screen of death perhaps do this best. So in Hulu, if there's some sort of error, Hulu will usually say, like, we're sorry, there was an error, we couldn't play this video, and it'll give you instructions, like, please try logging out or something. But it also has an error code, and it's not an error code that tells you, the user, anything. That's not necessarily what I'm looking for. If I get an app that just tells me, we're sorry, we're experiencing a system error, and then I go to report that to the people who created the app, because maybe the issue is on my end, maybe it's something that hasn't been reported yet, if I want to be helpful and report it, if all I have is, sorry, it's a system error, I can send a screenshot. But that doesn't necessarily have the information they need. But if it's something like a Hulu error where they've got some code, one, I can give them the code. And two, I can Google the code. And in some cases, I do learn some information about that, because maybe some people online have figured out what these codes mean. And I think that that's the best model for things that have user-facing interfaces. Like, mask it, don't expose programmer stuff to civilians, but have some reference code or something so that if somebody reports it, you can get the useful information out of it.
Yeah, it offers something distinct, but it shows both.
Yeah, something's gone wrong, but basically it shows a kind of a deliberate approach.
There's an intent here.
We have designed for this.
We have accounted for the things that cannot be accounted for. It's part of our architecture. I think a lot of developers regard that as... you know, sometimes it falls between the posts. Sometimes you have an organization where UI designers are kept separate, work separately from developers, and the UI designers go, our user experience is all about this, and errors are not part of the user experience that we care about, because that's not shiny and interesting. But actually errors are part of the user experience. And then developers are sitting there going, well, that's kind of error stuff. That's not really what I'm here to do. And maybe it fails in a way where they're thinking from a programmer point of view, but not from a user perspective. So it falls between these two posts, and you often end up with something inadequate either way. Either you end up sort of anticipating but not offering useful feedback or information somebody can do something with. In other words, you just say, we're sorry, there is a problem. Well, that's astoundingly vague. I hate those ones where you sometimes get just, like, "an error has occurred." Yes, but how exactly? I can't Google for this, and I also can't report it meaningfully. I can't give them the thing that they would give me. You know, here is the little treasure item that you can give us and we will unlock this, this is the key for us. They don't give me that information. I can't do anything with it.
"There is an error" is not false, but it's so vague that the truth is not useful.
And then at the other end of the scale is, like, here's your stack trace.
You know, here's this arbitrary message.
And that is kind of like, what does that mean to me as a user?
You know, I can't do anything with that either.
But it also shows me you didn't care or hadn't really thought that failure was an option.
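The middle ground between "an error has occurred" and a raw stack trace can be sketched roughly as follows. This is only an illustration under stated assumptions: `VersionMismatchError`, the codes `CHK-7000` and `GEN-0001`, the logger name, and `public_message` are all hypothetical, and nothing here reflects how Hulu or Hilton actually handle errors.

```python
import logging
import uuid

logger = logging.getLogger("checkin")  # hypothetical logger name


class VersionMismatchError(RuntimeError):
    """Hypothetical internal failure, modeled on the message read out earlier."""


# Hypothetical public codes: stable, searchable, and free of internal detail.
PUBLIC_CODES = {VersionMismatchError: "CHK-7000"}
FALLBACK_CODE = "GEN-0001"


def public_message(exc: Exception) -> str:
    """Log the full diagnostic internally and return a safe user-facing message."""
    code = next(
        (c for t, c in PUBLIC_CODES.items() if isinstance(exc, t)), FALLBACK_CODE
    )
    incident = uuid.uuid4().hex[:8]  # ties this one report to one log entry
    # The traceback and raw message go to the log, never to the screen.
    logger.error("code=%s incident=%s", code, incident, exc_info=exc)
    return (
        "Sorry, we can't check you in right now. Please try again later. "
        f"(Error code {code}, incident {incident})"
    )


def check_in() -> str:
    try:
        raise VersionMismatchError("application and database versions are out of sync")
    except Exception as exc:
        return public_message(exc)


message = check_in()
print(message)
```

The user gets something they can search for and quote to support, the developers get the full diagnostic keyed by the incident identifier, and the screen itself says nothing about how the system is built.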
And this whole topic also is applicable
at the level of libraries
and our interactions with other programmers
and how our libraries report errors.
At the start of my career, I worked on HPX,
which was a distributed system.
And we had this distributed exception mechanism where if something failed in another node, that exception would get propagated.
And eventually, asynchronously, you would receive the failure as the system got shut down. And even within a single node, if you had something that failed in another thread,
then it would be reported as an exception
while the other threads shut down.
And one problem we would have is
people would send us error reports
where they would just be like,
I got this exception.
I don't know what it means.
I don't know where it came from.
And it would be very hard for them to,
on their own, get a stack trace in a debugger. Because the asynchronous event that caused the failure
has already occurred and is gone. And this is just the reporting of that thing.
And so we built into our exception throwing mechanism something that would take a stack trace. And we also would capture the environment variables. And one of the reasons we did that is, it's a distributed system. We would want to know, like, hmm, what's the specific setup of this node where the thing failed? And you see actually a lot of programming languages and a lot of libraries have a stack trace that'll print out by default in their error reporting. Like, Python does this. And I segfaulted the Python interpreter doing some wild stuff the other day, and I didn't get a stack trace, and I was like, what do I do now? I don't have a stack trace, and I don't know how to go about getting GDB running on the Python interpreter in the way I want. And LLVM does this as well. A lot of compilers will dump out a stack trace of the compiler internals. And some compilers will even say, hey, we've put into a file in /tmp all the information that you need to file a bug report, just go copy this file and send it as a bug report. And I think that if you're developing a system, anything that's going to play at scale, you really want to think about this: how are people going to report bugs to you? Because otherwise you're going to get people opening issues which are just like, I got, you know, a "file not found" exception, what does that mean?
Yeah, and you're going to end up with a lot of noise. I mean, that's the thing: either you end up with information that's not useful because it's not really information, or you end up with just a load of noise and you've got to cut through that to try and find something.
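The idea of capturing the stack trace and some environment at the point where an error is thrown, so that it survives being reported later from another thread, can be sketched like this. It is not HPX's mechanism or API; `DiagnosticError`, its fields, and the choice of `PATH` as the captured variable are assumptions made for illustration. Python already attaches a traceback to raised exceptions, so the explicit capture here mainly stands in for what has to be built by hand in a language that does not.

```python
import os
import platform
import traceback
from concurrent.futures import ThreadPoolExecutor


class DiagnosticError(Exception):
    """Hypothetical exception that records context at the point of failure."""

    def __init__(self, message: str, env_keys=("PATH",)):
        super().__init__(message)
        # Stack of the raising thread, minus this __init__ frame.
        self.stack = traceback.format_stack()[:-1]
        # Only an allow-list of variables: a full dump could leak secrets.
        self.environment = {key: os.environ.get(key) for key in env_keys}
        self.host = platform.node()

    def report(self) -> str:
        """Build a text report that a user could attach to a bug report."""
        env_lines = [f"  {k}={v}" for k, v in self.environment.items()]
        return "\n".join(
            [
                f"error: {self}",
                f"host: {self.host}",
                "environment:",
                *env_lines,
                "stack (most recent call last):",
                "".join(self.stack).rstrip(),
            ]
        )


def worker():
    raise DiagnosticError("file not found: input.dat")


# The failure happens in another thread and is only observed afterwards.
with ThreadPoolExecutor(max_workers=1) as pool:
    error = pool.submit(worker).exception()

print(error.report())
```

By the time the main thread sees the exception, the worker has finished, but the report still says where the failure happened and on which host, which is the information a bare "file not found" message leaves out.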
And I think that that, again, it goes back to this kind of issue of constraints
and how sometimes the shape of development changes over time.
You will, you know, there will be failures. They will need to be reported. How are you going to deal with
that? And that is something that historically people have always said, going back to, I just
highlighted it as the UX people versus the developers before. But actually, I'm going to
say it's all wrong. So let's take a step further back. It's just like, this is in the architecture.
Sometimes a developer will say, oh, I'm going to solve it like this. And they're thinking of it as a programming problem. It's actually not a programming problem, it's architectural. It's about, what's the user experience? What about the customer relations? So it's even the customer experience at that level. And so although a developer sometimes sees, you know, you're close to the code and so you see it as a localized problem: oh, a bad thing might happen, oh, I'll just, you know, either we report this or I just log it or something simple, or I throw an exception, or I just rely on the magic of undefined behavior to figure out whatever this platform is going to do. And maybe that is going to be a segfault with a core dump, or maybe it's going to be a Windows exception or whatever, and it's somebody else's problem. Again, Douglas Adams: having an SEP field around something, it's somebody else's problem, makes it invisible. And each group thinks that. The architect says, oh, that's just error handling, that's a programmer problem. The user experience person goes, well, you know, this isn't really what this is about. Even the product owner is going to start saying, oh, well, that's not really what we're in this for. That's not the value of this product. So it's kind of everybody's problem, which is why sometimes it ends up being nobody's problem, but it becomes much more visible. We see these failures. They are frustrating, but how do people act on them? How enabled do they feel?
And then going back to where you say, if I've got a searchable term, then maybe I find something online.
And as a user, I can understand, oh, okay, other people have this.
It's not just me.
Or this is how I report it.
Or they've thought out a process.
Just send us this file and that's fine, you know, for the more developer-centric thing.
But what you're doing is you're actually enabling people to say, oh, okay, you know, you're not ignoring me. There's nothing worse than: it failed and you're on your own, kid. Oh, wow. You're stuck in the rain. It's dark. It's miserable. And you can't get your work done because whatever has failed. Here, you're at least offering people, like, okay, you've had a problem, but don't worry, we've got this. You know, there's a thing that you can do.
And, you know, we apologize for that and so on. And that's not something that happens by an individual developer having an afterthought.
It's very much kind of a whole team and a whole architecture view rather than a bolt-on and a detail.
Yeah, you have to design for failure and you have to design for debugging. And, you know, if you do that... I think with the languages that stack trace by default on error, it's far simpler for me to fix a problem, because I'm going to say 70, 80% of the time, just from the stack trace, just knowing where in the code, which function the bad thing happened in, is enough for me to know.
And in C++, I don't get that by default.
And if I open up the debugger in C++,
I, by default, don't compile my code
with debugging information on it.
I compile it with O3.
And I have a lot of problems that only show up at O3.
And by default, I open up the debugger
and it doesn't give me the stack trace.
I don't have a way to get that stack trace
and figure out exactly where I am.
I was thinking just, you were talking about Hilton Honors
and I've got a screenshot from a website.
I can't remember what I was doing, but it was Hilton.
And it's a Hilton Honors one.
And it gives me this perfect stack trace. It's a Java stack trace, and you can go through it, you can tell everything about the technology stack that they're using, which curiously enough leads us straight back into the discussion of security, because you can kind of look at this, it's like a manifest of, like, oh, you're building it like that, oh, okay. And if you have any kind of guesses, like, oh, okay, you're using this version, which I happen to know has a vulnerability, then, if you're a bad actor, you can get in on that.
Oh, yes.
Seeing like language or stack version information in like a, you know, a publicly exposed error
from anything that's on the internet is like a little bit scary.
It says something about the security posture of the system. I wonder how many vulnerabilities
have grown out of information that's been found from a Kevlin Henney screen.
Oh, that's a really interesting thought, isn't it? Actually, yeah. Yeah.
Because sometimes you don't even need to know anything about the stack. It's just like, if you get enough information about what type of error has occurred, and you can re-trigger the error again, then you might be able to figure out how to use that to do something evil.
Yeah, something malicious. Oh, I like that. I mean, of course, I don't know, but no, no, I'm not on record for saying that. Damn. No, but yeah, that's the point. By showing the fracture lines in this broken illusion, it allows somebody else to potentially... It's interesting from an academic perspective, it's interesting from a support perspective, but for somebody who is a bad actor, it's kind of like, okay, I know something here that I can take advantage of. And that's the caution that needs to be taken. And again, it reveals why this is an architectural question. You've got to be able to say, this is how we fail. This is how we fail gracefully, with sufficient grace and sufficient information, but without dropping our whole hand on the table and saying, here's what I'm playing. There's got to be this balance, but it has to be intentional. It doesn't happen by accident.
Be sure to check these show notes, either in your podcast app or at adspthepodcast.com, for links to anything we mentioned in today's episode, as well as a link to a GitHub discussion where you can leave thoughts, comments, and questions. Thanks for listening. We hope you enjoyed, and have a great day.
Low quality, high quantity. That is the tagline of our podcast.
It's not the tagline. Our tagline is chaos with sprinkles of information.