Two's Complement - Manual Testing and Observability

Episode Date: January 26, 2021

Matt and Ben discuss whether the city of Portland exists, and decide they don't care. Ben argues that you should test your code manually. Matt talks about when government regulators made him build an ...observable system, and how great it was. Really, it was great!

Transcript
Starting point is 00:00:00 I'm Matt Godbolt and I'm Ben Rady and this is Two's Complement, a programming podcast. Hey Ben, the last few episodes we've been talking about testing, and it occurs to me that we're leaving a lot undiscussed because I think, as we've said before, my intro to testing was all about handing off a video game, partially written, to some poor person who was going to be sitting and playing it for four hours while videotaping in case it went wrong, which is not ideal. But it's all I had for the first sort of decade. I know that you have some opinions about this. Well, so there's nothing immediately or inherently wrong with that. I think that the key thing, and we've talked about this a few times, is finding a way to get some confidence, right? Like, confidence that your code works.
Starting point is 00:00:58 Confidence means the ability or the feeling that you're ready to move forward with whatever the next step is, right? And so there's lots of different ways to get there. And I actually think one of the most essential ways is being able to do manual testing. If you can't put yourself in the seat, in the position, in the shoes of a user who's going to be using your software and use it exactly as they do, you're never really going to be completely sure, what I would maybe refer to as Portland sure, and I'll explain what that means in a minute, that the software actually works.
Starting point is 00:01:36 You can write all the unit tests you want. You can write all the integration tests you want, and those will create confidence, but they won't create surety. There's always the chance that you missed a test, that the software, that there's something that you didn't understand that doesn't work the way you think it works. That's often one of the criticisms of writing tests is that, well, it's never going to catch everything. So why even try? Well, yeah. It's the why even try is the place where that falls down. Right. I know. But yeah, it's like, you know, why wear a seatbelt in your car when...
Starting point is 00:02:06 Yes, right. If you crash, then you're probably... Yeah, I don't really subscribe to that viewpoint, so hard to defend. If I'm going to get hit by a semi-truck, what good is the seatbelt going to do, I guess? I don't know. Right, I suppose that's more it, yeah.
Starting point is 00:02:18 Yeah. But yeah, there's definitely that line of reasoning out there. But it's not an either-or, right? Like, you can do all of these things. You don't have to necessarily choose that I'm going to only do one or only do the other, just like I would never put myself in a situation where I was only writing unit tests, because that's just not going to get you to the point where you can be confident in all situations. And it's not going to get the level of certainty that you need to do things that are where there's millions of
Starting point is 00:02:45 dollars on the line, or lives on the line, or things like that, where you have to be very, very sure that things are going to work. As opposed to, well, you know, this is a web app, we're going to deploy it, and if there's a bug, okay, I'll just deploy it again, it's fine. Right, right. There's a cost associated with there being a mistake, and sometimes that's something that you can sustain as sort of part of your process. It's like, well, we're 95% sure. And that's certainly good enough for me to push an update to one of my hobby websites, but it's certainly not good enough for me to turn on a new trading system that's going to lose me millions of dollars potentially. 95% is not good enough for my day job. Exactly. Exactly. And so building these systems that create confidence, whether it's unit tests or a testing environment or manual tests or all these other things, it depends on all these factors.
Starting point is 00:03:33 You know, Alistair Cockburn was a guy that actually tried to quantify this with his Crystal system. He had all these, like, different dimensions of, like, cost and scale. I don't think I've heard of that. And, yeah, he was way into this. It depends on a lot of things, and one of the things it absolutely depends on is the price of failure, right? So one way that you can make it easier to move forward is not by, like, writing more tests, but by reducing the cost of failure, right? Like structuring things such that if something breaks, it's fine, no one will notice. So things like, uh, blue-green or red-black deploy type things, where you can, well, I start rolling it out and I'll notice, I have good monitoring. So I know if things are going awry, I can sustain like a 0.5% error rate before I decide,
Starting point is 00:04:31 ah, yeah, I'm going to roll this back and we'll lick our wounds and see what we did wrong. And as long as that's acceptable to your business or your use case or whatever. Right, exactly. Or building systems that fail fast, right? Where it's like doing nothing is fine, right? The failure mode of, like, the system does nothing, is completely fine. The thing that's unacceptable is doing the wrong thing. And so if we ever do anything that even smells like the wrong thing, everything just shuts down and turns off and we'll figure out what happened after that.
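A rough sketch of the kind of canary gate being described here; the metrics endpoint, the 0.5% threshold, and the rollback command are illustrative assumptions rather than any particular tool's API:

```python
# Watch a canary rollout and roll back if the error rate goes awry.
# The endpoint, threshold, and deploy script are hypothetical placeholders.
import json
import subprocess
import time
import urllib.request

ERROR_RATE_LIMIT = 0.005  # "I can sustain like a 0.5% error rate"
METRICS_URL = "http://metrics.internal/canary/summary"  # hypothetical endpoint

def error_rate() -> float:
    # Assumes the endpoint returns JSON like {"requests": 12000, "errors": 14}.
    with urllib.request.urlopen(METRICS_URL) as resp:
        stats = json.load(resp)
    return stats["errors"] / max(stats["requests"], 1)

def watch_canary(minutes: int = 15) -> None:
    deadline = time.time() + minutes * 60
    while time.time() < deadline:
        if error_rate() > ERROR_RATE_LIMIT:
            # Things are going awry: roll back and lick our wounds.
            subprocess.run(["./deploy.sh", "rollback"], check=True)  # hypothetical script
            return
        time.sleep(30)
    print("Canary held below the error budget; promoting.")
```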
Starting point is 00:04:56 It just depends on the context and the kind of systems that you're building. But there's no way that you're necessarily... I mean, it's very difficult to get to a state where you are completely sure that something works, where you are Portland sure. What does Portland sure mean? What does Portland sure mean, man? What does Portland sure mean, he asked rhetorically. So I've never been to the city of Portland. It's a wonderful place from what I hear. I actually have a few friends that live there.
Starting point is 00:05:24 But I've never been there. So I don't actually know for sure that Portland, Oregon exists. I'm very confident that it exists, right? Right. It shows up on maps.
Starting point is 00:05:39 Like I said, I have friends there. I talk to these friends. They say they're there. They say they're there. And these are reputable people. Most of them are reputable people. I would ask you to name names. But maybe they're being deceived. Maybe they have some mental illness that I'm unaware of. Maybe they actually don't live in the city of Portland.
Starting point is 00:06:00 They're just outside of it, and they've been told that the nearby city is Portland. So there's all these, like, you know, one-in-a-bajillion possibilities that actually Portland's not really real. But probably it's real, right? For all practical purposes, you treat it as if it exists. Exactly, exactly. And so you can go way down this path and you get into sort of these deep philosophical, like, what is real? And, like, you know, all these, like, Age of Enlightenment theories on, like... Very, very late night conversation after a few beers level discussion rather than... Yeah, exactly.
Starting point is 00:06:36 Like, is the physical world a real place? Are we living in a simulation? But here's the thing. All of those kinds of questions are not useful for engineering, right? Right. Like, that's not a useful engineering question to ask. But it is, you do have to sort of have this level of, like, you can never really be 100% sure about anything, but you can be so sure as to be, you know, assuming that the world around me is real and assuming that the thing that I observe actually is happening and I'm not suffering from some sort of hallucination. There's no Machiavellian aspect to this.
Starting point is 00:07:10 Yeah, then I'm sure. And so that's what Portland sure is to me. It's like... You're as sure of it as you are that the state, sorry, the city of Portland exists in Oregon. Even though I've never been there. Even though you've never been there. There's a lot of, yeah. There's a lot of supporting evidence,
Starting point is 00:07:26 but no actual firsthand experience, just hearsay. Right. So that's Portland sure. Yeah. So how do you get to Portland sure? Well, so Portland surety is this thing of, like, there is a certain level of trust in there.
Starting point is 00:07:39 It's sort of like when I, like programming languages are, are these amazing things and computers are these amazing things that have a level of reliability that is hard to match in other areas. Right. Not impossible. And there's certainly other things that can achieve that sort of level of reliability. You're talking like if you open a file and the file handle comes back non-zero or whatever, then you have a file that works and the operating system works. It's very rare that you have to say, what if the operating system isn't working? Or what if the CPU has a bug? Or what if the RAM is corrupt? Right, exactly. It's not that those things can't happen. You know, there's gamma rays, there's other things. But generally, if you ask a
Starting point is 00:08:18 modern CPU to add two integers, and you're not giving it an invalid instruction that's going to cause like an overflow or something, it's going to correctly give you the value of those two integers. And questioning whether or not that is actually going to happen is about as useful as questioning whether or not the city of Portland exists, right? Like from an engineering perspective,
Starting point is 00:08:36 you know, maybe that's an exercise that you want to do at some point and it is not totally impossible that that couldn't happen. We all have war stories, right? Where we've ended up finding, oh, and it was a bug in the kernel. Right but there's a few in far between yes and and and if you spent all of your time getting that level of confidence where you were like you know checking
Starting point is 00:08:55 these every single one of these things and there's like millions of them right like very few people you are one of the few people that i know that actually dive so deeply down into the inner workings of how computers actually work at like the level of the silicon to be even able to answer these questions, let alone be able to, you know, verify that it really works the way that you think it works. Right. And, you know, it's there's only so many things in the world that you can do that with. Right. Like you can't do that for everything. And so understanding how everything works is just going to – it's not practical from an engineering standpoint.
Starting point is 00:09:33 So my point here is when I say I'm Portland sure about something, it's: I'm as sure about this as I am about the city of Portland. It means I've dug down to the necessary levels of abstraction, developed that inner sense of mind that has seen those kernel bugs and has seen, you know, all those sort of one-off errors that happen sometimes. Let's just assume it isn't a broken operating system right now. It's more likely to be the threading code we just added. Right, right. Exactly.
Starting point is 00:09:57 So, like, developing that is really important. And so, like, one of the ways you can do that is with automated tests. And this is, when you say developing that, you mean developing the faith in the system. The matching of the faith with the risk. That sort of cost of failure that we were just talking about. Has my level of confidence risen to the point where this is now safe enough to move forward? Can I drive through that intersection with the green light with enough confidence that I'm not going to get hit by a truck? The cost of failure there is really high. So your confidence needs to be high, right?
Starting point is 00:10:28 If it's, you know, walking out onto the sidewalk, it's like, well, you know, if a bicycle is coming along and they hit me, that's not going to be the end of the world. So I'm just going to keep my AirPods in my ears and keep walking, right? Like those are different levels of failure cost. The trade-off between the certainty that you're right and the cost of being wrong. There's a sort of... Yeah, yeah. So you're developing... So I interrupted, you were talking about, like, unit tests are just one part of developing a sense of an appropriate level of confidence that your code is correct. But what other things can there be? I mean, obviously we've just talked about, we started with, manual testing. That is an obvious thing that I would do: if I have just made a change to a piece of code,
Starting point is 00:11:08 then I'm going to run it. Maybe I'm going to step through it in the debugger. Maybe just, you know, go through and see line by line, is it doing broadly what I would expect it to do under the test circumstances that I have created for it? If it's a web app, I'll load it up in the browser and I'll look at the JavaScript error console and I'll click on a few things that I know are problematic, and just develop a little bit of a sense of: is it okay? The thing about that is it's hard to communicate to other people. Like, I work on a hobby project which is web-based. I know the things that I randomly click on that
Starting point is 00:11:40 have gone wrong in the past, and I've written down a few of them, but we don't have, I don't have, that nice sense of a safety net of an automated version of it. I've tried to create one of those, and it was difficult to make and hard to keep up to date, and ultimately I don't think it gave me the surety that I was expecting compared to the pain of keeping it up to date. But it was intersubjective. I could say to other people who were working on the same code base, hey, deploy to the staging environment,
Starting point is 00:12:12 run these tests against the staging environment, and then you're pretty sure that the staging environment is going to work when we promote it to production. But ultimately, those have atrophied. And I think really it comes back to your original thought: the cost of me getting it wrong is egg on my face, not lost business, not lost, right, uh, revenue, not trust really going away.
Starting point is 00:12:32 So I can afford to make the odd mistake in my particular case. But maybe if you are, you know, if you're, um, a government website, you do need to be up all the time. Or if you're... I mean, have you seen many government websites? They're not really all that great. That's... yeah, okay, I was trying to think of something for which, you know, the certainty of it being up was important, and, yeah, government sprang to mind. But your point is valid. Yeah, well, you know, trading systems are a good one. I mean, embedded devices are another one, and, you know, maybe we'll talk about that at some point, where it's like, yeah, you have to be certain because you're not going to get a chance to change it, right? Like, you're
Starting point is 00:13:06 going to upload this firmware onto devices that are not going to be connected to the internet, because, you know, do you really want your pacemaker connected to the internet? Right. And I mean medical things in general. Also, I mean, if you want your Portland surety indicator to be as high as it's going to be, it is definitely in the firmware for the defibrillator that's on the wall saying... Absolutely. And aerospace, there's definitely these kinds of situations in aerospace. I mean, there's lots of domains where it's like there's either lives at stake or there's significant amounts of money at stake. And so it's really important to get things right. And I mean, to your example here about, I have these manual steps that I go through and I tell people to do that.
Starting point is 00:13:45 I mean, I think we would all recognize that the best way to do that is to try to find ways to automate that in a way that is scalable, right? Where you're not writing really slow running integration tests and having like hundreds of them that are kind of brittle. But at the same time, like you don't want to have the readme with the manual set of steps that goes, here's what you do to check this. But I will say, I do think the ability to test things manually is incredibly important. And I personally, as much as I'm like the testing guy, the automated testing guy, I don't ever, well, maybe not ever, ever is a strong word, but I very, very, very rarely, and I have the one counterexample to this actually, I very, very rarely make a change to a piece of software where I haven't gone through and used that software as the user
Starting point is 00:14:37 would, right? So if I'm trying to add a feature to a system, usually even beforehand, I'm like trying it out and trying to reproduce like, oh, I can't do this. Then I go try to do that for myself. And I say, oh, well, that is kind of painful. Maybe we need to add some functionality here. Right. And then I will drive that behavior out with tests and ensure that my tests are, you know, have all the nice attributes that we've talked about where they're fast, they're reliable, they're informative. They help guide me toward a design that is testable and therefore, you know, maybe a little bit more decoupled and all these nice properties.
Starting point is 00:15:06 But then once I'm done writing those tests, I go back and I use the software manually and I put myself in the shoes of the user that I'm building this thing for, the users that I'm building this thing for, and I try to use it just as they would. And if I find that difficult to do, because for example, I don't have access to the production data that they have, or I don't have an environment that's realistic or I have a device that's different, I solve that problem. I go and I get the data or I change the software so that I can connect to a production environment in a safe way and use the data. That's a trick. I mean, read-only access to a production database that's like a mirror of your prod database is a great technique for this.
Starting point is 00:15:45 There's lots of other techniques for this. Right. But being able to use the software as your users are using it, that's how you find the missing tests, right? Those unit tests that you didn't realize you needed to write. That's how you find them. But you should only ever do that once. The purpose of that exercise is to give yourself that sort of Portland surety that when you go and you tell some user, whether it's directly face-to-face or with an email marketing broadcast that says, hey, check out our cool new features, you are really sure that that stuff works because you've seen it work, right? But you only ever want to do that once. You take that knowledge that you learned by doing that and figuring out, oh, actually, this doesn't work in this case.
Starting point is 00:16:30 And you go back and encode that into the tests so that not every person that comes after you has to follow those manual steps again. You've taken that confidence and you've put it intersubjectively into the code. So now everyone can share your confidence because you've kind of put it into the tests, right? Right. You've seen it work for yourself and you've recorded that in the tests. So the one situation where I will usually not do any sort of manual testing when I'm fixing a bug specifically
Starting point is 00:16:52 is when I have a stack trace that shows exactly what the bug was, and the stack trace has some unique elements in it, right? Like it's hitting a piece of code that is not often traveled or is pretty deep and can show like a particular path. And if I can write a unit test that completely reproduces that stack trace to where it's almost or exactly identical, that usually gives me enough confidence to then fix the bug and make the test pass and then just commit and deploy, and not actually have to reproduce the bug manually first and then fix it and then go try again and confirm that I've fixed the bug.
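A toy sketch of that idea, with entirely hypothetical module and function names: the test walks the same rarely-travelled path the production stack trace showed, fails with essentially that traceback before the fix, and passes after it:

```python
# Hypothetical example: a production stack trace showed
#   KeyError: 'currency'
# raised in pricing/quotes.py, normalize_quote(), reached via
# handle_order() -> price_order() -> normalize_quote().
# Before the fix this test fails with essentially that same traceback;
# after the fix it passes. Module, functions, and data are invented.
from pricing import quotes  # hypothetical module under test

def test_order_with_missing_currency_defaults_to_usd():
    raw_order = {"symbol": "GOOG", "qty": 100}  # no "currency" key, as in the bug report
    priced = quotes.handle_order(raw_order)
    assert priced["currency"] == "USD"
```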
Starting point is 00:17:28 Because usually those stack traces, depending on exactly what they are and what path they're taking through the code, if you can reproduce it, it's a pretty good indication. It does sound slightly pipe-dreamy for some of the things that I'm involved with, um, just because of the number of moving parts. So the pipe-dreamy thing is interesting, right? So I think you have to
Starting point is 00:17:57 address those things as they come up. And I think that, you know, part of the skill of writing these kinds of tests is starting with the assumption that, given enough effort, this is almost always possible, and then sort of backing off from there and finding the right level of effort to put into it. Because the key thing is, you want to be able to maintain the sense of confidence among yourself and your team that, whatever your process is, whether it's run a CI environment with a whole bunch of unit tests and then do some limited amount of integration or manual testing or whatever it is, if you follow the process, you will achieve a result that is good enough to move forward. So that's not complete certainty of no failures.
Starting point is 00:18:54 Right. But it's, given our environment and given our risk tolerance and given our failure costs, if you follow the process, you achieve the right level of risk, right? Mm-hmm. Okay. And if you're finding that you can't, that after you're done following the process, you're like, maybe I'll check a few more things, right? You need to listen to that and say, okay, the thing that I should do is, okay, for this immediate thing that I need to do, maybe check a
Starting point is 00:19:22 few more things. But then, right after that, I need to make some improvements to our process, whether it's writing more tests or, you know, one way that you can address some of this is by adding observability to your systems, right? Maybe I can't write the unit tests that tell me, for this huge production environment with thousands of servers that, if I were to replicate it, would double my AWS costs, and I don't really want to deal with that right now, right? I'm sure you can relate to this. That's exactly where I'm, yeah, all these things are coming from that sort of sense. Yeah. So if you can't get that level of confidence just from unit tests because of the nature and the
Starting point is 00:19:58 cost of your environment, another way to get that level of confidence is through observability. So maybe you deploy this new thing, and you have some, and we can talk about lots of different ways to get observability, but like one would be adding in structured logging, right? So you have a special structured log that all of your applications write to that lets you gather certain metrics about what the software is doing, how it's behaving. You know, maybe it's error rates. Maybe it's like, oh, I know that there's a queue over here that takes these incoming messages. And I just potentially change the number of incoming messages. So what I really want is an ability to see the size of that queue as I roll this thing out. And as it starts propagating to
Starting point is 00:20:42 all these different servers, is that queue starting to grow? And if it is, I'm gonna roll this back. So that implies a whole bunch of things. It does. The ability to roll back. Yes, I was gonna say, that was the first thing that came to me, because, like, oftentimes by the time you've hit the big red button, maybe it's not so easy to unhit the big red button. Right, right, right. So this gets back to this whole thing of, like, in order to move forward you have to have confidence, you have to reach that level of safety. One way to get that is writing tests. Another way to get that is to make it real easy to roll back. So you put in the effort to build those systems that are easy to roll back so that, okay, well, our unit test coverage in this area isn't great. I'm going to add a little bit of observability
Starting point is 00:21:16 here. I'm going to mix that in with a little bit of rollback magic. And so I can get that confidence to deploy. I just, yep, push deploy. We're good because I can see very quickly if this doesn't work, I can undo it and I have confidence that it's going to undo properly. That's kind of a third dimension to the sort of confidence versus cost of getting it wrong. Maybe it's sort of related to the cost of getting it wrong. And that is how long it can be wrong for before the severity kicks in. Now, like if you're doing a database migration, like a huge new change to the way things are stored,
Starting point is 00:21:53 maybe it's very, very expensive to go back, because you've now created a ton of records with the new format that you can't undo. So it's hard to roll back. So there you perhaps have to account for it by having extra testing, even more confidence in the system, before you roll it out. But if you're moving a widget around on the UI, right, the cost of rolling back is... the cost of getting it wrong was also that same noise.
Starting point is 00:22:17 Uh, it's de minimis. Yeah, I didn't think I could do it again, so I wasn't going to try. It's not such an expensive thing. Um, yeah. So for me, when I'm doing my staging rollouts of my funny little hobby, um, that's mostly because I can't 100% trust the rollout process. And so if I've broken something by renaming a directory somewhere on some AWS thing, again, because the cost of keeping everything up and running in two parallel systems is high, I push to a staging thing, and then I guess what I'm looking for is: does it start responding to requests, and does the page open up? Okay, cool, that probably means I can take that exact version and push it to production without any incident. So in a sort of funny way, maybe that's observability:
Starting point is 00:22:59 will this deploy succeed in at least one instance that looks very, very close to the real production? Yes? Then okay, now it's good to go. And then I can roll back, because it's a symlink I change, right, to go back to an older version. So I'm lucky that I have that in my case, and again, the cost is very, very low if I get it wrong. But in, like, as we say, an expensive database system or a financial system, where even if you can turn it off within 10 seconds of having observed it doing something wrong, you could still easily have lost millions of dollars, then you do need a different approach. But yeah, observability is a useful trait in and of itself, as much as, you know, having metrics and dashboards and counters so that while you are rolling out your software you get the nice warm
Starting point is 00:23:41 glow of seeing the queue length either increase, because, you know, you're now putting more things in the queue and that's good and that's what you wanted, or decrease, because you've sped up the calculation or whatever. Those things are good for, as a human, to sit and watch and kind of enjoy the expected results of your change becoming visible to you, both in a pretty graph format, but also in terms of something you can look at and debug later if it turns out to not be the case that you wanted. But observability is useful for a number of reasons other than debugging later on. Like what, what else went wrong? Building observability into your application has always been a good thing for me. Like, you mentioned structured logging. What kind of thing are you
Starting point is 00:24:18 thinking about when you say a structured log? A lot of times we get into these situations where we just use whatever logging framework is available to us when we're building stuff. And we write out these sort of human-readable, it's like timestamp, log level, subsystem name, and then a message. And we've all been there and done the grep to find the metric that you didn't actually push to Prometheus or whatever.
Starting point is 00:24:43 It's like, well, okay, we can infer it by grepping with this thing and this regular expression, and... right. Half of my bash-fu comes from just, you know, being forced into situations where I have to do, you know, a grep, sort, uniq, blah, blah, blah, to figure out what things are doing. And so structured logging is an approach that says, maybe we should try to do these things on purpose instead of by accident, right? Because we know this is going to happen. And I mean,
Starting point is 00:25:09 this is something that happens a lot. And I think one of the problems, we have these discussions in the break room, and we have these meetings, and we have these podcasts where we talk about, wouldn't it be great if we had this? And everyone's like, oh, yeah, that's great, but if I'm doing that, then I'm not doing something else. And that's a valid concern. But what inevitably happens is that we wind up needing these things, and it's the accidental observability moments that we end up relying on. How many times have you and I cracked open Wireshark to see what two services that are talking to each other are doing? And that was an absolutely essential life-saving move. Well, maybe not life-saving, but money-saving. Money, yeah. And how terrible would it have been if we hadn't been able
Starting point is 00:25:49 to do that? And the reason that we were using Wireshark instead of something else is that's all we had, only by virtue of communicating over the network, where we were provided this tunnel into what our systems are doing, and the good fortune of having expensive recording machines for other purposes that happened to be capturing all the packets anyway. Often, yeah. Well, that's lucky that we have this sort of trace knocking around. But I mean, also, how often have you run strace or some of the other applications because you're like, well, I don't have the observability that I need to be able to understand what's going on in this situation, and thankfully I can reproduce it enough.
Starting point is 00:26:25 And the best thing I can do is strace the process and hope to heck that the problem happens and we can see whatever file descriptor it's hanging on, or what's going on in that respect. Exactly. Exactly. So those are the things that we sort of, the moments of observability, the ability to do this, that we sort of stumble into just by the fact of the environment that we're running in. And so structured logging, I think, is one example where you could say, like, no, no, no, what if we actually did this intentionally, right? What if we built the system with these needs in mind that we know we have?
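A small sketch of what doing it on purpose might look like: one JSON object per line with consistent, machine-readable fields, rather than free-text messages you grep and regex apart later. The field names and logger setup here are illustrative assumptions, not a standard:

```python
# Emit one JSON object per log line with consistent fields, so a dashboard or
# alert can aggregate on them without any regex archaeology.
import json
import logging
import sys
import time

class JsonLineFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": time.time(),
            "level": record.levelname,
            "subsystem": record.name,
            "event": record.getMessage(),
        }
        # Extra structured fields attached via `extra={"fields": {...}}`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonLineFormatter())
log = logging.getLogger("order-gateway")  # hypothetical subsystem name
log.addHandler(handler)
log.setLevel(logging.INFO)

# Later, queue_depth can be charted or alerted on directly.
log.info("enqueued", extra={"fields": {"queue": "inbound", "queue_depth": 412}})
```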
Starting point is 00:26:55 We know we're going to need this stuff, right? It's just a question of what tools we have at our disposal to get it. And I think one of the things that can happen if you do that, that is more difficult if you don't do it intentionally, is that that observability matures and morphs into something that lets you now take automated action based on it. So it's one thing to grep through a log and see something that's happening. It's another thing to send a stat to StatsD or Prometheus or some other thing to see a pretty chart. But only when you get to a point of maturity where you can be confident that the system should behave in certain ways and confident
Starting point is 00:27:37 that it shouldn't behave in other ways, can you start doing things like, oh, I know that I don't even need to trigger a rollback if there's a problem. I can just push this out and I have enough experience with the tools that I built for observability and I have enough history now to be able to say with confidence, if this queue size exceeds this, there's something real wrong and it just should roll back automatically. Right? I see. But to get there, you have to progress through the stages of adding the observability in the first place, having it be in a format that is easily consumable by all the environments that you need and all the
Starting point is 00:28:12 ways that you need, and then establishing that sort of pattern of what normal looks like for your application and understanding what the failure modes look like. And a lot of this comes from not only adding that observability, but also doing, like, chaos monkey things to simulate failure and understand what your failure modes are. Having the recording in place so that when those sort of, you know, gamma ray moments happen and things break in strange ways, you have recorded it and you can go back and be like, oh, well, this was really interesting, because this failed like this. So you have to sort of have that hard-fought history to be able to get to that point. But once you do, and you know what normal looks like and you know what abnormal looks like, then you can start automating some of these things. I think actually one of the problems
Starting point is 00:28:53 that people run into when they start hearing about this, and, like, observability, it's like, oh yeah, I'm gonna add all the structured logging and all these stats, is they try to jump right to the automation. Right, I see. They're like, oh yeah, without passing go, without going around a couple of times and saying, I think I see how this is going to fit together now. Yeah. They start making assumptions about how they think the system should behave instead of observing how it actually does behave.
Starting point is 00:29:16 And then what happens is that you get a whole bunch of other failures that happen on top of it, whether it's like a ton of alert spam. It's like, oh, the queue size exceeded the 10,000. It's like, yeah, actually it does that all the time, it's just that it's every Monday morning when some other exogenous event happens, and you had to let it run a couple of weeks
Starting point is 00:29:29 to just notice that that's normal. Exactly, exactly, exactly. So you have to sort of progress through that whole process of, like, you know, get those tools for observability in place, observe them, figure out what normal looks like, figure out what real is. And then you can get to the point of actually taking automated action on it.
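A minimal sketch of that progression, with made-up numbers: derive the threshold for automated action from recorded history rather than guessing it up front. The metric source and rollback hook are hypothetical:

```python
# Only automate once "normal" has been learned from observation. The history
# here is made up; in practice it would come from weeks of structured logs,
# including those noisy Monday mornings.
import statistics
import subprocess

def queue_depth_history() -> list[int]:
    return [120, 95, 400, 150, 4800, 130, 110, 5200, 140]  # illustrative samples

def rollback() -> None:
    subprocess.run(["./deploy.sh", "rollback"], check=True)  # hypothetical script

def check(current_depth: int) -> None:
    history = queue_depth_history()
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    limit = mean + 3 * stdev  # learned from history, not assumed up front
    if current_depth > limit:
        rollback()
```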
Starting point is 00:29:54 But when you get to that point, now you've created this wonderful safety net where you can go real fast, because it's like, yeah, there's all these different ways that we just can't break things, because if we make certain mistakes, the system will recover in a safe way. Now obviously this can't apply to everything. You know, we talked about embedded systems earlier, and I think, you know, we're pretty sure that that's not the case there. But what we're talking about here, sort of traditional server models, certainly falls into this category, where you can say, hey, I got a request, I did a whole bunch of things to it, and as a result of it here's the response I posted. And I can measure queue sizes, or how long each function took, or whatever is a useful piece of information throughout the processing of a request, and then probably aggregating that over many instances in today's
Starting point is 00:30:39 sort of modern server infrastructure, and then sort of having an alerting and monitoring system that sits a level above that and is configured to look at the queue size in aggregate, or the average queue size, or the minimum to maximum queue size, or things like that. But observability can give you more than that kind of, uh, alerting and monitoring. I mean, I talked about it a little bit with the idea of using it as a debugging tool. You know, we talked about strace and stuff, but I've had some experiences with something similar, which was not aggregated at like a server-to-server level. It wasn't alerted on server to server, but it was recorded and kept, because I was working on a trading system where, quite reasonably, five or six years later we might get a call from a regulator saying, hey, this trade you did in 2015, why did you send this specific trade? What was special about this trade?
Starting point is 00:31:36 And we have to be able to answer the question of, like, why did we buy 100 Google shares then? And one of the ways that we developed a system to answer those kinds of questions was essentially the kind of thing you've been talking about: observability. We had a trace from every single piece of the software. As an event flowed through it, everything annotated, like, a single message with, hey, I made this call because of these things. And they were all referenced and cross-referenced with numbers and timestamps and things, and then we would write that out to disk. And it was written specifically for answering these kinds of questions for regulators, but it was the most useful thing in the world for ourselves, just to pick over the corpse of a problem trade or a crash that we'd had. Well, we've got every piece of information we needed: oh, we just processed a batch that had 300 things in it.
Starting point is 00:32:27 That's higher than we've ever processed before. Maybe that leads to the crash. And we were able to do that in a real time, as in, you know, microsecond level trading system. So there's kind of no excuse. Well, that's not true. There's always an excuse to not expend engineering effort, but like it can be done.
Starting point is 00:32:43 You can make it a high enough priority to keep track of the decisions you're making and gather observability, even when you're worrying about microseconds in terms of latency. So that for me was super useful. And now I kind of look for that level of observability in almost everything that I come to.
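A simplified sketch of that kind of decision trace; the stages, field names, and order flow are invented for illustration, but the idea is that every component appends a record explaining what it decided and why, all tied together by a shared event ID and timestamp:

```python
# Append-only decision trace: each stage records what it did and the inputs
# that drove it, cross-referenced by a shared event ID, so "why did we buy
# 100 Google shares in 2015?" is answerable years later. Everything here is
# a hypothetical illustration, not a real trading system's schema.
import json
import time
import uuid

TRACE_FILE = "decisions.jsonl"  # hypothetical on-disk trace

def record(event_id: str, stage: str, decision: str, because: dict) -> None:
    entry = {
        "event_id": event_id,   # same ID follows the event through every stage
        "ts_ns": time.time_ns(),
        "stage": stage,
        "decision": decision,
        "because": because,     # the inputs that drove this call
    }
    with open(TRACE_FILE, "a") as f:
        f.write(json.dumps(entry) + "\n")

event_id = str(uuid.uuid4())
record(event_id, "signal", "buy", {"symbol": "GOOG", "fair_value": 710.2, "bid": 708.0})
record(event_id, "risk", "approved", {"position_limit": 500, "current_position": 120})
record(event_id, "execution", "sent_order", {"qty": 100, "px": 708.5})
```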
Starting point is 00:33:03 And very few things have that, but, you know, very few things also have this sort of very straight, one piece of information comes in, calculations happen, one outcome comes out the other side. We're not always in that sort of position. But yeah, observability is fascinating, and I hadn't really thought of it in terms of, um, taking it to the alert level before, where the very fact that you can develop surety about what your system does can give you the faith in your system, not in your system, but the faith in the deployment of the system: you will quickly know whether you got it right or not, and then you can roll it back. And those are all kind of interlinked in terms of
Starting point is 00:33:40 the confidence of being able to move forward. Right, right, right. There's some deep relationships there, for sure. And I mean, you kind of were in the interesting position of being forced to build that observability into your system for a regulatory reason. But once you got it, it was like, you can never take this away from me, right? Like it was such a wonderful thing. It's like almost anyone who knows how to use strace
Starting point is 00:34:00 beyond, like, man strace and looking at the first thing, now suddenly you have a new thing in your arsenal for debugging almost everything. Like, the first thing you do: well, this is weird, I would just run strace on it. And having that level of observability and that level of information, I guess, to your point earlier, similarly Wireshark, right? Once you've worked out Wireshark, it's amazing how many things I solve with Wireshark these days. Oh, why can't I see this thing on my home network? Oh, I know, I'll just run Wireshark. And it's like, that's crazy, why are you using this tool to do that? Well, because it answers
Starting point is 00:34:34 the question. Yes, no one gave me this information, but they're telling me, just not explicitly, and I can find that information if I use the right tool. How much better could the tool be if it was built into the applications that actually have the information that you want, and it's exposed? Yeah. Not digging on Wireshark. I mean, sometimes one of the ways that you can get this observability is like, well, we're going to send this data over the network, right? And then we'll be able to see it in the captures and then we'll know what it was, right? And we'll build a parser for it. Like sometimes you do that on purpose. Yeah. Right.
Starting point is 00:35:11 I mean, we've had situations before where we've had just a ping running and used that as, well, this means that there's connectivity going between these machines, we can see it in the capture. This kind of gives us a yardstick to measure stuff by, or a timestamp of the last known good connection. So, yeah, there's definitely the Wireshark hammer, there's the strace hammer, there's even the SystemTap hammer, which I've used twice to amazing effect and found kernel bugs, which speaks to your point earlier about whether the kernel can be trusted or not. But they do give the wielder of that hammer maybe a little bit too much confidence that it's the right hammer to use for all problems, so, you know, we should all acknowledge that right now. But they
Starting point is 00:35:49 are useful. Yeah, useful tools. But I mean, it's a different thing, I feel like, when you build these kinds of capabilities intentionally, rather than being forced into doing them by a regulator or sort of falling accidentally into them by the nature of you sending data over a network or whatever it may be. I can't prove this. I'm not Portland sure of this, but I get the feeling, based on my experience, you will get to that sort of magical state of having full understanding of the normality of your system sooner, where you can start automating these things, automating rollbacks, automating deployments, automating alerting in a way that isn't spammy, if you do it intentionally, and if you kind of do it, especially from the start when you can.
Starting point is 00:36:35 And then the other thing that I kind of wonder about on this topic is, you know, I was saying before, we want to put ourselves in the seat of a user that's using our software, right? And, you know, we've been talking for the last few minutes now about observability. I do kind of wonder if there's maybe a little benefit here, like the one you get from test-driven development where it naturally makes your design better. I wonder if there is sort of this natural thing where, like, adding observability to a system makes it easier for you to put yourself in the seat of a user. Because a lot of times, if you want to reproduce what a user has done, you kind of need to, like, rearrange the matrix, so to speak, right? Like, you need to put this data over here, and you need to have this thing be in this state, and you need to be able to sort of manipulate the environment that you're in, um, to reproduce what the user was doing at the time. And I have to wonder if that sort of ability to reach into the different parts of your system and at least see what's going on, if not control what's going on, is very related to the ability to observe it. And so, like, a dumb example of this would be, I can clone the production database to my local workstation using a read-only account and a read-only password.
Starting point is 00:38:05 Maybe it's even from a mirror of the production database that has yesterday's data in it. And I can clone that out to my local workstation. And we have whatever procedures are in place and everybody's comfortable
Starting point is 00:38:17 with the idea that I can do this. There's no sensitive data in there. It's anonymized or whatever. Yeah, yeah. Some magical... Yeah, it's been scrubbed if that's what needs to happen. You know, minimum amount of scrubbing necessary, but we're
Starting point is 00:38:28 making sure that there's no personally identifiable information or whatever. Whatever constraints are your problem, I can do it, right? Yep. I can then go in and I can monkey with that data to experiment and figure out what I think happened to this one user one time that caused this stack trace to happen that is now sitting in my JIRA ticket or
Starting point is 00:38:43 whatever it is, saying like, hey, we got this bug. I can then mess around with that data until I can use it and reproduce the stack trace. And if that stack trace matches exactly, I am very confident that I have now reproduced this bug. And that means that I know what data was in the database at the time that caused it. And that means that when I fix it, I can go back and follow that same series of steps. And if I don't see the stack trace, if it doesn't error, I can be very confident that I've fixed it.
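A sketch of that workflow, assuming a hypothetical read-only mirror, placeholder credentials and table names, and that the mirror has already been scrubbed of anything sensitive:

```python
# Pull a slice of data from a read-only mirror of production into a local
# SQLite file, so the bug from the ticket can be reproduced and poked at
# safely. Host, credentials, and schema are placeholders; assumes psycopg2.
import sqlite3
import psycopg2

mirror = psycopg2.connect(
    host="prod-mirror.internal", dbname="app", user="readonly", password="example"
)
local = sqlite3.connect("repro.db")
local.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, user_id INTEGER, payload TEXT)")

with mirror.cursor() as cur:
    # Only the rows relevant to the affected user, from yesterday's mirror.
    cur.execute("SELECT id, user_id, payload FROM orders WHERE user_id = %s", (31337,))
    local.executemany("INSERT INTO orders VALUES (?, ?, ?)", cur.fetchall())

local.commit()
# Now monkey with repro.db locally until the stack trace from the ticket appears.
```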
Starting point is 00:39:10 Yeah. The ability to do that, I feel like, is not all that different from the kind of things that you would normally build into most systems to make them more observable, right? Because the whole, like, what about PII? We can't, like, take all this data and just shove it into this whole system where anybody can see it, because then all the personally identifiable information, using acronyms there. Yes, thank you for clarifying. Yes. All the personally identifiable information will leak out, and you can't have that, from a regulatory problem or from a legal thing. Okay, well, we're going to have to solve that problem anyway, because we need it to be observable and we need it to be manually testable.
Starting point is 00:39:43 We need our developers to be able to put themselves in the shoes of a person who's using the system. So again, this is something that I'm less sure about. I'm very sure that there's this overlap between testing and good design. I'm less sure about this, but I'm starting to wonder if there is a little bit of, like, once you start adding in these hooks, and once you start sort of thinking in this way, where systems are decoupled enough and the connections between them have this observability property to them, or they have this sort of capturable property to them, this savable, storable property, once you start building things that way, if you don't wind up with a system that is naturally more observable
Starting point is 00:40:25 because the engineers have to be able to reach into various parts and tweak things. That's an interesting point, and probably a good point for us to stop here, because I'd want to think about that some more. That's a good one to think about, for sure.
Starting point is 00:40:41 You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt. Find the show transcript and notes at twoscomplement.org. Contact us on Twitter at twoscp, that's at T-W-O-S-C-P. Theme music by Inverse Phase, inversephase.com.
