Two's Complement - Observable Metrics

Episode Date: April 10, 2026

Matt and Ben explore the intersection of testing, metrics, and observability in performance-critical code. They debate push vs pull metric systems, share war stories from financial trading systems, and ponder what to do when your program can't tell anyone it's in trouble.

Transcript
Starting point is 00:00:00 I'm Matt Godbolt. And I'm Ben Rady. And this is Two's Complement, a programming podcast. Hey Ben! Hey Matt. I am not convinced my levels are right here, so I apologize if the audio is awful on this. Something isn't going well here. You think it's too quiet? I think it's too quiet. I mean, how does it sound to you?
Starting point is 00:00:33 It sounds good to me. All right, but I've turned the gain up here and it looks really... Okay, well, we'll go with that. Anyway, hi! How are you doing, friend? This is a catastrophe for, for, for editing Matt, as we were just, uh, during the, the opening jingle, we were, we were taking the Mickey out of editing Matt because he's a jerk, um, but that's fine. Cause he's not here. Yeah. You'll find that yesterday Ben is a jerk and tomorrow Ben is the most
Starting point is 00:01:00 responsible person on the face of the earth. He's going to take care of everything. Cause I'm not, yeah, you know, that's, yeah, that's a hundred percent how I see the world too. Yeah. So we, we, we had started this as we often do, by chatting in a Google Meet before this. And, and then we were like, let's just record, let's do it.
Starting point is 00:01:21 And then we had to switch out and then unfortunately, because we're now using a separate recording system that involves fiddling around with web browser settings and plugging in different microphones, everything. And now I can't even remember what it was we were talking about. So when I just cut over to this, I was of course in the middle of writing a test
Starting point is 00:01:40 because what else would I be doing? Of course. And I was saying how one of the things that I love to do is test my metrics. That's right. And you had made a very interesting point about performance sensitive code and this technique. So do you recall what you said?
Starting point is 00:01:57 I do now. Thank you for being the responsible adult and remembering from one minute to the next what on earth is going on. Tomorrow Ben has manifested today. Yeah, it seems so. It seems so. So yeah, you had said that you were writing tests for the metrics in your code. And that's a fantastic thing to do, because if you're relying on those metrics, you
Starting point is 00:02:16 probably want to make sure they're right. And then I'd said for some areas of performance code, so certainly in C++ land, sometimes there isn't a seam for you to put, uh, like a, an interface or some kind of like, um, uh, a measuring point in your, uh, your regular code to write a test around. So like the classic example I can think of is if you've got a high performance, like cache of information, that is like a software cache, just to be clear, then you want to be able to test whether or not
Starting point is 00:02:49 you hit the cache or not. But that's not... the cache is there to be transparent, right? It's either there or it's not there or whatever, or, you know, it fetches a value or maybe, maybe it's more like a memoization cache. So, you know, it either computes a value, or it returns the value that you did before, and the caller doesn't need to care about that, and you don't want to have to break your interface just to write a test, and you don't want to have to pass in a listener class that says, hey, on cache result, because that's, you know... it's not good for the design of your system
Starting point is 00:03:20 It's maybe not good for the performance, but you probably do want metrics about how often your cache is being hit. And if you're writing performance code, you've probably written a relatively performant metric system. Right. Yeah. And then it becomes a natural way of measuring whether your code is doing the things that you're wanting. Am I getting a cache hit?
Starting point is 00:03:41 Am I getting a cache miss by looking at the metrics? Yes. And so that was where we were when we said, let's record this. And now that's the end of the podcast. Thank you for listening everybody. Thanks everybody. Outro music plays. No, and I mean, I think that this is a great specific example of a general thing, which
Starting point is 00:04:01 I think we've talked about on the podcast before of, of what does it mean to have testable code? What does it mean to have code that is testable? And there is a sort of a premise that is baked into all of this and it's woven into like test-driven development and a bunch of other things, which is if you build software that has a nice interface, for some definition of nice, it will be easy to test. And writing the tests helps you create that nice interface. And this is a specific example of that, because the thing that's nice about this
Starting point is 00:04:32 is the observability. We want to have code that is observable. We want to have code where we can know what it's doing. And the tests in this case are giving you a very specific thing of, I need to know if I hit the cache. You need to know that for the test. And you need to know that when you're running your software, and that is the same problem, it's the exact same problem. And so the tests are there to do both of those things.
Starting point is 00:04:54 Yeah, but I suppose specifically in this instance, like a not totally unreasonable API to your whatever this cache thing is, is that you return like a tuple of the thing that you got out of the cache and some status object that said, did I get it from the cache? Was it a hit? Was it a miss? Sure. Yeah, that's a reasonable interface, in which case... Well, that's my point, right?
Starting point is 00:05:21 That is a reasonable way to write this, uh, system, but very specifically, oftentimes, if you're like in a very high performance kind of piece of code, by writing the interface that way, you have pessimized the case where you don't care if it was in the cache or not, which is the very common case. And you're forcing everything through this. But the metrics are something that are sort of a side channel that are something you still care about and are performant. And you're now using them to test the inner workings of something that should be transparent, and you in fact want it to be transparent.
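To make the cache example concrete, here's a minimal C++ sketch, not taken from the episode and with entirely made-up names (Metrics, MemoCache): the caller-facing get() stays transparent, and the only test seam is the metrics object the cache bumps.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical metrics object: in a real performance-sensitive system this
// would be whatever fast counter mechanism you already have.
struct Metrics {
  std::uint64_t cache_hits = 0;
  std::uint64_t cache_misses = 0;
};

class MemoCache {
public:
  MemoCache(Metrics &metrics, std::function<int(int)> compute)
      : metrics_(metrics), compute_(std::move(compute)) {}

  // Transparent to the caller: no tuple, no status object, just the value.
  int get(int key) {
    auto it = values_.find(key);
    if (it != values_.end()) {
      ++metrics_.cache_hits;  // side channel the test (and the operator) can see
      return it->second;
    }
    ++metrics_.cache_misses;
    return values_[key] = compute_(key);
  }

private:
  Metrics &metrics_;
  std::function<int(int)> compute_;
  std::unordered_map<int, int> values_;
};

int main() {
  Metrics metrics;
  MemoCache cache(metrics, [](int x) { return x * x; });

  cache.get(3);                        // first lookup must miss
  assert(metrics.cache_misses == 1);
  cache.get(3);                        // second lookup must hit
  assert(metrics.cache_hits == 1);
}
```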
Starting point is 00:05:57 And I think there's sort of a, maybe you're saying those are the same things. Um, I've just found it as being, uh, an interesting way of saying like there's some internal workings of a class that I would like to be able to test, but I don't really want to expose it to the outside world. And I can't expose it to the outside world either, because the performance characteristics would be different, but, well, I can't directly expose it as well. And so these metrics represent an indirect way of me accessing interesting things that happened in my class. Yeah, I kind of get what you're saying there, but I think it all hinges on the definition
Starting point is 00:06:36 of outside world, right? Okay. Like the caller of the code is, is... let's, let's live in a multiverse for a second here. The caller of the code is one world, but another world is sort of you as an operator of the software. And the test can stand in for both of those things. You don't have to do them all as one thing, right? The caller of the code can be like, yeah, I asked for this value, I got this value back.
Starting point is 00:07:02 I'm not getting a tuple that indicates whether it was a cache hit or a cache miss or some sort of other metadata that rides along with it because I don't actually care about that, right? And I shouldn't have to care about that just for the purposes of testing to make sure that my caching system works. But you as an operator of that software,
Starting point is 00:07:18 you as somebody who is going to be, you know, watching it run and making sure that the performance is good and making sure that the changes that you've made have taken effect as you would have expected, need another dimension into the multiverse of ways to see what is going on. And the test can stand in for that too. There is another actual aspect of this. I was just ranting to you earlier this week about this actually, when we were at lunch. There is another aspect of this that I think holds here and that is logging. Usually, what people do is they have a logging system and they just dump things into the logs. It's like, I've got this variable here, I'll log this out or whatever it is. Right. And, you know, it's the really unfortunate case when it's like,
Starting point is 00:08:06 okay, yeah, we have log statements in the code for when the terrible thing happens. And then you go and you look in the logs because the terrible thing has happened. And what you see is like, the magic value is %s. Because your whatever logging thing that you set up didn't actually capture the value that you wanted, because you thought it was templated
Starting point is 00:08:26 and it wasn't or whatever it is that happened. And like the one in a million moment is gone and gone and you're never gonna see it again, right? And so- Are you about to write tests for logging is what you're about to say. That is exactly what I'm saying. This is exactly what I'm saying is that I think
Starting point is 00:08:42 that one of the benefits of structured logging is that you can approach it in the exact same way that we are talking about these metrics, right? They are similar sounding things in a sense. It's just a different way of structured logging. One is a counter and the other one is maybe a sequence of events that you've logged. Exactly. Exactly right.
Starting point is 00:09:04 But it is exactly this thing of the tests are not just standing in for like the caller of the code as they usually do, but they are standing in for the sort of tomorrow you, who's a very responsible person and wants to know what their metrics are and what their logs are and wants to make sure that they're correct. And then you can also use both of those dimensions of kind of observability to understand what your code is doing and verify that it is correct, right? The tests can operate on both of those dimensions at the same time. I mean, who among us hasn't written that warning statement like, this is weird? And then, you know, your test coverage says, hey, you never hit the this is weird log line. And you're like, oh,
Starting point is 00:09:42 I should write a test for it. But realistically speaking, what am I going to do? All it does is log, this is weird. Right. Right. You know, I'm sure you've done this before, you know, with most logging systems, or certainly ones that I've interacted with. You have a test fixture that can capture the logs so you can write a test against them, but your assertion is something weak like assert 'this is weird' in captured.log.
Starting point is 00:10:05 Right. And that's better than a kick in the teeth, but it is not ideal. And what you're saying is, with a plain print, you know, certainly in terms of the text you're matching on, you know, it makes your test quite brittle, but if you can have a structured log... So I think we have talked about structured logging before, but do you want to just give us a quick recap of what you think of, right now, because I know you're in the middle of it all, what you think of as structured logging.
Starting point is 00:10:28 Yeah, I mean, and I grant that people have differing takes on this, and I think you can do it in different ways. But I think that if I were to try to summarize all of the different approaches that I've seen that have been called structured logging, it is kind of, you alluded to it earlier, it is treating your logs as a stream of events, right? Sometimes multiple streams of events, like you can think of like the info logs as one
Starting point is 00:10:49 stream and the error logs as a separate stream and the warning logs another stream, or you can mush them all together and have a heterogeneous thing. But the basic idea is that you are going to not think of your logs as I'm just puking some text out to standard error or standard out. It is, you know, there's a stream of events that is coming out of my system and I can turn those into human readable logs if I want, but I can turn them into whatever I want because I'm a wizard and I have programming skills and I can transform a stream of events into anything. And so it solves a number of kind of problems. And one of them is this sort of case of like making sure that you are actually capturing
Starting point is 00:11:33 the information in your logs that you think you are. Another one is this sort of case of like, well, how do I make sure that we are responding to this situation in which I want to do nothing? And in fact, the thing that sort of kicked off this whole conversation 10 minutes ago was me writing a test for a situation where I was skipping a trade that I wanted to ignore intentionally
Starting point is 00:11:53 because it was being replayed, right? Like it was like, oh, I wanna make sure this is idempotent. We've seen this trade already. I don't wanna publish it again. So like the correct action is to do nothing, right? Now in that case, I was making an assertion about a metric, but you could easily imagine that that could also be a log statement, and testing that nothing has happened is a very important thing to be able to do.
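A sketch of that sort of test, again with hypothetical names: the correct behaviour for a replayed trade is to publish nothing, and the only way the test can tell "correctly did nothing" from "never ran" is the counter that gets bumped.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

struct TradeProcessor {
  std::uint64_t trades_skipped_as_replay = 0;  // the metric under test
  std::vector<std::string> published;          // stands in for downstream output
  std::set<std::string> seen_ids;

  void on_trade(const std::string &trade_id) {
    if (!seen_ids.insert(trade_id).second) {
      ++trades_skipped_as_replay;  // the "something" that proves the nothing happened
      return;                      // idempotent: do not publish again
    }
    published.push_back(trade_id);
  }
};

int main() {
  TradeProcessor p;
  p.on_trade("T-1");
  p.on_trade("T-1");  // replayed trade
  assert(p.published.size() == 1);          // nothing extra went out...
  assert(p.trades_skipped_as_replay == 1);  // ...and we know the code actually ran
}
```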
Starting point is 00:12:17 Right. And more importantly, discriminating between the nothing has happened because I processed the event correctly and determined that nothing should happen, compared to you didn't call the process event function at all in your test, therefore nothing happened. Right. Yeah. So you can distinguish them, because the nothing that happened is actually... something did happen. The something was I bumped a metric saying ignored events plus plus, or I logged a warning, this event was skipped because it's a replay, or whatever it is that you've done. Yeah. That makes a lot of sense there. It certainly gives you a lot more. It lets you sleep at night a
Starting point is 00:12:53 bit more comfortably because, you know, again, how many times have we written tests where you realize this test is passing and then like scratching your head like, wait, it's, it's not being run, is it? That's right. I've misspelled test as 'tset'. And now my, my system that looks for only the word test is not actually running any of these files at all. Right. Right.
Starting point is 00:13:11 Right. The test where it's like, you can comment out all of the code that you thought you were testing and the test still passed because there's no assertion in it, right? It's just like run some code and hope an exception doesn't happen. Right. Like those are, those are very unfulfilling tests. Right.
Starting point is 00:13:25 So this gives you a way of measuring some of the, some of those types of events and quantifying them and saying that this, yeah, gathering confidence that actually you are doing the thing that you thought that you were doing. Yeah. Yeah. Yeah. Yeah. Another thing that you can do with structured logging, which has another
Starting point is 00:13:43 sort of flavor of this, is you have these moments sometimes where you're trying to test something and you're like, part of me just wants to reach into the center of this class and pull out this state, but I don't want to really do that because that's going to break the encapsulation of the class. I want to be able to refactor this code. I want to be able to change certain things about this code without having to change the tests because that's what refactoring is.
Starting point is 00:14:10 And I don't want to reach into the guts of this class because that'll make my test less valuable and make it so that I can't refactor. But I really want to know what this value is. And so one of the things that you can do with structured logging, which I think is really interesting, is it gives you a conduit to more carefully and selectively pull pieces of information out
Starting point is 00:14:29 of the internals of a class in a way that doesn't expose all of the guts. It just sort of exposes like the one little piece of information that you want. And the example of this is like you're going to have a log statement that says like the queue size is five, right? Well, it's like, I don't want to reach into the guts of the class and check to see what the queue size is. But in the instances where it's important to log what the queue size is, I can use that as a way to confirm my suspicions about what it should be.
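Something like this, with made-up names, is the shape of that: the structured event carries a queue_size field, and the test asserts on that one field rather than reaching into the class.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <vector>

// A structured log "event" here is just a bag of named fields.
using LogEvent = std::map<std::string, std::string>;

struct Worker {
  std::vector<LogEvent> &log;  // injected event sink
  std::queue<int> work;

  void enqueue(int item) {
    work.push(item);
    log.push_back({{"event", "enqueued"},
                   {"queue_size", std::to_string(work.size())}});
  }
};

int main() {
  std::vector<LogEvent> events;
  Worker w{events};
  for (int i = 0; i < 5; ++i) w.enqueue(i);

  // Confirm the suspicion about internal state via the event, not via a getter.
  assert(events.back().at("event") == "enqueued");
  assert(events.back().at("queue_size") == "5");
}
```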
Starting point is 00:14:58 And you can go another level deep with this if you want to. And I have, and I don't know if it's generally a good idea, but I think it's an interesting thing to talk about, which is when you have structured logs and you can find a way to do object serialization in those structured logs in a way that's not totally insane or sometimes just mildly insane, you can have complete objects that come out of there
Starting point is 00:15:23 and go into your logging system and can be reconstituted later. And the one place where I think I have seen this done the least insane is with exceptions, right? Like you have, you know, part of your logging system where if an exception occurs, you have a reasonably high confidence serialization system that allows you to capture that exception,
Starting point is 00:15:45 maybe with some special cases in there to make sure it's not too big, or contains a reference to an ephemeral resource or some other thing like that, but you have some confidence where you can turn it into something, and then when you're troubleshooting that error later, you can reconstitute it.
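As a purely illustrative sketch (nobody's real serialization scheme), capturing an exception into a structured event might be as little as this:

```cpp
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include <typeinfo>

using LogEvent = std::map<std::string, std::string>;

// Capture just the type and message; anything bigger, or anything holding a
// reference to an ephemeral resource, would need the special-casing Ben mentions.
LogEvent serialize_exception(const std::exception &e) {
  return {{"event", "exception"},
          {"type", typeid(e).name()},  // mangled, but stable enough to assert on
          {"what", e.what()}};
}

int main() {
  try {
    throw std::runtime_error("order book out of sync");
  } catch (const std::exception &e) {
    const LogEvent ev = serialize_exception(e);
    std::cout << ev.at("type") << ": " << ev.at("what") << "\n";
  }
}
```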
Starting point is 00:16:00 And I think that is a more obvious way to do this kind of thing, but I could also see situations in which that structured logging allows you to, sort of in a less brittle way, in a less encapsulation violating way, check to make sure that the internal state of things is what you expect. Which is to say, you know, again, the, the internal state in this instance is whether the cache was hit or not. And it's just a way of exposing that internal state without putting it in the face of the caller or having to add a whole metric subsystem specifically to that cache to say, did the last thing hit, all those kinds of things. So it's a really nice way of, yeah, like kind of side channel
Starting point is 00:16:45 attacking the internal state of your, you know... and slightly better than, you know, like having the, um, the other sort of, I guess it, is it an anti-pattern? Let me see what you think. You know, how many times have you written something that's like, you know, um, a get cache for testing function, which is, you know, like you look at it and you say, this uses the same functionality as the untested cache get function that isn't, you know, 'for testing', and you kind of look at it and you go like, it's three lines,
Starting point is 00:17:33 I think it's fine. Or, you know, sometimes you can implement one in terms of the other and hope, fingers crossed that the optimizer throws away the fact that in your not test version, you always discard that kind of side channel and therefore, you know, it all goes, it nets out. That's a nice way of doing it. But, um, yeah. So that, that, yeah. Do you think anytime, I mean, I certainly think of it anytime I write a test that has
Starting point is 00:17:55 for testing in it, I do die inside a little bit, but sometimes it's a necessary evil if I haven't... Yeah, it's, it's not great, but if I have to choose between adding a little bit of extra complexity to my code and not being confident that it works, I'm going to go with a little complexity is worth knowing that it actually works. But if there's a way to do both of those things at the same time, or do it in a way where that sort of surface area of the for testing is not only smaller, but also useful for other things, in which case that's a better way to do it. It loses the 'for testing' at that point, right?
Starting point is 00:18:26 It just becomes, yeah, it is just like, hey, this is a, yeah, a window into this class that is useful. Exactly. And the metrics exemplify that, the metrics and also structured logging. Yeah. Yeah. No, that's cool. Yeah.
Starting point is 00:18:40 Well, that's kind of all we had. I mean, I was going to say that's what we had planned, but we had no plans. We were just talking and then we're like, we should probably record this. So here we are. We could talk about metrics some more. I have lots of ideas. Well, let's do that. Good ways to do metrics.
Starting point is 00:18:58 Let's do that then. Yeah. I didn't want it to peter out awkwardly here. So one thing that I debate a lot is the sort of, I would say, the difference between push and pull metrics. So let's contrast two systems in particular as examples here. So one of them that is kind of top of mind for me recently actually is a system like StatsD right? The way statsd works is you have a centralized metrics collection service. And you create, and there's clients that do this for you,
Starting point is 00:19:34 but just describing how the protocol works. When you have a metric, like a counter that you want to increment, or maybe a gauge that it's like, yeah, the disk is like 96 percent full. Then you create a very small human readable text snippet, which is like, I think it's like the metric name, and then a pipe, and then a value, and then a pipe, and then the type,
Starting point is 00:19:56 whether it's a gauge or a counter or something like that. I think that's roughly the stats D thing. Then you put that in a datagram, and you send that datagram off to your central collection server and you have no idea whether it got there. And you mean like literally a network packet, a single net fire and forget network packet, UDP. Yes, UDP datagram just goes, and the idea is that this is really useful for metrics
Starting point is 00:20:22 where you don't want to block the sender, right? Like you don't want the sender to be like, I'm waiting to send this metric somewhere. But if it doesn't get to where it's going, it's maybe not the end of the world, right? So that's that is sort of one style. And there are other ways to maybe make that a little bit more reliable. And certainly if you use gauges and things like that more frequently than counters, you can get pretty reliable out of that. But one of the great advantages of that is that the senders
Starting point is 00:20:53 or the receiver doesn't need to know that the senders exist. You can have a situation where it's like a new system comes up and it starts publishing its metrics and the receiver is just like, oh, I guess I have a new thing that I need to worry about. It just receives a datagram from someone else and goes, new client. Fantastic. Right. And then there's exactly one piece of configuration, which is in all of the clients: where the aggregator, the one receiver, is. Got it. Okay. So that's, that's the push-based one.
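A rough sketch of that push model in C++ on POSIX. The commonly documented StatsD line format is name:value|type (for example orders.processed:1|c) on UDP port 8125; the host and metric name here are made up, and the speakers' own systems may differ.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>

// Fire and forget: format a statsd-style counter increment and send it as a
// single UDP datagram, with no idea whether it ever arrives.
void send_counter(const std::string &name, long delta) {
  const int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) return;  // the metrics path must never take the sender down

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(8125);  // statsd's conventional port
  inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

  const std::string payload = name + ":" + std::to_string(delta) + "|c";
  sendto(fd, payload.data(), payload.size(), 0,  // return value deliberately ignored
         reinterpret_cast<const sockaddr *>(&addr), sizeof(addr));
  close(fd);
}

int main() { send_counter("orders.processed", 1); }
```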
Starting point is 00:21:21 You're pushing out. Yeah. Yeah. Yeah. Yeah. And then you have systems like Prometheus, where the way Prometheus works is you've got an endpoint. I think it's usually an HTTP endpoint. I think it has to be an HTTP endpoint, actually. Could be wrong about that. But you've got some endpoint that's in your program that is being monitored, that is being observed.
Starting point is 00:21:39 And the Prometheus scraper reaches out to you on some periodic basis and says, give me your metrics, right? And so internally, you can have a thing where it's not like blocking the hot loop of any part of your execution. It's just sort of stashing the metrics in memory to be available the next time it comes around. But it's just taking this sort of like periodic snapshot of what is going on with the metrics, right? Now I'm not even talking about like the actual metric collection internally,
Starting point is 00:22:10 because there's like a billion different ways to do that. I'm kind of just talking about like, okay, assume you have a program that's got application level metrics, how does it get to somewhere else other than that machine? Right. And I think these are the sort of two basic ways that I've seen people do it. Absolutely. Push and pull. I mean, we've talked, I think about various UDP based systems before. I mean, we had one at several companies ago that I know you worked on, which was a
Starting point is 00:22:35 metric collection system that was more of the UDP datagram based thing. Obviously StatsD is an example of that. It has a lot of benefits. You mentioned the configuration is straightforward. Um, it's non-blocking, for some definition of non-blocking, in the publisher. I mean, sending a UDP datagram is kind of a heavyweight activity in some worlds. Uh, but it's straightforward, relatively speaking. And you said, certainly the, the StatsD format is very straightforward. So you blast it off.
Starting point is 00:23:13 Obviously the drawbacks are it might not get there, which reminds me of a joke, which I'd tell you, but, you know, it's about UDP, I don't think, I don't think you'd get it. Yeah, it might not get there. And if the, if the collector is down or misconfigured, you'll never know, you're just sending it out into the, into the ether literally. And, um, the, the, there could be a bottleneck if you're generating a ton of statistics back to back. If you've got like a, if you try and update your counter on every single
Starting point is 00:23:42 update, then you're sending a blast of relatively heavyweight packets at a machine and that machine has to be able to deal with all of that data. And in fact, you might back up trying to send it. So those are the drawbacks, but it's very, very appealing because, um, also if you're a very short lived application, if you're like a command line client, you might not live long enough to be scraped by a different system. Right. Then let's talk about the pull-based systems.
Starting point is 00:24:13 And let me just read that back to you. So in this instance, somehow some centralized system has to know about all of the places that have metrics. And then it is responsible for connecting to them in turn or however, and saying, give me a snapshot of your metrics, please over HTTP or TCP or something like that. So obviously the, the pro points there are, um, you, the collection system is responsible for the period upon, upon which it is collecting these statistics. So it could be like, well, I can do it once a second or once a minute or once an hour.
Starting point is 00:24:49 It doesn't matter as long as you know, I can configure that in one place and you're not being swamped by millions of intermediate values because you only care about it on the cadence that you care about. Um, the drawback is how do you find all your clients? Right. That sounds relatively common. Now I've got another problem. So yeah, okay, I've just read those back to you, but obviously you brought this subject up
Starting point is 00:25:11 I do have opinions. I do wanna make the point though, by the way, about sending the datagrams, is that you don't have to do that in process. Just as with Prometheus, you're gonna store your metrics in memory and then it's gonna get scraped.
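And a rough sketch of the pull side, with illustrative names: the program just keeps its counters in memory and renders them in Prometheus's plain-text exposition format when a scrape arrives (the HTTP serving itself is left out here).

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Render counters as "# HELP" / "# TYPE" lines followed by "name value",
// which is what a Prometheus scrape of a /metrics endpoint expects to see.
std::string render_prometheus(const std::map<std::string, std::uint64_t> &counters) {
  std::ostringstream out;
  for (const auto &[name, value] : counters) {
    out << "# HELP " << name << " Counter exported by this process.\n";
    out << "# TYPE " << name << " counter\n";
    out << name << " " << value << "\n";
  }
  return out.str();
}

int main() {
  const std::map<std::string, std::uint64_t> counters{
      {"cache_hits_total", 42}, {"cache_misses_total", 7}};
  std::cout << render_prometheus(counters);  // what the scraper would receive
}
```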
Starting point is 00:25:25 You can also store your metrics in memory and then send them out with some cadence over UDP. You can do them in line. You don't have to. Yeah, that makes sense. Yeah. But I am a huge fan. One of the sort of scary bedtime stories that finance dads tell their kids
Starting point is 00:25:51 is the story of Knight Capital and how a trading firm lost hundreds of millions of dollars in 45 minutes, something like that. Yes. And it's a terrible story. And it's funny because I actually, we used to work, you used to work, I work with somebody who actually is very familiar with this. Very familiar. Was directly involved with some of the companies that cleaned up afterwards anyway. And it's funny how much of this has turned into sort of like lore and folklore. Yeah. You know, it's been it's been, you know, kind of, you know, the game of telephone has been told many times. But it is nonetheless true that like, one of the problems that happened there is that they had software running that they did not realize was running, right? They didn't realize that it was doing what it was doing, right? And I
Starting point is 00:26:47 generally feel like I sleep better at night knowing that there's a central server. Everything that is running is at least trying to publish to that central server. And if something comes up unexpectedly, there's at least a chance, probably a very good chance, that those messages will suddenly appear on that central server and it will have the ability at least to detect that something is running that should not be running, right? You can kind of do a little hybrid of both of these things
Starting point is 00:27:21 if you want, you can have like, you know, the central server then reach back out to the sending clients. It can even give them an aggregated ack, where it's like, yeah, I've received 300 messages from you in the last minute or something. Just so you know, I'm actually receiving your messages. You can do things like that.
Starting point is 00:27:40 But the thing that really makes me sleep well at night with a lot of these systems is having a way so that if someone were to start a piece of software like on their desktop, or in some test server, or somewhere else, it would at least try to tell someone about it, as opposed to, well, it hasn't been added to the central configuration, so there's no way we could ever know about it. Got it. Yeah. I mean, there are different ways of solving that problem.
Starting point is 00:28:07 Obviously one way, because, you know, again, if you try and reach out to a server, but it doesn't come back to you, you still have this problem, right? You know, and in the finance worlds that we're talking about, we have very strict network segregation, which means that you might not be able to send the ping to the central servers to say like, Hey, I'm a production machine. So there's issues of that nature like that. And so I feel that like there is, there's always an incomplete part to this. There's always slightly of a blind spot here. But in general, a service discovery mechanism that's robust to these is, is useful
Starting point is 00:28:50 whether or not you're pushing information to a centralized server or whether or not you are being scraped by some centralized server. And that seems to me the more, the thing here is in saying like, if you're sending these periodic metric pings to some system, you could notice that something was alive and doing something unexpected. That's kind of begging the question of like, why are you using your metric system to determine the liveness of software? Why don't we have a software liveness indicator?
Starting point is 00:29:15 Maybe you are talking about that as well here, but absolutely. Yeah, I mean, I'm kind of like, I'm talking about this in the context where everything is already broken, right? It's sort of like, both of these systems work great when everything works great.
Starting point is 00:29:27 And it's like when they break, what are some of the different ways in which they break? And you're absolutely right that it's like network partitioning is one way in which the push-based model, the StatsD model doesn't save you because it's like you have a test server that's configured and running in prod and it can't reach the test network.
Starting point is 00:29:44 But then, so there's another sort of solution. There's another potential here, which is if we don't use the fire and forget single UDP datagram thing, and you have instead the TCP connection, then obviously you get the positive connection. Yeah. That you are talking to the centralized server, you get your ticket from it that says, yes, you're okay to run or whatever, you know, those kinds of things. But then, but then you are sort of solving the similar problem to, excuse the dog.
Starting point is 00:30:13 Um, you are solving similar problems to, um, and now I can't even remember what the thing's called now. What is it we use for service discovery at the old company? Um, Consul, Consul. Yeah. Which is, you know, um, uh, Chubby in Google terms, I think is the equivalent. And, you know, it's a, it's a centralized lock manager, but it's sort of a small amount of shared state between things.
Starting point is 00:30:33 And so people can go in and now obviously that's still opt in. And you still have to be part of the Consul cluster, or your system has to be registered with the Consul cluster, in order for it to be noticed. But that's what it's supposed to be. That's one of the things that's meant to be there for, is to say like, Hey, find me all the things that say that they are metrics producers, or everything that says I'm a web browser or a web, sorry, web server, or that kind of thing. And so that feels like a good solution, but just like my network partition example and whatever, you can still
Starting point is 00:31:02 break it because if you're not in the Consul cluster, then you're in a partitioned world of your own. Right. And so, yeah, there's not an easy solution to any of these things, but I do wonder if conflating metrics gathering with this is, is a good thing, uh, whether or not, you know... You just mentioned in passing that this is a useful thing to be able to do. It certainly is a surprise if you get a... This is one of those things where it's like, this is not a real solution to the problem that you're talking about it being. We have A and B, we're trying to choose between A and B, and I'm like, I think I like A better
Starting point is 00:31:37 than B. And I was like, why? Well, it's like, well, because in certain situations, it'll solve this problem. It's like, well, but in other situations, it won't. It's like, yeah, but that's not why we're talking about A and B. I'm just trying to pick between two options, right? No, that's really interesting. Yeah. Yeah. So it's just, yeah, yeah, no, no, I got it.
Starting point is 00:31:49 And I mean, ultimately it's, it's almost like what if you were to do, if you're doing metrics gathering, the hybrid solution of where, you know, instead of proactively being scraped, you just connect into the central server and then it asks you. So it's still push and pull. Like you connected into it and it knows that you existed, and therefore service discovery is, if you connect to port 8000 of the central machine, then we care about what information that you have. Um, but you get scraped by it saying, okay, give me what you got.
Starting point is 00:32:24 But obviously that doesn't work over HTTP, which obviously has convenience methods. Certainly when I'm a developer, uh, it's useful to be able to hit my own web server, and in fact, some of the tests I wrote, uh, involved scraping back over the HTTP port to check that I was actually exposing the metrics that I thought I was exposing when I was writing my own Prometheus endpoints. So I think, yeah. Yeah. And yeah, to say that the Knight Capital legend was purely that, and not that you did, but like, there were so many other aspects to that. It was very much the
Starting point is 00:33:02 Swiss cheese, and eventually the holes lined up and one of the things got through. Um, but yes, metrics. Yeah. The, the, the real thing, the real thing here is sort of bringing this back to observability in general a little bit is like, I, I think, I mean, and I do this in the systems that I have, what you probably wanna do in a system that has discovered that it is no longer observable
Starting point is 00:33:29 is to stop. Because it's sort of like the last gasp of like, someone pay attention to me. Yeah. Right? And so you wanna do that in multiple situations. You wanna probably have something like that at startup, like registering with some sort of central discovery service or sending out some sort of message saying like, Hey, I'm starting up. And if you don't have a way to acknowledge that someone
Starting point is 00:33:53 heard you, be like, okay, well, then I guess I'll stop then. Like having some mechanism to do that is a great sort of safety mechanism. Along with heartbeating to make sure that everyone, both ends, are still actually there and like, are you still there? And I don't just mean TCP ones, and I've had some debates with people about this, but I still think this is the way that I do it.
Starting point is 00:34:21 Is if you have a system that encounters a fault, so going back to our sort of structured logging, like I've logged an error or an exception, and I try to send that to somewhere to notify somebody, right, what happens when that fails? I think the right thing to do is, with a certain amount of retries, like keep retrying, but
Starting point is 00:34:45 like if you retry for some period of time, eventually you probably just want the system to stop. Now that's not universally true for every single system. There are things where it's like, no, this just needs to keep trucking even if it's having failures. But all other things being equal, my base argument is if you have a system that has an error, fine, errors happen. If you have a system that has an error and tries to report its error and it can't, okay, it should keep retrying. But at a certain point, it should just exit.
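A sketch of that policy, with made-up names and numbers: bounded retries to get the error out, then a deliberate exit as the last gasp.

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>

// Stands in for shipping the error to whatever alerting/monitoring exists.
// Hard-wired to fail here so the shutdown path below is what runs.
bool try_report(const std::string &message) {
  (void)message;
  return false;
}

void report_or_die(const std::string &message) {
  for (int attempt = 0; attempt < 5; ++attempt) {
    if (try_report(message)) return;                       // someone heard us
    std::this_thread::sleep_for(std::chrono::seconds(1));  // back off, retry
  }
  // Effectively unobservable: record what we can locally, then stop.
  std::cerr << "cannot report errors, shutting down: " << message << "\n";
  std::exit(EXIT_FAILURE);
}

int main() { report_or_die("position feed disconnected"); }
```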
Starting point is 00:35:16 I would not disagree with you on that. I mean, just to sort of like remind the listener though, that you and I come from a world of finance where there's a lot of regulatory stuff around. If we can't log what we're doing, you know, again, Knight Capital type stuff, if we can't tell somebody that something is up, then the best course of action is to stop doing anything further, log everything you can to disk and then kill the process and be done with it, and hope that that gets someone's attention. Right. Right. Why are we not trading anymore? Oh, it turns out the process self-destructed.
Starting point is 00:35:48 Why is that? Well, there's been a network split and it can't tell us what the positions are, you know, those kinds of things, and those are more defensible. But, but yeah, if my pacemaker, um, can't log an error, then maybe I don't want it to stop. Um, but you know, obviously I don't want my home wifi router to turn off because it can't send logs to someplace that I don't care about the logs for.
Starting point is 00:36:09 Right. Exactly. Yeah. So there are ways in, but, but I think as a, as a sensible default... and even within the finance industry, I think, you know, this is something that... I've worked on desks where it is okay to not be up and running. Like it's not great. You know, there's going to be some very long meetings where you're going to have to explain yourself, but, like, if you're not on and trading,
Starting point is 00:36:35 the only thing lost is an opportunity cost. You know, you, you weren't able to make money or whatever. And there are, there are manual ways of trading out of positions and those kinds of things. But if you have obligations to an exchange or downstream clients, then maybe you have to limp on and say, look, it's better for us to continue to be able to provide this service, albeit disrupted. Um, but I've never worked on a situation like that, so I'm always down with, yes, you know, like literally my C++ exception handling stuff is like log everything you can to disk and then kill -9
Starting point is 00:37:08 myself, like, you know, there's, there's no way we can carry on after this point here, right? We are done and dusted. I don't care if like the destructors don't run properly, just kill the process at this point and that's always okay. Yeah. Yeah. I tell you though, uh, just to tie this back to testing, cause why not?
Starting point is 00:37:24 That's where we started. The one piece of code you though, uh, just to tie this back to testing, cause why not? That's where we started. The one piece of code I've never really come up with a great way to test is the code that kills the program. So there is a, at least in a C++ framework I'm familiar with, there is a death test and it works by forking the process and then communicating between the two processes to make sure that this actually kills the process. Now, unfortunately, that's Unix.
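Matt doesn't name the framework, but GoogleTest's death tests work exactly this way; a sketch of what such a test might look like (link against gtest_main to run it):

```cpp
#include <cstdio>
#include <cstdlib>
#include <gtest/gtest.h>

// The kind of "log what you can, then stop" function under discussion.
[[noreturn]] void fatal_shutdown(const char *reason) {
  std::fprintf(stderr, "fatal: %s\n", reason);
  std::abort();  // no clean-up, no destructors; just kill the process
}

// EXPECT_DEATH forks, runs the statement in the child process, and checks both
// that the child actually died and that its stderr output matches the regex.
TEST(FatalShutdownDeathTest, AbortsTheProcess) {
  EXPECT_DEATH(fatal_shutdown("lost connection to exchange"),
               "fatal: lost connection");
}
```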
Starting point is 00:37:49 Unix being as complicated as it is, there's signal handling and there's like child parent relations and you can still not always get it right, but it's not a bad way of saying this should abort the process, right? Literally kill the process and be done with it. And you go, well, okay. And I will, I'll fork myself here, no light, no snickering in the back. And the child process will do that. And then the parent process monitors to make sure that that's what
Starting point is 00:38:13 happens through some, you know, unix domain thing. So you can write tests for these things. There's never an excuse not to write a test for something, he says, very well aware that I've just spent the last two weeks writing very limitedly tested code, but that's a whole other story. Yeah. All right, friend. I think, yeah, that's probably a good place to call it. Right.
Starting point is 00:38:36 Yeah. This, this expanded from a, I have an idea, to 40 minutes worth of, of conversation, which is how it should be. And I'm... I enjoyed it. But metrics are more useful than you might think, and you should keep them. Unstructured logging is always a choice too. So yeah, it's a choice. That's for sure. Alright, until next time. Until next time. Our theme music is by Inverse Phase. Find out more at inversephase.com
