Two's Complement - Observable Metrics
Episode Date: April 10, 2026
Matt and Ben explore the intersection of testing, metrics, and observability in performance-critical code. They debate push vs pull metric systems, share war stories from financial trading systems, and ponder what to do when your program can't tell anyone it's in trouble.
Transcript
I'm Matt Godbolt.
And I'm Ben Rady.
And this is Two's Complement, a programming podcast.
Hey Ben!
Hey Matt.
I am not convinced my levels are right here, so I apologize if the audio is awful on this.
Something isn't going well here.
You think it's too quiet? I think it's too quiet. I mean, how does it sound to you?
It sounds good to me. All right, but I've turned the gain up here and it looks really...
Okay, well, we'll go with that. Anyway, hi!
How are you doing, friend?
This is a catastrophe for editing Matt, as we were just, uh, during the opening jingle, we were taking the Mickey out of editing Matt
because he's a jerk, um, but that's fine.
Cause he's not here.
Yeah.
The moment you find that Yesterday Ben is a jerk and Tomorrow Ben is the most
responsible person on the face of the earth.
He's going to take care of everything.
'Cause I'm not, yeah, you know, that's a
hundred percent how I see the world too.
Yeah.
So we, we had started this as we often do by
chatting in a Google Meet before this.
And then we were like, let's just record, let's do it.
And then we had to switch out and then unfortunately, because we're now
using a separate recording system
that involves fiddling around with web browser settings
and plugging in different microphones and everything.
And now I can't even remember
what it was we were talking about.
So when I just cut over to this,
I was of course in the middle of writing a test
because what else would I be doing?
Of course.
And I was saying how one of the things that I love to do
is test my metrics.
That's right.
And you had made a very interesting point
about performance sensitive code and this technique.
So do you recall what you said?
I do now.
Thank you for being the responsible adult
and remembering from one minute to the next
what on earth is going on.
Tomorrow Ben has manifested today. Yeah, it seems so.
It seems so.
So yeah, you had said that you were writing tests for the metrics in your code.
And that's a fantastic thing to do, because if you're relying on those metrics, you
probably want to make sure they're right.
And then I'd said, for some areas of performance code, certainly in C++ land,
sometimes there isn't a seam for you to put, uh, like an interface or some kind of, um, measuring point in your regular code to
write a test around.
So like the classic example I can think of is if you've got a high performance,
like cache of information,
that is like a software cache, just to be clear, then you want to be able to test whether or not
you hit the cache or not. But the cache is there to be transparent, right? It's either
there or it's not there or whatever, or, you know, it fetches a value or maybe it's more like
a memoization cache. So, you know, it either computes a value, or it returns the value that
it computed before.
The caller doesn't need to care about that and you don't want to have to break your interface
just to write a test, and you don't want to have to pass in a listener class that says, hey, on
cache result, because, you know,
it's not good for the design of your system.
It's maybe not good for the performance, but you probably do want metrics about how often your cache is being hit.
And if you're writing performance code, you've probably written a relatively performant metric
system.
Right.
Yeah.
And then it becomes a natural way of measuring whether your code is doing the things that
you're wanting.
Am I getting a cache hit?
Am I getting a cache miss by looking at the metrics?
Yes.
And so that was where we were when we said, let's record this.
And now that's the end of the podcast.
Thank you for listening everybody.
Thanks everybody.
Outro music plays.
No, and I mean, I think that this is a great specific example of a general thing, which
I think we've talked about on the podcast before of, of what does it mean to have testable code?
What does it mean to have code that is testable?
And there is a sort of a premise that is baked into all of this and it's woven into like
test-driven development and a bunch of other things, which is if you build software that
has a nice interface, for some definition of nice, it will be easy to test.
And writing the tests helps you create that nice interface.
And this is a specific example of that,
because the thing that's nice about this
is the observability.
We want to have code that is observable.
We want to have code where we can know what it's doing.
And the test in this case is giving you
a very specific thing of, I need to know if I hit the cache.
You need to know that for the test. And you need to know that when you're running your
software, and that is the exact same problem.
And so the tests are there to do that.
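[A minimal sketch of the kind of thing being discussed here, assuming a hypothetical memoization cache and a hand-rolled counter object; none of these names come from the episode:]

```cpp
// Hypothetical sketch: a memoization cache that bumps counters on a shared
// metrics object, and a test that observes hits/misses through those counters
// rather than through the cache's public interface.
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

struct Metrics {
    std::unordered_map<std::string, std::uint64_t> counters;
    void increment(const std::string& name) { ++counters[name]; }
    std::uint64_t value(const std::string& name) const {
        auto it = counters.find(name);
        return it == counters.end() ? 0 : it->second;
    }
};

class MemoCache {
public:
    MemoCache(std::function<int(int)> compute, Metrics& metrics)
        : compute_(std::move(compute)), metrics_(metrics) {}

    // Callers just see "get me the value"; hit or miss is invisible to them.
    int get(int key) {
        if (auto it = cache_.find(key); it != cache_.end()) {
            metrics_.increment("cache.hit");
            return it->second;
        }
        metrics_.increment("cache.miss");
        return cache_[key] = compute_(key);
    }

private:
    std::function<int(int)> compute_;
    Metrics& metrics_;
    std::unordered_map<int, int> cache_;
};

int main() {
    Metrics metrics;
    MemoCache cache([](int x) { return x * x; }, metrics);

    assert(cache.get(3) == 9);                  // first call computes
    assert(metrics.value("cache.miss") == 1);
    assert(cache.get(3) == 9);                  // second call should hit
    assert(metrics.value("cache.hit") == 1);    // observed via the metric, not the API
}
```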
Yeah, but I suppose specifically in this instance, like, a not totally unreasonable API to your, whatever this cache thing is, is that you return
like a tuple of the thing that you got out of the cache and some status object
that said, did I get it from the cache?
Was it a hit?
Was it a miss?
Sure.
Yeah, that's a reasonable interface, in which case now you can test that way.
Well, that's my point, right?
That is a reasonable way to write this
system, but very specifically, oftentimes, if you're like in a very high performance
kind of piece of code, by writing the interface that way, you have pessimized
the case where you don't care if it was in the cache or not, which is the very common
case. And you're forcing everything through this. But the metrics are something that are sort of a side channel, that are something
you still care about and are performant.
And you're now using them to test the inner workings of something that should be
transparent and you in fact want it to be transparent.
And I think there's sort of a, maybe you're saying those are the same things.
Um, I've just found it to be an interesting way of saying, like, there's some internal workings
of a class that I would like to be able to test, but I don't really want to expose it to the outside
world. And I can't expose it to the outside world, either because the performance
characteristics would be different, or, well, because I can't directly expose it anyway. And
so the metrics represent an indirect
way of me accessing interesting things that happened in my class.
Yeah, I kind of get what you're saying there, but I think it all hinges on the definition
of outside world, right? Okay. Like the caller of the code is... let's live in a
multiverse for a second here. The caller of the code is one world,
but another world is sort of you
as an operator of the software.
And the test can stand in for both of those things.
You don't have to do them all as one thing, right?
The caller of the code can be like,
yeah, I asked for this value, I got this value back.
I'm not getting a tuple that indicates
whether it was a cache hit or a cache miss
or some sort of other metadata that rides along with it
because I don't actually care about that, right?
And I shouldn't have to care about that
just for the purposes of testing
to make sure that my caching system works.
But you as an operator of that software,
you as somebody who is going to be, you know,
watching it run and making sure that the performance is good
and making sure that the changes that you've made have taken effect as you would have expected, needs another dimension into the multiverse of ways to see what is going on.
And the test can stand in for that too.
There is another actual aspect of this.
I was just ranting to you earlier this week about this actually, when we were at lunch. There is another aspect of this that I think holds here and that is logging. Usually,
what people do is they have a logging system and they just dump things into the logs. It's like,
I've got this variable here, I'll log this out or whatever it is. Right. And, you know, it's the really unfortunate case when it's like,
okay, yeah, we have log statements in the code
for when the terrible thing happens.
And then you go and you look in the logs
because the terrible thing has happened.
And what you see is like, the magic value is %s.
Because whatever logging thing that you set up
didn't actually capture the value that you wanted
because you thought it was templated
and it wasn't, or whatever it is that happened.
And like the one in a million moment is gone and gone
and you're never gonna see it again, right?
And so-
Are you about to write tests for logging
is what you're about to say.
That is exactly what I'm saying.
This is exactly what I'm saying is that I think
that one of the benefits of structured logging
is that you can approach it in the exact same way that we are talking about these metrics,
right?
They are similar sounding things in a sense.
Metrics are just a different kind of structured logging.
One is a counter and the other one is maybe a sequence of events that you've logged.
Exactly.
Exactly right.
But it is exactly this thing of the tests are not just standing in for like the caller
of the code as they usually do, but they are standing in for the sort of tomorrow you,
who's a very responsible person and wants to know what their metrics are and what their
logs are and wants to make sure that they're correct.
And then you can also use both of those dimensions of
kind of observability to understand what your code is doing and verify that it is correct, right? The tests can operate on both of those dimensions at the same time.
I mean, who among us hasn't written that warning statement like this is weird? And then, you know,
your test coverage says, hey, you never hit the this is weird log line. And you're like, oh,
I should write a test for it. But realistically speaking, what am I going to do?
All it does is log, this is weird.
Right.
Right.
You know, I'm sure you've done this before, you know, even with, you know, most
logging systems, or certainly ones that I've interacted with.
You have a test fixture that can capture the log so you can write a test, but then
your assertion is something weak like assert "this is weird" in captured.log.
Right.
And that's better than a kick in the teeth, but it is not ideal.
And what you're saying is, with a plain print, you know, certainly in terms of the
text you're matching, you know, it makes your test quite brittle. But if you can
have a structured log... so I think we have talked about structured logging before,
but do you want to just give us a quick recap of what, right now,
because I know you're in the middle of it all,
you think of as structured logging?
Yeah, I mean, and I grant that people
have differing takes on this,
and I think you can do it in different ways.
But I think that if I were to try to summarize
all of the different approaches that I've seen
that have been called structured logging,
it is kind of, you alluded to it earlier,
it is treating your logs as a stream of events, right? Sometimes multiple streams of events, like you can think of like the info logs as one
stream and the error logs as a separate stream and the warning logs another stream, or you can
mush them all together and have a heterogeneous thing. But the basic idea is that you are going
to not think of your logs as I'm just puking some text out to standard error or
standard out. It is, you know, there's a stream of events that is coming out of my
system and I can turn those into human readable logs if I want, but I can turn
them into whatever I want because I'm a wizard and I have programming skills and
I can transform a stream of events into anything. And so it solves a number of kind of problems.
And one of them is this sort of case of like making sure that you are actually capturing
the information in your logs that you think you are.
Another one is this sort of case of like, well, how do I make sure that we are responding
to this situation in which I want to do nothing?
And in fact, the thing that sort of kicked off
this whole conversation 10 minutes ago
was me writing a test for a situation
where I was skipping a trade
that I wanted to ignore intentionally
because it was being replayed, right?
Like it was like, oh, I wanna make sure this is idempotent.
We've seen this trade already.
I don't wanna publish it again.
So like the correct action is to do nothing, right?
Now in that case, I was making an assertion about a metric, but you could easily imagine
that that could also be a log statement, and testing that nothing has
happened is a very important thing to be able to do.
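[A rough illustration of that idea, with made-up names: the only observable effect of the deliberate "do nothing" is a counter, and the test asserts on it to tell "ignored on purpose" apart from "never ran at all":]

```cpp
// Hypothetical sketch: a trade processor that deliberately does nothing for a
// replayed trade, except bump a counter. The test asserts on that counter to
// distinguish "ignored on purpose" from "the code never ran".
#include <cassert>
#include <set>
#include <vector>

struct Trade { int id; };

class TradeProcessor {
public:
    void process(const Trade& trade) {
        if (!seen_.insert(trade.id).second) {
            ++ignored_replays;        // the only observable effect of "do nothing"
            return;
        }
        published.push_back(trade.id);
    }

    std::vector<int> published;
    long ignored_replays = 0;

private:
    std::set<int> seen_;
};

int main() {
    TradeProcessor p;
    p.process({42});
    p.process({42});                      // replay: the correct action is to do nothing
    assert(p.published.size() == 1);      // nothing extra was published...
    assert(p.ignored_replays == 1);       // ...and we know the code actually ran
}
```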
Right.
And more importantly, discriminating between the nothing has happened because I processed
the event correctly and determined that nothing should happen, compared to you didn't call the process event function at all in your test. Therefore,
nothing happened. Right. Yeah. So you can distinguish them when the nothing that happened
is actually something did happen. The something was I bumped a metric saying ignored_events++,
or I logged a warning: this event was skipped
because it's a replay, or whatever it is that you've done. Yeah. That makes a lot
of sense there. It certainly gives you a lot more. It lets you sleep at night a
bit more comfortably because you know, again, how many times have we written
tests where you realize this test is passing and then like scratching your
head like, wait, it's not being run, is it? That's right. I've mistyped
test as "tset".
And now my, my system that looks for only the word test is not actually
running any of these files at all.
Right.
Right.
Right.
The test where it's like, you can comment out all of the code that you
thought you were testing and the test still passed because there's no
assertion in it, right?
It's just like run some code and hope an exception doesn't happen.
Right.
Like those are, those are very unfulfilling tests.
Right.
So this gives you a way of measuring some of those types of events
and quantifying them and saying that this, yeah, gathering confidence that
actually you are doing the thing that you thought that you were doing.
Yeah.
Yeah.
Yeah.
Yeah.
Another thing that you can do with structured logging, which has another
sort of flavor of this, is you have
these moments sometimes where you're trying to test something and you're like, part of
me just wants to reach into the center of this class and pull out this state, but I
don't want to really do that because that's going to break the encapsulation of the class.
I want to be able to refactor this code.
I want to be able to change certain things about this code
without having to change the tests
because that's what refactoring is.
And I don't want to reach into the guts of this class
because that'll make my test less valuable
and make it so that I can't refactor.
But I really want to know what this value is.
And so one of the things that you can do
with structured logging,
which I think is really interesting,
is it gives you a conduit to more carefully and selectively pull pieces of information out
of the internals of a class in a way that doesn't expose all of the guts. It just sort
of exposes like the one little piece of information that you want. And the example of this is
like you're going to have a log statement that says like the queue size is five, right?
Well, it's like, I don't want to reach into the guts of the class
and check to see what the queue size is.
But in the instances where it's important to log what the queue
size is, I can use that as a way to confirm my suspicions
about what it should be.
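[A sketch of that, assuming a hypothetical structured-log event type: the test asserts on a field of the event rather than on formatted text or on the class's internals; all names here are invented:]

```cpp
// Hypothetical sketch: structured log events are plain data, so a test can
// fish one field (the queue size) out of the event stream instead of reaching
// into the class's guts or matching on a formatted string.
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct LogEvent {
    std::string message;
    std::map<std::string, long> fields;   // structured payload, not just text
};

struct TestLogSink {
    std::vector<LogEvent> events;
    void log(LogEvent e) { events.push_back(std::move(e)); }
};

class WorkQueue {
public:
    explicit WorkQueue(TestLogSink& log) : log_(log) {}
    void push(int item) {
        items_.push_back(item);
        log_.log({"enqueued item", {{"queue_size", (long)items_.size()}}});
    }
private:
    TestLogSink& log_;
    std::vector<int> items_;
};

int main() {
    TestLogSink sink;
    WorkQueue q(sink);
    for (int i = 0; i < 5; ++i) q.push(i);

    // Assert on the structured field, not on the rendered log text.
    assert(sink.events.back().fields.at("queue_size") == 5);
}
```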
And you can go another level deep with this if you want to.
And I have, and I don't know if it's generally a good idea,
but I think it's an interesting thing to talk about,
which is when you have structured logs
and you can find a way to do object serialization
in those structured logs in a way that's not totally insane
or sometimes just mildly insane, you
can have complete objects that come out of there
and go into your logging system
and can be reconstituted later.
And the one place where I think I have seen this done
the least insane is with exceptions, right?
Like you have, you know, part of your logging system
where if an exception occurs,
you have a reasonably high confidence serialization system
that allows you to capture that exception,
maybe with some special cases in there
to make sure it's not too big,
or contains a reference to an ephemeral resource
or some other thing like that,
but you have some confidence
where you can turn it into something,
and then when you're troubleshooting that error later,
you can reconstitute it.
And I think that is a more obvious way
to do this kind of thing,
but I could also see situations in which that structured logging allows you to, sort of in a less brittle way, in a less encapsulation-violating way, check to make sure that the internal state of things is what you expect.
Which is to say, you know, again, the internal state in this instance is whether
the cache was hit or not.
And it's just a way of exposing that internal state without putting it either in the face
of the caller or having to add a whole testing subsystem specifically to that cache
to say, did the last thing hit, all those kinds of things.
So it's a really nice way of, yeah, kind of side-channel
attacking the internal state of your class, you know, and slightly
better than, you know, the other
sort of... I guess, is it an anti-pattern?
Let me see, see what you think.
You know, how many times have you written something
that's like, you know, a getCacheForTesting function,
which is, you know, like you look at it and you say, this uses the same functionality as the untested cache get function that isn't my,
you know, for-testing cache, and you kind of look at it and you go, like, it's three lines,
I think it's fine. Or, you know, sometimes you can implement one in terms of the other and hope,
fingers crossed, that the optimizer throws away the fact that in your non-test version,
you always discard that kind of side channel,
and therefore, you know, it all nets out.
That's a nice way of doing it.
But, um, yeah.
So that, that, yeah.
Do you think anytime, I mean, I certainly think of it anytime I write a test that has
for testing in it, I do die inside a little bit, but sometimes it's a necessary evil if
I haven't.
Yeah, it's, it's not great, but if I have to choose between adding a little bit of extra complexity to
my code and not being confident that it works, I'm going to go with a little complexity is
worth knowing that it actually works. But if there's a way to do both of those things
at the same time, or do it in a way where that sort of surface area of the for testing
is not only smaller, but also useful for other things, in which case that's a better way to do it.
It loses the "for testing" at that point, right?
It just becomes, yeah, it is just like, hey, this is a, yeah, a window
into this class that is useful.
Exactly.
And the metrics are an example of that, the metrics and also structured logging.
Yeah.
Yeah.
No, that's cool.
Yeah.
Well, that's kind of all we had.
I mean, I was going to say that's what we had planned, but we had no plans.
We were just talking and then we're like, we should probably record this.
So here we are.
We could talk about metrics some more.
I have lots of ideas.
Well, let's do that.
Good ways to do metrics.
Let's do that then.
Yeah.
I didn't want it to peter out awkwardly here.
So one thing that I debate a lot is the sort of, I would say, the difference between
push and pull metrics. So let's contrast two systems in particular as examples here. So one of them
that is kind of top of mind for me recently, actually, is a system like StatsD, right? The way StatsD works is you have a centralized metrics
collection service.
And you create, and there's clients that do this for you,
but just describing how the protocol works.
When you have a metric, like a counter
that you want to increment, or maybe a gauge that it's like,
yeah, the disk is like 96 percent full.
Then you create a very small human readable text snippet,
which is like, I think it's like the metric name,
and then a pipe, and then a value,
and then a pipe, and then the type,
whether it's a gauge or a counter or something like that.
I think that's roughly the StatsD thing.
Then you put that in a datagram,
and you send that datagram off to
your central collection server and you have no idea whether it got there.
And you mean like literally a network packet, a single fire-and-forget network packet,
UDP.
Yes, UDP datagram just goes, and the idea is that this is really useful for metrics
where you don't want to block the sender, right?
Like you don't want the sender to be like, I'm waiting to send this metric somewhere.
But if it doesn't get to where it's going, it's maybe not the end of the world, right?
So that's that is sort of one style. And there are other ways to maybe make that a little bit
more reliable. And certainly if you use gauges and things like that
more frequently than counters,
you can get pretty reliable results out of that.
But one of the great advantages of that is that the senders
or the receiver doesn't need to know that the senders exist.
You can have a situation where it's like a new system
comes up and it starts publishing its metrics
and the receiver is just like,
oh, I guess I have a new thing that I need to worry about.
It just receives a datagram from someone else and goes, new client! Fantastic. Right. And
then there's exactly one piece of configuration, which is in all of the clients: where the aggregator,
the one receiver, is. Got it. Okay. So that's, that's the push-based one.
You're pushing out.
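[A sketch of that push model: StatsD metric lines are commonly written as name:value|type and sent in a single fire-and-forget UDP datagram; the host, metric name, and the POSIX-socket plumbing here are just for illustration:]

```cpp
// Hypothetical sketch of the push model: format a StatsD-style metric line
// ("name:value|type") and fire it off in one UDP datagram, never learning
// whether it arrived. Host and metric name are made up; 8125 is the
// conventional StatsD port.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>

void push_counter(const std::string& name, long delta) {
    std::string payload = name + ":" + std::to_string(delta) + "|c";  // "c" = counter

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return;                         // fire and forget: give up silently

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8125);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    // One datagram, no reply expected, no idea if anyone is listening.
    sendto(fd, payload.data(), payload.size(), 0,
           reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    close(fd);
}

int main() { push_counter("cache.hit", 1); }
```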
Yeah. Yeah. Yeah. Yeah. And then you have systems like Prometheus,
where the way Prometheus works is you've got an endpoint.
I think it's usually an HTTP endpoint.
I think it has to be an HTTP endpoint, actually.
Could be wrong about that.
But you've got some endpoint that's in your program
that is being monitored, that is being observed.
And the Prometheus scraper reaches out to you
on some periodic basis and says, give me your metrics, right?
And so internally, you can have a thing where it's not like blocking the hot loop of any
part of your execution.
It's just sort of stashing the metrics in memory to be available the next time it comes
around.
But it's just taking this sort of like periodic snapshot of what is going on with the metrics, right?
Now I'm not even talking about like the actual metric collection internally,
because there's like a billion different ways to do that.
I'm kind of just talking about like, okay, assume you have a program that's
got application level metrics, how does it get to somewhere else other than that
machine? Right.
And I think these are the sort of two basic ways that I've seen people do it.
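[And a sketch of the pull model: counters just live in memory and are rendered in Prometheus-style text only when asked. In a real program this string would be served from an HTTP endpoint, commonly /metrics, for the scraper to fetch on whatever cadence it likes; the registry here is invented for illustration:]

```cpp
// Hypothetical sketch of the pull model: the hot path only bumps counters in
// memory; a scrape renders them as "metric_name value" lines on demand.
#include <atomic>
#include <iostream>
#include <map>
#include <string>

class Registry {
public:
    std::atomic<long>& counter(const std::string& name) { return counters_[name]; }

    // Called only when the scraper asks; the hot path never blocks on this.
    std::string render() const {
        std::string out;
        for (const auto& [name, value] : counters_)
            out += name + " " + std::to_string(value.load()) + "\n";
        return out;
    }

private:
    std::map<std::string, std::atomic<long>> counters_;
};

int main() {
    Registry registry;
    registry.counter("cache_hits_total") += 3;   // hot path just bumps memory
    registry.counter("cache_misses_total") += 1;

    std::cout << registry.render();              // what a scrape would return
}
```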
Absolutely. Push and pull.
I mean, we've talked, I think about various UDP based systems before.
I mean, we had one at several companies ago that I know you worked on, which was a
metric collection system that was more of the UDP datagram based thing.
Obviously stats D is an example of that.
It has a lot of benefits.
You mentioned the configuration is straightforward.
Um, it's non-blocking, for some definition of non-blocking, in the publisher.
I mean, sending a UDP datagram is kind of a heavyweight activity in some worlds.
Uh, but it's straightforward, relatively speaking.
And as you said, certainly the StatsD format is very straightforward. So you blast it off.
Obviously the drawbacks are it might not get there, which reminds me of a joke, which I'd tell you, but, you know, it's about UDP.
I don't think you'd get it.
So yeah, it might not get there.
And if the collector is down or misconfigured, you'll never
know. You're just sending it out into the ether, literally.
And, um, there could be a bottleneck if you're generating a
ton of statistics back to back.
If you've got, like, if you try and update your counter on every single
update, then you're sending a blast of relatively heavyweight packets at a machine, and that machine has to be able
to deal with all of that data.
And in fact, you might back up trying to send it.
So those are the drawbacks, but it's very, very appealing because, um, also if you're
a very short-lived application, if you're like a command line client, you might not
live long enough to be scraped by a different system.
Right.
Then let's talk about the pull-based systems.
And let me just read that back to you.
So in this instance, somehow some centralized system has to know about all
of the places that have metrics.
And then it is responsible for connecting to them in turn or however, and saying,
give me a snapshot of your metrics, please over HTTP or TCP or something like that.
So obviously the pro points there are, um, the collection system is
responsible for the period upon which it is collecting these statistics.
So it could be like, well, I can do it once a second or once a minute or once an hour.
It doesn't matter as long as you know, I can configure that in one place and you're
not being swamped by millions of intermediate values because you only care about it on the
cadence that you care about.
Um, the drawback is how do you find all your clients?
Right. That sounds relatively common.
Now I've got another problem.
So yeah, okay, I've just read those back to you,
but obviously you brought this subject up
because I believe you probably have opinions
and I'd be interested in your opinions on those things.
I do have opinions.
I do wanna make the point though, by the way,
about sending the datagrams is that you don't have to do
that in process, just as with Prometheus,
you're gonna store your metrics in memory
and then it's gonna get scraped.
You can also store your metrics in memory
and then send them out with some cadence over UDP.
You can do them in line.
You don't have to.
Yeah, that makes sense.
Yeah.
But I am a huge fan.
One of the sort of scary bedtime stories that finance dads tell their kids
is the story of Knight Capital and how a trading firm lost hundreds of millions of dollars
in 45 minutes, something like that. Yes. And it's a terrible story. And it's funny because I actually, we used to work with, you used to work with, I work with somebody who is very
familiar with this. Very familiar. Was directly involved with some of the companies that cleaned
up afterwards, anyway. And it's funny how much of this has turned into sort of like lore
and folklore. Yeah. You know, it's been, you know, kind of,
you know, the game of telephone has been played many times. But it is nonetheless true that, like,
one of the problems that happened there is that they had software running that they did not realize
was running, right? They didn't realize that it was doing what it was doing, right? And I
generally feel like I sleep better at night knowing that there's a central server. Everything that is
running is at least trying to publish to that central server. And if something comes up
unexpectedly, there's at least a chance, probably a very good chance,
that those messages will suddenly appear
on that central server and it will have the ability
at least to detect that something is running
that should not be running, right?
You can kind of do a little hybrid of both of these things
if you want, you can have like, you know,
the central server then reach back out
to the sending clients.
It can even give them an aggregated ack,
where it's like, yeah, I've received 300 messages from you
in the last minute or something.
Just so you know, I'm actually receiving your messages.
You can do things like that.
But the thing that really makes me sleep well
at night with a lot of these systems is having a way so that
if someone were to start a piece of software like on their
desktop, or in some test server, or somewhere else, it would at
least try to tell someone about it, as opposed to, well, it
hasn't been added to the central configuration. There's no way
we could ever know about it. Yeah.
I mean, there are different ways of solving that problem.
Obviously one way, because, you know, again, if you try and reach out to a
server, but it doesn't come back to you, you still have this problem, right?
You know, and in the finance worlds that we're talking about, we have very strict
network segregation, which means that you might not be able to send the ping to the central servers to say like, Hey, I'm a production machine.
So there's issues of that nature like that.
And so I feel that like there is, there's always an incomplete part to this.
This is always slightly of a blind spot here. But in general, a service discovery mechanism
that's robust to these things is useful
whether or not you're pushing information to a centralized server or whether or not you are being scraped by some centralized server.
And it seems to me the thing here is, in saying, like, if you're
sending these periodic metric pings to some system, you could notice that
something was alive and doing something unexpected.
That's kind of begging the question of like,
why are you using your metric system
to determine the liveness of software?
Why don't we have a software liveness indicator?
Maybe you are talking about that as well here,
but absolutely.
Yeah, I mean, I'm kind of like,
I'm talking about this in the context
where everything is already broken, right?
It's sort of like,
both of these systems work great
when everything works great.
And it's like when they break,
what are some of the different ways in which they break?
And you're absolutely right that it's like
network partitioning is one way in which
the push-based model, the StatsD model doesn't save you
because it's like you have a test server
that's configured and running in prod
and it can't reach the test network.
But then, so there's another sort of solution, there's
another potential here, which is if we don't use the fire-and-forget single UDP datagram thing,
and you have instead a TCP connection, then obviously you get the positive connection.
Yeah.
That you are talking to the centralized server. You get your ticket from it that says, yes,
you're okay to run or whatever,
you know, those kinds of things.
But then you are sort of solving a similar problem to, excuse the dog.
Um, you are solving similar problems to, um, and now I can't even
remember what the thing's called now.
What is it we use for service discovery?
The old company, um, Consul. Consul.
Yeah.
Which is, you know, um, Chubby in Google terms, I think, is the equivalent.
And, you know, it's a, it's a centralized lock manager, but
it's sort of a small amount of shared state between things.
And so people can go in, and now obviously that's still opt-in.
And you still have to be part of the Consul cluster, or your
system has to be registered with the Consul
cluster in order for it to be noticed.
But that's what it's supposed to be. That's one of the things that's meant to be there for
is to say, like, hey, find me all the things that say that they are metrics producers, or everything
that says I'm a web browser or a web, sorry, web server, or that kind of thing. And so that feels
like a good solution, but just like my network partition example and whatever, you can still
break it, because if you're not in the Consul cluster, then you're in a partitioned world of your own. Right. And so, yeah, there's
not an easy solution to any of these things, but I do wonder if conflating metrics gathering
with this is a good thing, uh, whether or not, you know... you just mentioned
in passing that this is a useful thing to be able to do. It certainly is a surprise if you get a...
This is one of those things where it's like, this is not a real solution to the problem
that you're talking about it being.
We have A and B, we're trying to choose between A and B, and I'm like, I think I like A better
than B. And I was like, why?
Well, it's like, well, because in certain situations, it'll solve this problem.
It's like, well, but in other situations, it won't.
It's like, yeah, but that's not why we're talking about A and B. I'm just trying to pick between two options, right?
No, that's really interesting.
Yeah.
Yeah.
So it's just, yeah, yeah, no, no, I got it.
And I mean, ultimately, it's almost like, what if you were to do, if you're doing
metrics gathering, the hybrid solution where, you know, instead of proactively being
scraped, you just connect into the central server and then it asks you.
So it's still push and pull.
Like you connect into it and it knows that you exist, and therefore service
discovery is: if you talk to port 8000 of the central machine, then we care about
what information you have.
Um, but you get scraped by it saying, okay, give me what you got.
But obviously that doesn't work over HTTP, which obviously
has convenience methods.
Certainly when I'm a developer, uh, it's useful to be able to hit my own web
server, and in fact, some of the tests I wrote, uh, involved scraping back
over the HTTP port to check that I was actually exposing the metrics that I
thought I was exposing when I was writing my own Prometheus endpoints.
So I think, yeah. Yeah. And yeah, to say that the Knight Capital legend was purely that,
and not that you did, but like, there were so many other aspects to that. It was very much the
Swiss cheese model, and eventually the holes lined
up and one of the things got through.
Um, but yes, metrics.
Yeah.
The real thing here, sort of bringing this back to
observability in general a little bit, is like, I think, I mean, and I do this
in the systems that I have, what you probably wanna do in a system
that has discovered that it is no longer observable
is to stop.
Because it's sort of like the last gasp of like,
someone pay attention to me.
Yeah. Right?
And so you wanna do that in multiple situations.
You wanna probably have something like that at startup, like
registering with some sort of central discovery service or sending out some sort of message
saying like, Hey, I'm starting up. And if you don't have a way to acknowledge that someone
heard you be like, okay, well, then I guess I'll stop then. Like having some mechanism
to do that is a great sort of safety mechanism.
Along with heart beating to make sure that everyone, both ends are still actually there and like, are you still there? And I don't just mean TCP ones, and I've had some debates with people about this, but I still think this is the way that I do it.
is if you have a system that encounters a fault,
so going back to our sort of structured logging,
like I've logged an error or an exception,
and I try to send that to somewhere to notify somebody,
right, what happens when that fails?
I think the right thing to do with a certain amount
of retries, like keep retrying, but
like if you retry for some period of time, eventually you probably just want the system
to stop.
Now that's not universally true for every single system.
There are things where it's like, no, this just needs to keep trucking even if it's having
failures.
But all other things being equal, my base argument is if you have a system that has an error, fine,
errors happen. If you have a system that has an error and tries to report its error and it can't,
okay, it should keep retrying. But at a certain point, it should just exit.
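[A rough sketch of that retry-then-stop behavior; the retry counts, backoff, and the stubbed-out notifier are all made up for illustration:]

```cpp
// Hypothetical sketch of "report, retry, then give up": try to tell someone
// about the error a few times, and if nobody can hear us, stop the process
// rather than carry on unobserved.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>

// Stand-in for "send this to the central server"; pretend it always fails.
bool notify_central_server(const char* what) { (void)what; return false; }

void report_or_die(const char* what) {
    for (int attempt = 0; attempt < 5; ++attempt) {
        if (notify_central_server(what)) return;      // someone heard us: carry on
        std::this_thread::sleep_for(
            std::chrono::milliseconds(100 << attempt));  // back off between retries
    }
    // Nobody can hear us: last-gasp log locally, then stop loudly.
    std::fprintf(stderr, "unreportable error, stopping: %s\n", what);
    std::abort();
}

int main() { report_or_die("lost connection to metrics server"); }
```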
I would not disagree with you on that. I mean, just to sort of like remind the listener though,
that you and I come from a world of finance
where there's a lot of regulatory stuff around. If we can't log what we're doing, you know,
again, Knight Capital type stuff, if we can't tell somebody that something is up, then the
best course of action is to stop doing anything further, log everything you can to disk and
then kill the process and be done with it. And hopefully that gets someone's attention,
right? Why are we not trading anymore?
Oh, it turns out the process self-destructed.
Why is that?
Well, there's been a network split and it can't tell us what the positions are,
you know, those kinds of things, and those are more defensible.
But, but yeah, if my pacemaker, um, can't log an error, then maybe I don't want
it to stop.
Um, but you know,
obviously I don't want my home wifi router to turn off because
it can't send logs to someplace that I don't care about the logs for.
Right.
Exactly.
Yeah.
So there are ways in... but I think, as a sensible default, um, and even within
the finance industry, I think, you know, this is something... I've worked on
desks where it is okay to not be up and running. Like, it's not great.
You know, there's going to be some very long meetings where you're going to have
to explain yourself, but, like, if you're not on and trading,
the only thing you lose is an opportunity cost. You know, you weren't able to
make money or whatever. And there are, there are manual ways of trading out of
positions and those kinds of things.
But if you have obligations to an exchange or downstream clients, then
maybe you have to limp on and say, look, it's better for us to continue to be
able to provide this service, albeit disrupted, um, but I've never worked
on a situation like that, so I'm always down with, yes, you know, like literally
my C++ exception handling stuff is like, log everything you can to disk and then kill -9
myself. Like, you know, there's no way we can carry on after this point
here, right?
We are done and dusted.
I don't care if like the destructors don't run properly, just kill the
process at this point and that's always okay.
Yeah.
Yeah.
I tell you though, uh, just to tie this back to testing, 'cause why not?
That's where we started.
The one piece of code I've never really come up with a great way to test is the
code that kills the program.
So there is a, at least in a C++ framework I'm familiar with, there is a
death test and it works by forking the process and then communicating between
the two processes to make sure that this actually kills the process.
Now, unfortunately, that's Unix.
Unix being as complicated as it is, there's signal handling and there's like
child parent relations and you can still not always get it right, but it's not a
bad way of saying this should abort the process, right?
Literally kill the process and be done with it.
And you go, well, okay.
And I will, I'll fork myself here, no, no snickering
in the back. And the child process will do that. And then
the parent process monitors to make sure that that's what
happens through some, you know, Unix domain thing. So you can
write tests for these things. There's never an excuse not to
write a test for something, he says, very well aware that I've just spent the last two weeks writing
very limitedly tested code, but that's a whole other story.
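[For example, GoogleTest's death tests look roughly like this; the fatal_error function here is invented for illustration:]

```cpp
// A small example of a death test using GoogleTest, one framework that has
// this feature: the statement runs in a forked child, and the test passes
// only if the child actually dies and its stderr matches the pattern.
#include <gtest/gtest.h>
#include <cstdio>
#include <cstdlib>

[[noreturn]] void fatal_error(const char* why) {
    std::fprintf(stderr, "fatal: %s\n", why);  // last-gasp logging
    std::abort();                              // then kill the process
}

TEST(FatalErrorDeathTest, AbortsAndLogs) {
    EXPECT_DEATH(fatal_error("positions unknown"), "fatal: positions unknown");
}

int main(int argc, char** argv) {
    ::testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}
```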
Yeah.
All right, friend.
I think, yeah, that's probably a good place to call it.
Right.
Yeah.
This, this expanded from "I have an idea" to 40 minutes' worth of
conversation, which is how it should be.
And I enjoyed it.
But metrics are more useful than you might think,
and you should keep them. Unstructured logging is always a choice too. So yeah, it's a choice. That's for sure.
Alright until next time
Until next time. Our theme music is by Inverse Phase. Find out more at inversephase.com