Two's Complement - Passing Messages
Episode Date: February 14, 2025
Ben and Matt wade into the deep waters of messaging systems, get utterly lost in time synchronization rabbit holes, and discover their new podcast tagline: "We make mistakes so you don't have to." Matt celebrates by getting his car stuck where cars shouldn't go.
Transcript
We've rebooted it and the website will be back online in 2038. Hey Ben! Hey Matt. How are you doing friend? Dancing to the theme music. We both are. I know
It is, yeah. Again, we've said this before, but the new recording setup means that we get to play the intro music, if only just to get the timings roughly right, and so we are bopping away and actually thinking of that in the background. Not that our listener can see, because we never record the video to this, but there's a huge box of nonsense behind me, if you could see it. And so I'm traveling at the moment. I'm at my parents'
house in England, uh, which makes this an international recording. And I've been mining
all of the stuff that I left behind as a kid to send home, which was actually kind of a miracle thing. I had two large boxes of cool stuff of mine that I'd left here because, you know, I wasn't planning on being in the States more than a couple of years. But 14 years on, I'm like, well, maybe I should take this home now. So I just stuck a sticker on the side of these boxes and took them to a local shop. Magic happened, and then my wife texted me a picture of them on my front porch in America. I don't know how that works, but that's how that works.
But I was still so excited to see my things get there in like a day and a half.
It was brilliant.
But now I'm sending all of this stuff off to the person who composed our theme music,
Inverse Phase, Brendan Becker for his museum.
In fact, he's got great news.
He has just raised enough money to buy his own property and he's moving his museum of, like, games and weird and wonderful computer tech paraphernalia over the years into Pittsburgh.
So at some point next year, I will definitely be going to Pittsburgh to see him and the amazing things. Anyway, weirdest thing.
What are we talking about today, Ben?
Not that today.
Today, in addition to that, in addition to cool museums... museums? Medusae? I don't know. We are talking about messaging systems.
Messaging. Yes. So there are various systems in the world that you design as follows.
Some service somewhere is going to send you a reasonably small message,
and you're going to process that message,
and then you're going to send one or maybe more,
maybe zero messages out to another thing,
and the whole architecture of the system is
designed with just this message passing in mind.
Oftentimes, when you have systems like this,
you have distributed computing problems,
you have reproducibility concerns that you need to think about.
So I thought it would be a good idea to talk about some of the things in our experience,
having systems like this,
and we can talk about maybe what some of those systems were
just for context, but in our experience,
building systems like this,
what are some things that you should do
and what are some things that you should definitely not do?
Interesting, yes.
So, you mentioned small messages there. So we're not talking a bulk data thing here. What would be a canonical example of this kind of system?
Yeah. Well, I think starting right off with some of the things that you should not do is I don't think that you should put gigabytes of data into something and call it a message.
Broadly speaking... but you mean, in these kinds of systems, you wouldn't want to...
That is something that I would be skeptical of, if someone was like, well, can't we just take this, you know, three gigabyte file and stick it in there?
So, again, "message". We're talking broadly about something which could be using something like Kafka as a sort of mental model: hey, you're going to just put a sequence of messages somewhere. Or it could be some other system, but I'm just putting Kafka in my head for now. It's something that probably most of the audience base might have heard of.
Um, yeah.
And then obviously that's a great example of something you shouldn't do is putting
a massive, massive, massive message into a message queue system.
They're usually not good at larger pieces of data.
Um, it, sometimes your recipients will, um, want to discard some messages.
And if you curse them to download a three gigabyte file, just to discover
that they don't want it, that's not good behavior. So, you know, the typical solution I can think
of of that is that you normally put bulk data somewhere else, be it like say an S3 bucket
or some shared file system or some other system, and then you send a message that says, hey,
there exists some big data over somewhere else that you can get hold of. Is that the
kind of thing?
Yeah. Yes. That's what I've done as well is you have basically a
pointer to some other large piece of data, whether it's a
file in object storage, or maybe even like, you know, one thing
I've seen is like embedding a SQL query that's, you know,
bitemporal. So when you run it, you always get the same
results, you can put that in the message and be like, Oh, there's
some data available here if you want to query it, right. But, but, but like embedding the core idea here is it don't, don't put a bunch of data
into a messaging system, whether that's just a system that's passing messages or a queue,
right?
Like something like Kafka or some other type of queue instead put in something that
allows you to fetch the, the consumers of that stream to fetch that data if and when
they want based on maybe some metadata that you include in the message.
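(A rough sketch of that pattern, where the message carries only a pointer plus metadata and the bulk payload lives in object storage. The bucket name, helper functions, and the use of boto3 are illustrative assumptions, not anything from the episode.)

```python
import json
import uuid

import boto3  # assumed: any object store client would do

s3 = boto3.client("s3")
BUCKET = "example-bulk-data"  # hypothetical bucket name


def publish_large_payload(payload: bytes, send_message) -> None:
    """Store the bulk data elsewhere and send only a small reference message."""
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)

    # The message itself stays small: a pointer plus enough metadata
    # for consumers to decide whether they even want to fetch it.
    message = {
        "type": "bulk-data-available",
        "location": {"bucket": BUCKET, "key": key},
        "size_bytes": len(payload),
        "content_type": "application/octet-stream",
    }
    send_message(json.dumps(message))


def handle_message(raw: str) -> None:
    """Consumer side: fetch the bulk data only if it's actually wanted."""
    message = json.loads(raw)
    if message["type"] != "bulk-data-available":
        return
    if message["size_bytes"] > 100 * 1024 * 1024:
        return  # example policy: skip payloads this consumer doesn't care about
    loc = message["location"]
    body = s3.get_object(Bucket=loc["bucket"], Key=loc["key"])["Body"].read()
    # ... process body ...
```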
Obviously, by doing that, you have added another system to an otherwise straightforward system. Like, I would need to mock out, if I was testing this, the retrieval system separately from the message queue system.
There's an allure to saying, hey, let's just throw it in the one system and then everything's
a message and we don't need anything outside. So I can see why,
but there is a blurry line. Like, we threw out three gigabytes of data as, that's too much. But maybe, you know, 300K? I don't know. Yeah, right. Yeah. So all of that, I think, starts to
become a lot more context sensitive. And maybe it's worthwhile talking about like some of the systems that we have
built, paint a little picture of some of this context and be able to talk about
the trade-offs that we're talking about here in those contexts.
Yeah.
Okay.
So yeah, what kind of systems have we built? What do you want to start with? What would you be happy talking about first?
Well, I mean, I can kind of go,
you know, the three main systems that I think of that I built
that are like this are there was a sort of infrastructure and
monitoring system that I built at a trading firm. And then at that same trading firm, I actually worked on...
Yes, that is like pantomiming the logo of the system that we built.
And at that same firm, I actually also built a trading system for event trading. So this is like discrete events that are happening in the world, and news is an example of that.
Right, so election results come in kind of thing.
And you're like, hey, if this person wins
and the market moves this way, or if some,
if a drug gets-
Tweets, we would trade tweets, things like that.
Press releases, those kinds of things.
Right.
And that was extremely latency sensitive, right?
Like that trade is basically like, you're racing the speed of light.
And so that had its own speciality.
You and everybody else know that if the Fed puts the interest rate up,
then the market will react in a particular way,
and you want to either take advantage of it or, you know,
or protect your own position, whatever.
But yeah, yeah, yeah, interesting.
Yeah. So like, you know, in that example, like a queue is just right out.
Like you can't queue and that's not going to work.
And then probably the third one was the system that
we collectively built at Coinbase.
I was thinking about that one, yeah.
Which was an exchange, right?
Like Coinbase hired Matt and I and a few other people
to build a replacement for their cloud-based exchange.
And what happened with that is a big long story,
which is maybe another podcast, but nonetheless.
Or not.
Yeah, right, or not.
You can read about it on the internet if you want.
How about that?
I think that's the best way to put it, yes.
I think that's the best way to do that.
But nonetheless, we built an exchange, right? And that is very much a system like this where
you're passing messages around. Right. So those are the three that sort of spring to mind.
Concretely, for those, you know, an exchange in this instance is a service where many people are sending messages into the system to buy and sell a commodity, in this instance various cryptocurrency coins and things. We had to process those, and we had to process them fairly, and we had to process them at
the lowest latency that was reasonable and very, very, very reliably.
We used a very interesting design of a messaging system at the very core, the very guts of
how it all fitted together to give us certain properties that we wanted
to be able to tell our clients that we had,
you know, like fairness and guarantees
over certain things, which was very interesting.
Yeah, no, those are cool.
Where do you want to start?
Do you want to start with the monitoring system or?
Well, those are mine.
Are there any others that you can kind of
throw into the mix here?
I mean, I think in general, receiving market data itself. That is the information that exchanges... the exhaust from an exchange, so the publicly visible information, for some definition of public, about what's going on in any particular market, is disseminated as a set of discrete messages whose format is ordained to you. You get a PDF from the exchange and they say this is how we're going to do it, but you have to be able to sort of keep up and read and process that. So there is a message processing system there. That's the thing I have the most experience with, but I don't get the choice of designing it; I just have to make sure I hit the spec of what's going on there. So I don't think of that in the same way as the other things. So let's stick with yours, and I'll see if anything rings a bell with something that I have done.
Okay. But yeah, so: examples of things to do and not do. So, you know, in the sort of
latency constrained world that I was living in with that event trade
and I would imagine in other places
where you have latency constraints,
you need to be very careful about the messages at rest,
right?
So a more dysfunctional form of this, I think,
is you're building a messaging system,
but in the middle of your messaging system,
you put a database.
So you write data into the database. And then you have
some other thing that is pulling data out of the database, and it's maybe got a cursor or something where it's like, you know, I'm at this row. Actually, that log just lives in a database, and you're just following it down: an insert on one side and a select-the-next-thing on the other. Yeah, yeah.
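(For concreteness, this is roughly the shape of the database-as-queue pattern being described: one side inserts rows, the other polls with a cursor. Table and column names are made up, and SQLite stands in for whatever database you might use.)

```python
import sqlite3
import time

conn = sqlite3.connect("queue.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS messages ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " body TEXT NOT NULL)"
)


def produce(body: str) -> None:
    # Producer side: just insert a row.
    conn.execute("INSERT INTO messages (body) VALUES (?)", (body,))
    conn.commit()


def consume_forever(handle) -> None:
    # Consumer side: remember a cursor (the last id we saw) and poll.
    last_seen = 0
    while True:
        rows = conn.execute(
            "SELECT id, body FROM messages WHERE id > ? ORDER BY id", (last_seen,)
        ).fetchall()
        for row_id, body in rows:
            handle(body)
            last_seen = row_id
        time.sleep(0.1)  # polling interval: a built-in source of latency
```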
And the terrible thing about those designs
is that they kind of mostly work a little bit.
So it's easy to trick yourself into thinking
that you have something that will scale.
And you're like, oh, yeah, this database scales,
I don't know, whatever.
It's some cloud database and it scales infinitely.
Or I've got some cluster of these things
and I can just scale it out horizontally.
But there's not really any magic there. If you've just got
one table, and you're writing things into the one table, and
you have lots of things reading from the one table, you need to
really understand what that database is intended to do and
what it's capable of doing. And maybe ask the question in that
case, you know, do we need something more like Kafka?
Do we need something more?
Right.
Because you're, I mean, not to throw anything in your way, but no, a good
friend of mine once suggested that using a sequence of numbered files is a
perfectly reasonable way of sending messages between systems.
And that's true as well.
So I don't think you're saying that a database is not a solution to some
problems, but certainly when latency is important, you've got too much non determinism and there's too many
moving parts. So what do you do if you have a latency sensitive application that needs to be
able to react as fast as you possibly can, and you still want it to be a message passing system?
I mean, so, you know, again, we're calling on some of our prior experiences here.
Um, not storing the messages, right?
Like having the sender and the receiver directly sending messages to each other, either over, uh, you know, TCP or some sort of reliable multicast protocol, which,
you know, you can Google various options there and see what you
like.
I was going to say that's the whole episode.
Yeah, right. Is a great way to sort of reduce that latency. It
does put constraints on the consumers, depending on exactly
how you do it, to either not create back pressure or to deal
with that back pressure in some way. Like, you know, the
fundamental question to ask is if the consumer doesn't consume
the data, what happens? Right? Where does it live? Does it get dropped? Does it get stuffed
somewhere else that it reads later? And how would it ever possibly catch up? So there's all sorts of
concerns to think about there. But fundamentally, if you've got something where you've got some
latency constraint, I think attacking that problem as I'm going
to write my messages into some sort of storage thingy and then read them back out again,
you just need to be really careful about what kind of latency that's going to introduce
and maybe just going directly is better.
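(A minimal sketch of the direct-send idea, assuming plain UDP sockets; addresses and ports are placeholders. The point is that nothing stores the message in the path, so the back-pressure question has to be answered explicitly.)

```python
import socket


def send(payload: bytes, addr=("10.0.0.2", 9000)) -> None:
    """Sender: fire the message straight at the consumer, no broker or storage in the path."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, addr)


def receive_forever(port=9000):
    """Receiver: if we can't keep up, the kernel socket buffer overflows and messages
    are simply dropped -- which is exactly the question to answer up front: what
    happens when the consumer doesn't consume?"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        payload, _ = sock.recvfrom(65536)
        yield payload
```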
Right.
And I suppose in the limit, if you can do this, which obviously we've kind of glossed over
already, being on the same physical computer means that you can use shared memory transport type
things and a queue that lives only in memory. So there is a, there's a queue, but like only because
you have to have somewhere to put it, you know, so a double buffer, even in the limit of like,
I'm writing to this thing in process A and process B is just waiting for the OK to read
from it as soon as it's finished being written to. But all the things that I've been thinking
about so far have all been some network traffic has happened between a more distributed system
than something that can be literally co-located. Because of course, an even more limiting case: they're in the same thread and they just literally have the memory mapped, you know, it's just a global variable being set, or whatever; a shared variable, I should say. Yeah, so storing the data is sort of orthogonal to, or sort of, the durability of the data: you don't always need durability. Something like Kafka will always give you durability. And as you say, that's the thing that stores it kind of first.
And then everybody gets a copy of it from the brokers that have already stored it.
There's a quorum basis here, and, you know, we know that if a message has been sent, before anyone sees it, some configurable
amount of durability has taken place such that, you know, that message has not been
lost and will definitely be there again, if you have to go back and get it.
And then there's something on the backend as well, where you can say, I know that this message
definitely got processed by at least one of the people that were supposed to do anything with
this message. And so that's really, really good when you're talking about things like financial
transactions and other things where it absolutely needs to happen. We need to have a journal of record, and that journal is more important than the latency we have. In the case of your event trade, presumably, if you dropped a message, or if there are, again, back-pressure related things here, maybe dropping the message is okay: it's better to not hold up the fast people because of that one slower consumer, and have that message be missed by that consumer, than it is to cause them to potentially fire an order too late, or some other issue there.
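(To tie the durability point to something concrete, here is a hedged sketch assuming the kafka-python client: the producer doesn't consider a send complete until the in-sync replicas have it, and the consumer only commits its offset after processing, giving at-least-once behavior. Topic, group, and broker names are made up.)

```python
from kafka import KafkaConsumer, KafkaProducer  # assumed: kafka-python client


def handle(payload: bytes) -> None:
    ...  # application-specific processing


# Producer: with acks="all", a send isn't acknowledged until the in-sync
# replicas have the message -- durability before anyone downstream sees it.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("trades", b"example payload")
producer.flush()

# Consumer: commit the offset only after the message has been processed, so a
# crash means reprocessing (at-least-once) rather than silent loss.
consumer = KafkaConsumer(
    "trades",
    bootstrap_servers="localhost:9092",
    group_id="settlement",
    enable_auto_commit=False,
)
for record in consumer:
    handle(record.value)
    consumer.commit()
```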
Right. Yeah. And another actually interesting dimension of that particular system,
which I think is worth talking about, is that the messages were not sequenced.
We had lots of different messages coming in from different data centers that were all hitting the same system.
And it didn't really matter what sequence they arrived in, right?
Oh, that is interesting.
But oftentimes it is very useful to be able to sequence a stream of messages.
Yes.
Because that allows you to do things like create a state machine.
And then any consumer of that stream should be able to reproduce the same
state of the state machine from the sequence of events. And obviously, a classic example of this
in finance is building a book, but there are lots of situations in which you want to have a sequenced stream of events that you can use to reproduce state in any consumer that sees that stream.
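(A small sketch of the reproduce-state-from-a-sequenced-stream idea, using a toy order book as the state machine. Event shapes and field names are invented for illustration.)

```python
def apply_event(book: dict, event: dict) -> None:
    """Apply one event to the book. Events must be applied in sequence order."""
    if event["type"] == "add":
        book[event["order_id"]] = (event["side"], event["price"], event["qty"])
    elif event["type"] == "remove":
        book.pop(event["order_id"], None)


def rebuild(events) -> dict:
    """Any consumer that replays the same sequence ends up with the same book."""
    book: dict = {}
    expected_seq = 1
    for event in sorted(events, key=lambda e: e["seq"]):
        assert event["seq"] == expected_seq, "gap or reorder detected"
        apply_event(book, event)
        expected_seq += 1
    return book
```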
Right. This is like log structured journals of information, like databases and things. You just need to be able to process them in strict sequence now.
And again, that's okay. So, like, you mentioned building a book, which in our world is taking
this multicast data that flows from the exchange and applying it, um, as the set
of modifications to an empty state to bring your world up to date with whatever
orders are flying around and
are currently active and you absolutely have to apply them in the right sequence
or else things go horribly wrong.
Um, but in that instance, there is a single producer, at least for any one
book, there is exactly one producer that can give you a sequence number.
And therefore you can see if the messages arrive in order.
And so that's, that's an easy proposition. And again, for those folks who are thinking like TCP, again, if you've got a single
connection, it's TCP one end to the other, then the messages that are being sent aren't going to be reordered anyway; that's a property of the transport. But in general, for the kind of UDP messages that we talk about in finance,
that's not true.
And you need to be able to see if you have either received messages out of order, or you've in fact missed one that you need to go and get from some other place.
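(A sketch of the sequence-number bookkeeping being described for feeds like this: detect duplicates, reordering, and gaps, and ask some recovery mechanism for anything missed. The retransmit callback is a stand-in, not a real exchange API, and a real implementation would also buffer out-of-order messages until the gap is filled.)

```python
class SequenceTracker:
    def __init__(self, request_retransmit):
        self.next_expected = 1
        self.request_retransmit = request_retransmit  # callback for gap fill

    def on_message(self, seq: int, payload: bytes) -> None:
        if seq < self.next_expected:
            return  # duplicate or stale message; already processed
        if seq > self.next_expected:
            # We missed next_expected .. seq-1; fetch them from the recovery feed.
            self.request_retransmit(self.next_expected, seq - 1)
        self.process(seq, payload)
        self.next_expected = seq + 1

    def process(self, seq: int, payload: bytes) -> None:
        ...  # application-specific handling
```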
So that's an interesting property again of messages. So we've already talked about durability as one sort of dimension. Another dimension is: what are the constraints on reproducibility and sequencing, which kind of go hand in hand? So just
to sort of take another point here: with something like Kafka, by putting it through a broker, there's somebody who's responsible, at least for a single stream in Kafka, right? As well as the durability guarantees, you have also got a single place of record where the ordering is kind of set in stone.
And so a subsequent read of that will give you back the things in the same order that everyone
saw it in. And that's a useful property in some cases. But going back to your event trade,
you're saying that that's something that you could actually tolerate. And in fact,
you didn't want to take the hit for receiving from multiple systems.
Right. The sequencing process would just slow that down. So we couldn't do it, right? You have to
just design the system to be tolerant of that. But I think something that's really important
to understand, and this is true of Kafka, this might be just a general CAP theorem thing: if you're going to get a sequenced stream of events, then it can be very difficult to build a system that can
scale horizontally with that constraint. Something has to be the sequencer.
The arbiter of what time things happen, which came first.
Right. Yeah.
In the particular case of Kafka, I forget topic versus stream and exactly how that is,
but it's like the thing that gives you
that ordering guarantee cannot scale horizontally.
That is, yes, the stream within a topic.
So topics can have multiple streams
and those streams are kind of a unit by which they are
given to individual members of the Kafka cluster.
And of course you can have multiple processes
and threads and whatever.
So essentially by sending to a single stream, you're sending to a single source, sorry, single destination. And that's the thing that gets to decide. But there's only one of them. If you need to go faster, you need two of them, and now suddenly you no longer have this nice guarantee of a total ordering. And that's what we're talking about here, a total ordering. Yeah. So there's some important trade-offs to consider.
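(For what it's worth, in Kafka's own terminology the unit being described is the partition: a topic is split into partitions, ordering is only guaranteed within a single partition, and messages with the same key land on the same partition. A hedged sketch assuming the kafka-python client, with made-up topic and key names:)

```python
from kafka import KafkaProducer  # assumed: kafka-python client

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Messages with the same key hash to the same partition, so they keep their
# relative order. Spreading across more partitions buys throughput but gives
# up the total ordering discussed above.
for i in range(10):
    producer.send("orders", key=b"instrument-XYZ", value=f"event {i}".encode())
producer.flush()
```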
Why not just use the time as the total ordering?
Well, how much time do you have? Because
well, you said you said you had an hour. So
well, so to start with, what precision? Because all of the
precision you choose, you're going to get some amount of
collision, right? These two events happened at the same
nanosecond, which comes first? I don't know. Right? Yeah. Yeah.
No, exactly. Right. Like that's not a deterministic sort order.
Right. And if you look... yeah, you think that never happens.
Oh, yeah. You know, the birthday paradox kind of thing means that it happens a little bit more often than you would otherwise naively think. But yeah, still, I'm going to admit here: we did use nanoseconds since 1970 as, like, a global key for packets arriving in one of the products
I worked on a number of companies ago. And the solution there was a post-process that arbitrarily picked one of them if it found two with the same key, and just added one nanosecond until it didn't match anymore. Right. Pragmatically, it mostly never happens, but when it does, it really blows your system up. So yeah. Yes.
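(A sketch of that post-processing fix: nanosecond arrival timestamps used as keys, with collisions broken by bumping the later record forward one nanosecond until the keys are unique again. The details are an assumption about how such a fix might look, not the actual code.)

```python
def deduplicate_timestamps(timestamps_ns: list[int]) -> list[int]:
    """Make nanosecond keys unique by nudging duplicates forward by 1 ns."""
    out = []
    last = None
    for ts in sorted(timestamps_ns):
        if last is not None and ts <= last:
            ts = last + 1  # arbitrary but deterministic tie-break
        out.append(ts)
        last = ts
    return out


assert deduplicate_timestamps([5, 5, 5, 9]) == [5, 6, 7, 9]
```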
And then: so how much precision? Great question. And, you know, you and I have been fortunate enough to work in the
finance industry where we already like to have accurate time. So getting a somewhat accurate to
within low digits of nanoseconds is feasible for us. But for most people, that isn't
an option. You can get milliseconds at best. Right.
NTP will get you within plus or minus 15, maybe 20 milliseconds, you know, better than
two people synchronizing their watch in an old spy movie, but not that much better.
Yeah, yeah, yeah.
And I do think it's sort of that false precision problem that leads you into this trap where
you're just like, well, this nanosecond precision timestamp, what are the odds?
Like they can't even physically arrive at the same time, like the photons don't move like that.
It's like, okay, but then what happens when your clocks are
just off, right? Like you're just they're just not that
precise. And so you get two things that have the same
timestamp because your clocks just aren't that precise.
Right. And you know, as soon as you have more than one cable, the photons don't move that way, but you can have two parallel streams of photons that do arrive at exactly the same time.
So you do.
It can, and does, happen.
Yep. Yeah.
So yeah, you can't just use time.
And anyway, whose time are we talking about?
Because right, right.
You know, we're getting into the whole problem.
This is a whole other category of this, which is clock domains, right?
Like synchronizing time
between multiple computers is hard. It requires thought and
oftentimes specialized equipment. And if you just sort
of take it for granted that all clocks everywhere are the same, you're setting yourself up for a lot of hurt; the hurt is coming for you. So anytime
that you're going to be comparing time, you need to be
thinking about what is the source of those clocks and how
precise are they and how accurate are they? And how are
you going to deal with the differences between them? And
what are those differences? What can they be? And you know, what
are the things there? So it can go all the way from, you know,
we've got a GPS antenna that's sitting
on the top of the building,
and we know the precise geographic coordinates
of that antenna, and we know how long the cable is
from that antenna to all of the various servers
that are using that antenna to synchronize their time.
And from the length of those cables,
we can compute the drift from the received signal
and the antenna to each of the individual computers, right?
And unless you're taking that level of precaution
or something kinda like it,
I would not trust any nanosecond timestamp
to be greater or less than anything else, right?
You've missed out even some bits there.
Like when we were doing stuff at previous companies,
there would be a rubidium oscillator with a very high, you know... there's an oven that's got rubidium at some temperature, and that's the thing that you synchronize with the GPS, and everything synchronizes to that with some complicated protocol. And, well, no, I say complicated, but this is my favorite protocol. I remember one of our network engineers saying to me, yeah, we use PPS to synchronize the master clock with the individual clocks on each of the
machines. And I'm like, PPS? Wow, what's that protocol? You know, because I've heard of NTP and I've heard of PTP, but PPS? And he's like, it stands for pulse per second. It literally goes five volts once a second, on the second. I'm like, oh right, that's the protocol.
Just on and off.
Got it.
It's a good simple protocol.
It's a simple protocol.
But yeah, again, you talk about the lead: you know, the cables were very carefully measured and very carefully designed so that the delays they brought in were understood. So yeah, it's complicated.
Right, right. And reasonable people could disagree, because, yeah, you can have a data center full
of things that uses your discipline for clock synchronization, which you're maybe happy
with. But if you take a message from say, an exchange and the exchange is, hey, this
happened at this point in time, you have to trust their ability to manage that. If you
want to say, well, why don't we use their clocks?
You know, whatever we're doing on our side, forget it, let's just use the clocks from the remote people. We have been through this process. You're like, well, that makes sense. You know,
they surely have done something sane. And then of course, what if they haven't? I mean, not that I would throw aspersions at our friends who have a difficult job maintaining these systems. But like,
yeah, things have gone wrong before. And then suddenly you're thrown into a world of hurt because time went backwards
by tens of nanoseconds. And you're like, no, I always expect time to go forward because,
you know, that's one of the few truths along with taxes and death is like forwards.
Nope, you think it does. But I mean, I think that raises a really good point, which is one way that you can get around
this time synchronization difficulty is to never use the system time of the computers
that are in the messaging system and embed time in the messages, right?
And then the ultimate source of the messages is the thing that has to have a reasonably
accurate time.
But the sense of time for all of the downstream system
just comes from that.
And that is really important if you
want to do what we were kind of talking about earlier, where
you have a sequence of messages and you're
trying to reconstitute state based
on that sequence of messages.
If there's any sort of time processing that has to happen,
then embedding the time in the messages allows you to reconstitute that state retroactively, right?
So you can go back and you can replay the messages from three months ago and reconstitute whatever state that you have,
even if it depends on time, because it doesn't depend on the clock of the computer that's just running the simulation or the reproduction.
It's extracting that time from the messages
itself. So you always get exactly the same result.
Yeah. No, I mean, just to take a temporary diversion here, this is one of the things that, in the code base I was working on, we used different types for the different types of time. So they were literally not comparable or convertible between each other without an explicit thing I could search for in the code, saying: we're crossing clock domains right now. I am trying to look at the current time as measured by whatever process has given me the time on my computer, and I'm comparing it to the message time that was embedded in the message through some mechanism, and I have to know that that comes with this huge bag of caveats. It's sometimes useful to do it, because one thing you might want to do is measure
the skew between the two, just to graph it somewhere or just to keep track of it
or just to alert if it gets more than a few hundred milliseconds or something out.
So you do want to be able to do it, but you definitely don't want to be able to
do it just by saying time t equals clock dot now minus message dot time.
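(A sketch of the distinct-types-for-distinct-clock-domains idea. In a statically typed language the bad subtraction fails to compile; in this Python stand-in it is a runtime type error, and the one sanctioned crossing point is an explicit, greppable function. All names are illustrative.)

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalTime:
    """Time read from this machine's clock, in nanoseconds since the epoch."""
    nanos: int


@dataclass(frozen=True)
class MessageTime:
    """Time stamped into a message by its producer, nanoseconds since the epoch."""
    nanos: int


def skew(local: LocalTime, msg: MessageTime) -> int:
    """The one explicit, searchable place where clock domains are crossed.

    Useful for graphing or alerting on drift, with all the usual caveats.
    """
    return local.nanos - msg.nanos


# LocalTime(...) - MessageTime(...) raises a TypeError here (and would fail to
# compile in a statically typed language), because neither type defines
# arithmetic with the other; you have to go through skew() on purpose.
```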
It should be: no, that's a type error, right? The thing is going to fail to compile there. You have to do some work here. And you know, that's always been a worthwhile thing I've found to do. And even
within a computer, you know, like there are different clocks, you've got monotonic clocks
that are guaranteed to not go backwards. You've got clocks that try and like adjust because of
the NTP drift as they're readjusting themselves. You've got the CPU cycle counter, which is measured in its own
domain. So this is something that's useful to have more generally. Gosh, this is really going off
topic, isn't it? This is great. But no, it's a really important thing to know about. And I mean, I think it's worth saying as well, just because it's cool, that it is possible to get networking
hardware to add a timestamp onto the end of packets that flow through it. So there are
certain switches that you can configure. You can plug them into this PPS and get them to synchronize
with your very accurate timestamp. And then every message that flows through that switch gets a
payload on the back of each packet, tacked on after the end of what would normally be the UDP packet
or the TCP packet or whatever. And you need to use exotic mechanisms to go and actually
pull those bytes out, but they are there. And then you can have like a source of truth
that maybe the edge of your network as things come in from the outside world, you say, well,
this is where we're going to timestamp
it. And that's useful for both reconstituting the sequence in
which they arrived at the edge, which is not necessarily the
order that they arrived at you, because cables can vary within the system and routes within your system can vary. But it gives you something to measure things by, and in particular, when
you're doing some of those more latency sensitive things that
we've been talking about, having a sort of ground truth comparison,
that you can look at that timestamp for the thing that came in and look at the timestamp of your
message that went out of the system, you've got, like, that's literally how long it took, warts and
all every network hop. Anyway, that's one of the many sources of clock domains. And we were talking
about clock domains in the context of ordering. So yeah, yeah, well, and that actually brings up
another topic, which is that time stamping is an example
of something else that is a really good practice, which is tracing, right? As each, as the message
flows through your system, and as it's being processed at each stage, it is quite often useful
to be able to embed in the message or maybe as a wrapper around the message,
depending on how you do it, information about the tracing. And that can be useful for performance.
It can be useful for error.
Debugging.
Debugging. Yeah. Just general observability, figuring out like, hey, this message failed
to process. Why? Where did it stop? What problems did it run into? Or it
was really slow to process. Why? What was the bottleneck? What was the slow part?
And sometimes you'll do things like creating some sort of identifier at the point of ingestion
or message creation, and then you can have an external system that refers to the message
as it flows through using that identifier, or sometimes you're literally just adding information into the message object as it's flowing through.
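(A sketch of wrapping a message with tracing information as it flows through the system: a correlation ID assigned at ingestion plus a hop record appended by each stage. Field names are illustrative.)

```python
import time
import uuid


def ingest(payload: dict) -> dict:
    """Wrap a newly arrived message with a trace ID and its first hop record."""
    return {
        "trace_id": str(uuid.uuid4()),
        "hops": [{"stage": "ingest", "at_ns": time.time_ns()}],
        "payload": payload,
    }


def record_hop(message: dict, stage: str, error=None) -> dict:
    """Each processing stage appends where it touched the message, and any error."""
    hop = {"stage": stage, "at_ns": time.time_ns()}
    if error is not None:
        hop["error"] = str(error)
    message["hops"].append(hop)
    return message
```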
Now, incidentally, that is what we used the nanosecond timestamp for, because obviously the hardware
on the outside would put this nanosecond timestamp on every packet.
We're like, well, that's a unique identifier, except when it isn't.
But most of the time it is.
And then it gives you this sort of unique ID, this sort of trace ID, which carries information in its own right, because it's the time that it arrived as well. But yeah, not always, unfortunately, not always unique.
No, I've variously seen this as, you know, provenance or tracing or causality. And I know the OpenTelemetry project keeps being pointed out to me, and I'm going to start looking at that soon; I keep meaning to. They seem to have a whole bunch of stuff around the telemetry of systems more generally,
but I wonder if they have something that also talks of or can be used to correlate. That's
another one: correlation IDs and things like that, where one event and the causality as it traces through your system let you see all the different events. I mean, even on just a website, seeing that someone clicked a button and caused an error, and being able to say that the backend error was caused by this click over here, is useful.
Anyway, sorry, again, really off base here. But yeah, no, I mean, I think these are all dimensions of consideration you need to be thinking about if you're going to build systems like this, right?
Yeah, we've talked about various dimensions of messages so far: we talked about durability, we talked about sequencing, we've now talked about tracing, and sort of determinism. We opened with, you know, don't put giant blocks of data into your messages. And we said, be very careful about which clocks you use. What other considerations are there? I mean, how did your monitoring system do it? Let's think a little bit about the monitoring system.
So that had a very, very high volume of inputs.
Like essentially it was a centralized monitoring system
for the whole company's services.
All the services could send all the stats they wanted to it
and you had to deal with it.
Yeah, I'll tell you one thing,
one mistake that we made,
and this is, you know,
good judgment comes from experience
and experience comes from bad judgment.
And so listeners, I hope that you get to benefit
from all of the bad judgment of the people on this podcast
and the hard-won experience.
And so when I say like,
you need to be careful about clock domains
and you need to think about like
where your source of time is.
One of the great mistakes that we made very early on
in that project, and it's something that just haunted us
forever, is we allowed people who were sending messages
to the system.
So the idea behind the system is that you'd have,
you know, external clients that could send, you know,
telemetry data or I mean, basically anything like prices,
internal application metrics, whatever they wanted,
they could send data to the system.
It worked a little bit like stats D,
if you've ever used stats D.
Yeah, it's sort of like Prometheus type things that,
but it's a lot more,
it was designed for more real time stuff
rather than like once a minute, once a second.
Yes, yes, yes.
It was very much like that.
The idea behind the system was like, you know, it's cool that Grafana has a chart that updates once a minute, but we need something that can update many times
per second because it's monitoring trading systems. And if something happens, we need to know about
it right now. So like human time. But one of the great mistakes that we made with the system was
allowing people to put their own timestamps on those messages. That was a terrible idea.
An absolutely terrible idea.
Absolutely terrible, but easy to do. I can see why you'd want to be able to do this.
You know, I find this quite often with things like our Prometheus setup. Because, you know, hey, I've got a build.
I want to like measure my build time and I want to post it.
And then sometimes I want to go, actually, I want to go back in time and
like run the last hundred builds one day apart from each other.
And I want to populate some data in the database so that I don't just have now data.
I have historic data once I thought I want it right.
And so how bad would it be to let me post stuff that's in the past to you so that I can write my data like, you know, it's a reasonable thing to want to do.
So what was the drawback?
What was the what what made you rue that decision?
Well, because inevitably people want to be able to say like, oh,
and also give me the list of all the messages that were delivered on this day.
And now that's just wrong.
Because your timestamp and my timestamp don't line up for whatever reason.
Right. It could be that you pre- or post-dated your thing,
but you did the calculation wrong.
It could be that what you actually want, when you say the data that was delivered on that day, is the data that was delivered on that day, and not whatever timestamp it had because that came out of your log file or whatever.
This comes back to almost the bitemporal thing. It's like, you know,
there's the time that I got it.
And that's the kind of knowledge time.
When did I know that you said that you wanted this thing?
That's one timestamp.
And then the other timestamp is what time did you say that you wanted this thing
to be known as of or related to?
Sorry.
Uh, and in almost all situations, those two times are coincident or so
close that nobody cares, but not always.
Right.
And I think that's the harder thing. I don't know, have we ever talked about bitemporality? Maybe we have.
I, I don't know.
We must've done in passing.
I don't care.
That's a whole interesting world as well.
You know, like it's, it's, uh, yeah.
You want to say on this day, what messages did you send me?
And then you want to say on this day, what samples fall in this window, which is different from
when did you tell me about those samples? Right? That's a very
again, they're mostly the same. But yeah, that's okay.
Yeah, if I had it to do over again, what I would have said is
no, you cannot specify the timestamp. But you can and this
was true already, you can put whatever data you want in your
message. And you can query based on any of that data. So if you want to have your own log timestamp or ingestion timestamp or whatever, you can add that as a field to your message. My system will be blissfully ignorant of it, other than it's another field that you can do stuff with, and you can do whatever you want with that timestamp.
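(A sketch of that approach: the system stamps every message with its own arrival time, and any timestamp the client wants to attach is just another queryable field with no special meaning. Names are made up.)

```python
import time


def ingest(client_fields: dict) -> dict:
    """Stamp the arrival time server-side; client timestamps are ordinary fields."""
    return {
        "received_at_ns": time.time_ns(),  # the system's own clock, always
        "fields": dict(client_fields),     # may include a client 'log_ts', etc.
    }


record = ingest({"metric": "build_time_s", "value": 412, "log_ts": 1718000000})
# Queries about "what arrived on this day" use received_at_ns; queries about the
# client's own notion of time use record["fields"]["log_ts"].
```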
Yeah, that is your piece of data to do with as you wish, but we know when it arrived with us. That's all we're going to keep as the primary thing that we can...
Yeah, yeah, yeah.
Yes. Also, speaking of timestamps, please, please, please do not put localized timestamps
in your messages. It's a long: it can be nanosecond precision, it can be millisecond precision, it can be second precision, I don't care. But it's a number; please just put a number in there. Don't put some parsed string with a timezone offset.
No, and store it in UTC. Yes, this kind of thing or some well
defined never changing thing. I think I don't
know to what extent it's an open secret or not, but a very large web search company, to this day,
to the best of my understanding, still logs everything in West Coast time, which means that
its logs and the graphs that go with them have, twice a year, either a big gap or a weird doubling back on themselves type of thing.
Um, and it's just the cost of changing it is so high that it hasn't been done.
But yeah, there are times... there's a time and a place for localized time, and it is in application-level things. If you're trying to talk about what time a trade happened on a particular exchange,
it is useful to specify it in the local time of that exchange, say,
because you know that exchange opens at 8:30 local time on that day and closes at 3:30 local time on that day.
But if you have to sit and try and work out or do anything other than compare
with arithmetic operations,
straightforward arithmetic operations on a 64-bit number, then you're doing something wrong.
If you have to kind of work out what day that was and then, Oh,
is it daylight savings or not? Oh no, that, wait a second.
That was in Europe, wasn't it? And they don't do daylight savings.
Ah! Absolutely, absolutely.
Like religion and politics, time localization
should only be discussed in the home.
Like, the international standard is a 64-bit number
and only when you're displaying it or like viewing it
or making a report, do you ever take that 64-bit number
and turn it into some localized time that is localized for the person who is viewing it?
Right. Yes. The system perhaps.
Whatever it is. Yes. No, then that makes sense. Yeah. I think that is that.
And then, yeah, nanoseconds since 1970 is not a bad thing to fit into 64 bits.
That'll get you to, I can't remember when, but it was, you know, it's far enough in the future.
At least right now, I don't have to worry about it before I retire. Although
that is, you know, I'm an old man. So maybe, maybe the younger folk will have to worry
about it. But there are any number of ways of storing time better than that. Or, you know, you can pick your own epoch, right? You don't have to; 1970 is convenient because then you can use the Unix date command to kind of move back and forth.
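(A back-of-the-envelope check on "far enough in the future": signed 64-bit nanoseconds since 1970 runs out around the year 2262.)

```python
from datetime import datetime, timezone

# 2**63 - 1 nanoseconds is roughly 292 years, so nanoseconds-since-1970 in a
# signed 64-bit integer overflows around April 2262.
max_ns = 2**63 - 1
print(max_ns / 1e9 / 86400 / 365.25)                          # ~292.3 years
print(datetime.fromtimestamp(max_ns / 1e9, tz=timezone.utc))  # 2262-04-11...
```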
In fact, one of the first things I do,
I check in all my dot files.
Sorry, this is another sidetrack,
but one of the like fish functions of the shell that I use
is to convert numbers from an epoch time
to like a displayable time and backwards, right?
So I can do epoch and then just type a number
and then, based on however many digits it's got, it guesses whether it's millis, micros or nanos, and then it prints it out in my current time zone. And it is the single most useful thing. I know people go to epochconverter.com, which drives me bonkers to see. You know, why would you go to a website with all these flashing ads and things on it just to convert some numbers, when it's something the command line can do? But on the other hand, it's
Yeah.
But, or, you can just open up a JavaScript console in your favorite browser and paste the timestamp into new Date().
Yeah, that's true.
I'll also give it to you.
That's a great one.
Yeah, I'm remembering that one.
That was even more portable than mine.
It's super, super convenient most of the time.
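(A sketch of the guess-the-unit-from-the-digit-count trick being described, done in Python rather than a fish function: decide whether a raw epoch number is seconds, millis, micros, or nanos, then print it as a UTC time. The digit thresholds are the assumption that matters here.)

```python
from datetime import datetime, timezone


def epoch_to_datetime(value: int) -> datetime:
    """Guess the unit of an epoch number from its digit count and convert to UTC."""
    digits = len(str(abs(value)))
    if digits <= 10:        # e.g. 1718000000 -> seconds
        seconds = value
    elif digits <= 13:      # milliseconds
        seconds = value / 1_000
    elif digits <= 16:      # microseconds
        seconds = value / 1_000_000
    else:                   # nanoseconds
        seconds = value / 1_000_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(epoch_to_datetime(1718000000))             # interpreted as seconds
print(epoch_to_datetime(1718000000000000000))    # interpreted as nanoseconds
```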
Another thing to think about here, and this is kind of getting back to,
I was saying, don't put a database in the middle of your messaging system, right?
Generally, sometimes it's fine. And as you said before, sometimes it's just a file. But like, okay, if I can't do that, then how am I supposed to bridge the gap? Because there will
almost certainly be a gap between the world of stream processing systems and batch processing systems.
At some point, someone's gonna wanna run a database query
or something on your data.
And how do you handle that?
And also this kind of ties into a durability thing
where it's like, if you don't have a system like Kafka
or some other sort of durable queue
in the middle of your system
to kind of keep track of the history, you know, you just have UDP packets, or you have something else, then what should be responsible for sort of keeping the historical record of everything that has ever happened, right?
Right. Which obviously some people don't need, and that's fine. If you're a video game server and you've got player positions that are being updated, then maybe you don't need a log for all time. But if you're working in
finance, it's generally a good idea to keep everything forever for all time in case somebody
comes and asks you a very awkward question about what happened. Yeah, yeah. And this ties in also
to another thing that we were talking about: reproducing state for state machines. So, you know, the cool idea is like,
all right, I'm gonna take my messages,
I'm gonna pass them into some system that processes them.
There will be no other information
that goes into the state machine
other than the messages itself.
And therefore I can completely reproduce the state
from the sequence of messages.
It's like, yes, that's cool,
but what happens when you have seven years worth of messages
and you have to start at the beginning?
That seems bad.
So one of the things that you typically do is you have something that is consuming the
stream of messages whose purpose is to store them and also potentially snapshot them.
So you have something that is consuming the messages, it's writing them into some persistent
store, maybe it's even like transforming them into like something that can fit into it like
a database table or some other format that is nice for bulk processing.
And another thing that it might be doing is running this sort of state machine and taking
a snapshot at some regular interval and then putting that into the storage as well.
So that when you need to reproduce the state for some particular point in time,
rather than having to play all seven years worth of messages through your system,
you can jump to, you know, a prior,
but recent snapshot and then load that state into your system and then only
replay the messages forward from there.
And that will be much faster and much more efficient.
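(A sketch of the snapshot-plus-replay recovery being described: the consumer periodically persists its state alongside the last sequence number applied, and recovery loads the latest snapshot and replays only what came after it. The store object and its methods are stand-ins, as is the toy state machine.)

```python
SNAPSHOT_EVERY = 10_000  # how often to persist a snapshot, in messages


class SnapshottingConsumer:
    def __init__(self, store):
        self.store = store   # assumed: save_snapshot / load_latest_snapshot / messages_after
        self.state = {}
        self.last_seq = 0

    def apply(self, seq: int, message: dict) -> None:
        # Stand-in state machine: the only input is the message stream itself.
        self.state[message["key"]] = message["value"]
        self.last_seq = seq
        if seq % SNAPSHOT_EVERY == 0:
            self.store.save_snapshot(seq, dict(self.state))

    def recover(self) -> None:
        # Jump to a recent snapshot instead of replaying years of messages.
        seq, state = self.store.load_latest_snapshot()
        self.state, self.last_seq = state, seq
        for seq, message in self.store.messages_after(seq):
            self.apply(seq, message)
```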
Right, right, right.
Provided there exists a sensible snapshot format, which is an interesting...
So I think this has now sort of moved into what I think of as like a log structured journal of like, you know,
like you have some database or in memory representation of the world that you update through seeing these events.
For some things, so for example, to build the set of live orders on an exchange, that is the prices of like Google and all the people that are trying to buy and sell Google.
You can unambiguously snapshot that state and go: okay, at this point in time, at nine in the morning, these are everyone's orders.
And now if you just load up this 9am, you can carry on.
You don't have to load up, you know, the 7am one, or replay the whole day; that's fine.
Right.
But as soon as you start getting to things that have state that is like non-trivial,
now it becomes a function of the processor of that state.
So let me give you an example.
What if you were keeping some kind of exponential moving average of some
of the factors of that, that depends on how long the window of your exponential
is and some other properties of that.
What, what do you count, which kinds of information go into that or don't.
And now you've got a complicated piece of state that is arguably different
for every client. You know, maybe some people care about a 10 day look back
and other people want to, you know, a five minute look back.
And so that gets kind of tricky.
But yeah, I don't know. I don't know where I'm going with this now.
But it's just not as straightforward for application domains if they have any kind of state that requires some history in order to get to that point, other than the pure individual add/remove of, say...
Yeah, yeah, yeah.
...unambiguous stuff. Yeah.
Yeah. And that state can get quite large because of these constraints. And I think
this is something that is really important to think about because this kind
of snapshotting becomes very important when you think about error recovery. And there's two
dimensions of error recovery that I think we can talk about here. One is you've got some consumer
of this stream and it's crashed and now you want to restart it. What state do you need to restart it, right? What state do you need to let it sort of rejoin the stream?
Again, do you have to go back to the beginning of time
and process seven years with the messages
for your system to restart?
That's going to be bad, right?
So-
We'll fix it next year.
Yeah, right.
Yeah, I guess so, yeah.
We've rebooted it and the website will be back online in 2038.
So you have to think about the state
if you want to be able to recover.
And you need to think about how you can reasonably
snapshot that state if you want to be able to spin something
back up and have it sort of rejoin the stream, right?
And so I think you have to consider that
from the very beginning. Like, how big is the state? How often can we snapshot it? What is our sort of acceptable amount of downtime here for these various things? You know, is it like an hour? Is it a minute? Is it, you know, a month? And how are we going to be able to rejoin this processing? Otherwise, we can never turn the software off.
Right, which is an option.
Um, just don't write any bugs.
I don't have any hardware faults.
Um, yes, we'll be golden.
Yeah, yeah, yeah, yeah.
Um, another dimension to think about with fault tolerance with these systems is poisoned messages.
Yeah, so that is a very common situation
where there's a bug in your system or a bug in
a producer system perhaps and you receive a message that you can't process. And redundancy
here will not save you. You can have like 10 redundant systems that are all consuming the stream
and processing the messages so that if one runs out of memory or whatever, the other nine are there.
But if they all have the same bug
and they all get the same message,
the whole point of the distributed state machine
is that they are all going to do the same thing, which is not
process your message.
They'll all crash.
I mean, they might crash.
All kinds of manner of problems can happen here.
So one common approach to dealing with these things
is creating what's called a dead letter queue.
So you have a message that comes in and your system cannot process it, but it's able to detect that it can't process it.
Maybe it raises an error. Maybe there's some validation step, whatever it is.
And it's like, I can't process this message.
So what I'm going to do is I'm going to take it. I'm going to put it into another queue.
Another thing, a stream of messages, called the dead letter queue.
And it's going to sit there until somebody does something with it.
Now the first thing that you want to do with it is send some kind of notification or alert
or something.
Someone's phone goes off.
Yes. Like, you know, somebody's getting paged. It's like, ah, we just got a message we don't know how to process, right?
But if you do that, then depending on the state machine
that you're trying to reproduce, if you have one,
or just the message processing that you're doing,
it can sometimes be OK to say, OK, I'm going to take this message.
I'm going to put it in the dead letter queue.
And then I'm just going to keep going, right?
I'm going to pretend like I never even got this message, because it's malformed, or there's some other problematic thing with it, and I'm just going to keep going.
You can obviously run into situations where there's just a bug in your code.
And this is a message that you need to process and you didn't process it
correctly and now you're, now you're doing wrong.
Yeah.
Right.
Um, but there are also situations in which you have one of these messages and
it is truly something that is malformed and can be ignored, was never supposed to be created in the first place, and now you can just continue on. Having this in the dead letter queue, a common pattern that I have used with great effect is being able to basically redrive those dead letter queue messages back into the main queue.
Oh, if sequencing doesn't
matter.
Right.
If sequencing matters, then you can't do this.
Right?
But if you have a system where there's no sequencer or there is a sequencer, but it
doesn't really matter all that much, then you can take these messages and be like, all
right, we got this message we don't know how to handle.
It went into the dead letter queue.
We're now going to change the code so that it can handle this message in
some way, redeploy that, and then redrive the message back
into the queue so that it can be correctly processed and flow all
the way through. Right. And that is a really nice way to handle
it if you can.
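(A sketch of the dead letter queue pattern as described: failures are parked on a separate queue with an alert, processing continues, and after a fix the parked messages can be redriven onto the main queue. The queue objects are stand-ins with put/get/empty methods, and redriving is only safe if ordering doesn't matter.)

```python
import json


def consume(main_queue, dead_letter_queue, process, alert) -> None:
    """Process messages, parking anything unprocessable on the DLQ and moving on."""
    while True:
        raw = main_queue.get()
        try:
            process(json.loads(raw))
        except Exception as exc:
            dead_letter_queue.put(raw)
            alert(f"unprocessable message parked on DLQ: {exc}")
            # Keep going: acceptable only if skipping a message doesn't corrupt
            # whatever state machine you're maintaining downstream.


def redrive(dead_letter_queue, main_queue) -> None:
    """After deploying a fix, push parked messages back through the main queue."""
    while not dead_letter_queue.empty():
        main_queue.put(dead_letter_queue.get())
```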
If you're able to do that, then yes, that's really nice. So particularly, for example, if this was some, you know, holiday booking stream of information, like a centralized holiday booking thing. And then someone comes in and they've just booked some suite, and the price is higher than you've ever hit before, and some internal issue happens, and you're like, oh, damn, you know, we can't book this for them because it's legitimately a $100,000-a-night thing, and that just overflows something, and we're done. But you're like, this is really valuable business.
Ben, can you hot fix that very, very quickly, write a test, fix the test, deploy the thing, and
then we're going to put it back in again. And then the booking goes through, albeit 30 minutes, an
hour late, at least it gets done and you capture the revenue and everyone's happy.
And you know, yeah, that seems like a really nice way to heal the system in that instance. But obviously sometimes it can be a legitimate bug or a malformed message or something like that, and you have to be able to deal with it. Yeah. Because, as you say, fault tolerance was a dimension that you talked about. So another
dimension for message processing systems is that like things go
wrong, computers go wrong, and it's entirely reasonable to have
more than one person, well, not one person more than one system,
listening to this stream of messages, and independently
processing them and updating them. And then if the machine
breaks, well, you've got two more of them, and that's okay. And then you have to have a
system behind that system that determines what the actual
outcome of any particular update was. But you've got fault
tolerance by scaling through a messaging system. And that's
that that's a really interesting solution. And part of the
solution that we put together at the aforementioned cryptocurrency
trading place, which was a really interesting solution for a
number of things that we were doing, wasn't it? It allowed us to do rolling updates of the code
because we could have a quorum of five machines doing the same processing and then take two of
them out of the system, upgrade them and then put them back in again and then run them in silent
mode and check
that everyone's still agreed on everything that was happening. And then only when we were confident
that we hadn't introduced a new bug, we could add them back into the pool and then start rolling
over the other three. And there you go. Now you can do rolling upgrades and you're never down.
Hooray. It let us do things like have different configurations of those computers, be it through different JVM settings or different hardware or whatever, such that if one of them processed the message faster than the others, or one had to GC, say, or one was doing some JIT work or whatever, we could make sure that as long as two or three came up with a good answer that we were happy with, the other two could be slower, and that's fine.
And that meant that we could hide some of our tail latencies in the quorum, which was,
you know, so we got all these wonderful, and obviously, yeah, if we had an equipment failure,
then you know, two or three of those machines could die and the machine and the site would
stay up and we'd be able to process transactions and everything.
And that was super cool. And was, was definitely
eye opening to me working there in terms of like, Hey, you get a lot of benefits from
doing it this way. That's great.
Yeah. Yeah. I had that, that same experience. And we had done some things like that at,
uh, my, my previous company when we were basically intentionally creating races between systems
because we were trying to get them to run as fast as possible.
And it created an opportunity to make the system more
fault tolerant, where you'd have multiple parallel things that
are all processing the same stuff.
And the first one to finish wins.
And so if there's some variation in the latency
because of some operating system level thing or garbage collection, because some of this was Java
or something else had happened, right?
Or one of them was just offline and was losing every race
because it just wasn't processing anything.
It was all fine, right?
I think one of the more interesting things from that
is, if you want to be tolerant of certain types of failures, you know, like gamma-ray-burst type stuff where bits just get flipped, then the number of systems that you need to do this is three.
The number of the counting is three, because you need to have three of them: not one, not two. Five is right out. And five is actually kind of fine in this case, but you need at least three, right? Because if you have two, and one says the answer is A and the other says the answer is B, you don't know which is right. You need three so that you can compare: okay, two of them say it's A, and one of them says it's B, so B is suspect. And if you have five, and four of them say it's A and one of them says it's B, then that's even better, right?
Yeah. So if they're running the same version of the code and they still disagree, it's time to start looking through your radiation hardening protocol for what on earth happened, or check the dmesg for any kind of uncorrectable memory errors and things of that nature.
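(A sketch of the "number of the counting is three" point: with at least three independent replicas, a single corrupted answer can be outvoted; with only two, you can't tell which one is wrong.)

```python
from collections import Counter


def agreed_result(results: list):
    """Return the majority answer from independent replicas, or raise if there isn't one."""
    if len(results) < 3:
        raise ValueError("need at least three replicas to outvote a bad one")
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority; check all replicas")
    return value


assert agreed_result(["A", "A", "B"]) == "A"
assert agreed_result(["A", "A", "A", "A", "B"]) == "A"
```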
But yeah. I think I've just looked at the time, and, you know, given that we hadn't really got a plan, which, as our regular listener will know, is how we do this, we've covered quite a lot of ground,
although I don't know that we covered our intended topic exactly as I would have
done if we'd have written out something before,
because we went on so many tangents, but in a good way, like we talked about
time, we talked about durability, we talked about scalability, um, and all
these things come out of a message-based system, or can come out of a message basis, especially if you have this sort of journal-based thing where you say the sequence of messages is the only input into my state machine, and I can trivially start from the beginning of time and get to exactly the same state. Or we can snapshot, if we know which internal state is important, at different points along that time, and have the best of all worlds, which is super cool. Yeah. So, by way of saying maybe we should stop here, that's why I bring all that up. This has been super cool. And definitely some deep memories there
from previous companies coming up there.
Bringing back the hard lessons of systems past.
We make mistakes so you don't have to.
It's our new tagline on this.
It's our new tagline, okay.
We've certainly made, I've definitely made plenty of mistakes
as well you know.
As I shared on social media, a picture of me driving a car through a place where cars shouldn't go, and, well, I got the car wedged. Yeah, it's the gigabytes' worth of car in your messaging system. Yeah, it didn't work very well. Anyway.
Yeah. Well, I think we should leave it there, my friend. Thank you as ever for joining me in this endeavor of trying to... I don't know what we're doing. Trying to be entertaining and enjoy ourselves, and hopefully be interesting and useful to other people too.
Yeah, yeah, this is a good one.
Alright, friend, until next time.
You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt. Theme music by Inverse Phase. Find out more at inversephase.com