Two's Complement - Passing Messages
Episode Date: February 14, 2025
Ben and Matt wade into the deep waters of messaging systems, get utterly lost in time synchronization rabbit holes, and discover their new podcast tagline: "We make mistakes so you don't have to." Matt celebrates by getting his car stuck where cars shouldn't go.
Transcript
We've rebooted it and the website will be back online in 2038. Hey Ben! Hey Matt. How are you doing friend? Dancing to the theme music. We both are. I know
It is, yeah. Again, we've said this before, but the new recording setup means that we get to play the intro music, if only just to get the timings roughly right, and so we are bopping away and actually thinking of that in the background. Not that our listener can see, because we never record the video to this, but there's a huge box of nonsense behind me, if you could see it. And so I'm traveling at the moment. I'm at my parents'
house in England, uh, which makes this an international recording. And I've been mining
all of the stuff that I left behind as a kid to send home, which was actually kind of a miracle thing. I had two large boxes of cool stuff of mine that I'd left here because, you know, I wasn't planning on being in the States more than a couple of years. But 14 years on, I'm like, well, maybe I should take this home now. So I just stuck a sticker on the side of these boxes and took them to a local shop. Magic happened, and then my wife texted me a picture of them on my front porch in America. I don't know how that works, but that's how that works.
But I was still so excited to see my things get there in like a day and a half.
It was brilliant.
But now I'm sending all of this stuff off to the person who composed our theme music,
Inverse Phase, Brendan Becker for his museum.
In fact, he's got great news.
He has just raised enough money to buy his own property and he's moving his museum of, like, games and weird and wonderful computer tech paraphernalia over the years into Pittsburgh.
So at some point next year, I will definitely be going to Pittsburgh to see him and the amazing things. Anyway, weirdest thing.
What are we talking about today, Ben?
Not that today.
Today, in addition to that, in addition to cool museums... museums? Medusae? I don't know. We are talking about messaging systems.
Messaging. Yes. So there are various systems in the world that you design as follows.
Some service somewhere is going to send you a reasonably small message,
and you're going to process that message,
and then you're going to send one or maybe more,
maybe zero messages out to another thing,
and the whole architecture of the system is
designed with just this message passing in mind.
Oftentimes, when you have systems like this,
you have distributed computing problems,
you have reproducibility concerns that you need to think about.
So I thought it would be a good idea to talk about some of the things in our experience,
having systems like this,
and we can talk about maybe what some of those systems were
just for context, but in our experience,
building systems like this,
what are some things that you should do
and what are some things that you should definitely not do?
Interesting, yes.
So, you mentioned small messages there. So we're not talking a bulk data thing here. What would be a canonical example of this kind of system?
Yeah. Well, I think starting right off with some of the things that you should not do is I don't think that you should put gigabytes of data into something and call it a message.
Broadly speaking... but you mean, in these kinds of systems, you wouldn't want to...
That is something that I would be skeptical of, if someone was like, well, can't we just take this, you know, three gigabyte file and stick it in there?
So, again, "message". We're talking broadly about something which could be using something like Kafka as a sort of mental model: hey, you're going to just put a sequence of messages somewhere. Or it could be some other system, but I'm just putting Kafka in my head for now. It's something that probably most of the audience base might have heard of.
Um, yeah.
And then obviously that's a great example of something you shouldn't do is putting
a massive, massive, massive message into a message queue system.
They're usually not good at larger pieces of data.
Um, it, sometimes your recipients will, um, want to discard some messages.
And if you curse them to download a three gigabyte file, just to discover
that they don't want it, that's not good behavior. So, you know, the typical solution I can think
of of that is that you normally put bulk data somewhere else, be it like say an S3 bucket
or some shared file system or some other system, and then you send a message that says, hey,
there exists some big data over somewhere else that you can get hold of. Is that the
kind of thing?
Yeah. Yes. That's what I've done as well is you have basically a
pointer to some other large piece of data, whether it's a
file in object storage, or maybe even like, you know, one thing
I've seen is like embedding a SQL query that's, you know,
bitemporal. So when you run it, you always get the same
results, you can put that in the message and be like, Oh, there's
some data available here if you want to query it, right. But, but, but like embedding the core idea here is it don't, don't put a bunch of data
into a messaging system, whether that's just a system that's passing messages or a queue,
right?
Like something like Kafka or some other type of queue instead put in something that
allows you to fetch the, the consumers of that stream to fetch that data if and when
they want based on maybe some metadata that you include in the message.
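(A rough sketch of that pattern, where the message carries only a pointer plus metadata and the bulk payload lives in object storage. The bucket name, helper functions, and the use of boto3 are illustrative assumptions, not anything from the episode.)

```python
import json
import uuid

import boto3  # assumed: any object store client would do

s3 = boto3.client("s3")
BUCKET = "example-bulk-data"  # hypothetical bucket name


def publish_large_payload(payload: bytes, send_message) -> None:
    """Store the bulk data elsewhere and send only a small reference message."""
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)

    # The message itself stays small: a pointer plus enough metadata
    # for consumers to decide whether they even want to fetch it.
    message = {
        "type": "bulk-data-available",
        "location": {"bucket": BUCKET, "key": key},
        "size_bytes": len(payload),
        "content_type": "application/octet-stream",
    }
    send_message(json.dumps(message))


def handle_message(raw: str) -> None:
    """Consumer side: fetch the bulk data only if it's actually wanted."""
    message = json.loads(raw)
    if message["type"] != "bulk-data-available":
        return
    if message["size_bytes"] > 100 * 1024 * 1024:
        return  # example policy: skip payloads this consumer doesn't care about
    loc = message["location"]
    body = s3.get_object(Bucket=loc["bucket"], Key=loc["key"])["Body"].read()
    # ... process body ...
```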
Obviously, by doing that, you have added another system to an otherwise straightforward system. Like, I would need to mock out, if I was testing this, the retrieval system separately from the message queue system.
There's an allure to saying, hey, let's just throw it in the one system and then everything's
a message and we don't need anything outside. So I can see why,
but there is a blurry line. Like, we threw out three gigabytes of data as, that's too much. But maybe, you know, 300K? I don't know. Yeah, right. Yeah. So all of that, I think, starts to
become a lot more context sensitive. And maybe it's worthwhile talking about like some of the systems that we have
built, paint a little picture of some of this context and be able to talk about
the trade-offs that we're talking about here in those contexts.
Yeah.
Okay.
So yeah, what kind of systems have we built? What do you want to start with? What would you be happy talking about first?
Well, I mean, I can kind of go,
you know, the three main systems that I think of that I built
that are like this are there was a sort of infrastructure and
monitoring system that I built at a trading firm. And then at that same trading firm, I actually worked on...
Yes, that is like pantomiming the logo of the system that we built.
And at that same firm, I actually also built a trading system for event trading. So this is like discrete events that are happening in the world, and news is an example of that.
Right, so election results come in kind of thing.
And you're like, hey, if this person wins
and the market moves this way, or if some,
if a drug gets-
Tweets, we would trade tweets, things like that.
Press releases, those kinds of things.
Right.
And that was extremely latency sensitive, right?
Like that trade is basically like, you're racing the speed of light.
And so that had its own speciality.
You and everybody else know that if the Fed puts the interest rate up,
then the market will react in a particular way,
and you want to either take advantage of it or, you know,
or protect your own position, whatever.
But yeah, yeah, yeah, interesting.
Yeah. So like, you know, in that example, like a queue is just right out.
Like you can't queue and that's not going to work.
And then probably the third one was the system that
we collectively built at Coinbase.
I was thinking about that one, yeah.
Which was an exchange, right?
Like Coinbase hired Matt and I and a few other people
to build a replacement for their cloud-based exchange.
And what happened with that is a big long story,
which is maybe another podcast, but nonetheless.
Or not.
Yeah, right, or not.
You can read about it on the internet if you want.
How about that?
I think that's the best way to put it, yes.
I think that's the best way to do that.
But nonetheless, we built an exchange, right? And that is very much a system like this where
you're passing messages around. Right. So those are the three that sort of spring to mind.
Concretely, for those, you know, an exchange in this instance is a service where many people are sending messages into the system to buy and sell a commodity, in this instance various cryptocurrency coins and things. We had to process those, and we had to process them fairly, and we had to process them at
the lowest latency that was reasonable and very, very, very reliably.
We used a very interesting design of a messaging system at the very core, the very guts of
how it all fitted together to give us certain properties that we wanted
to be able to tell our clients that we had,
you know, like fairness and guarantees
over certain things, which was very interesting.
Yeah, no, those are cool.
Where do you want to start?
Do you want to start with the monitoring system or?
Well, those are mine.
Are there any others that you can kind of
throw into the mix here?
I mean, I think in general, receiving market data itself. That is the information that exchanges... the exhaust from an exchange, so the publicly visible information, for some definition of public, about what's going on in any particular market, is disseminated as a set of discrete messages whose format is ordained to you. You get a PDF from the exchange and they say this is how we're going to do it, but you have to be able to sort of keep up and read and process that. So there is a message processing system there. That's the thing I have the most experience with, but I don't get the choice of designing it; I just have to make sure I hit the spec of what's going on there. So I don't think of that in the same way as the other things. So let's stick with yours, and I'll see if anything rings a bell with something that I have done.
Okay. But yeah, so: examples of things to do and not do. So, you know, in the sort of
latency constrained world that I was living in with that event trade
and I would imagine in other places
where you have latency constraints,
you need to be very careful about the messages at rest,
right?
So a more dysfunctional form of this, I think,
is you're building a messaging system,
but in the middle of your messaging system,
you put a database.
So you write data into the database. And then you have
some other thing that is pulling data out of the database, and it's maybe got a cursor or something where it's like, you know, I'm at this row. Actually, that log just lives in a database, and you're just following it down: an insert on one side and a select-the-next-thing on the other. Yeah, yeah.
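(For concreteness, this is roughly the shape of the database-as-queue pattern being described: one side inserts rows, the other polls with a cursor. Table and column names are made up, and SQLite stands in for whatever database you might use.)

```python
import sqlite3
import time

conn = sqlite3.connect("queue.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS messages ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " body TEXT NOT NULL)"
)


def produce(body: str) -> None:
    # Producer side: just insert a row.
    conn.execute("INSERT INTO messages (body) VALUES (?)", (body,))
    conn.commit()


def consume_forever(handle) -> None:
    # Consumer side: remember a cursor (the last id we saw) and poll.
    last_seen = 0
    while True:
        rows = conn.execute(
            "SELECT id, body FROM messages WHERE id > ? ORDER BY id", (last_seen,)
        ).fetchall()
        for row_id, body in rows:
            handle(body)
            last_seen = row_id
        time.sleep(0.1)  # polling interval: a built-in source of latency
```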
And the terrible thing about those designs
is that they kind of mostly work a little bit.
So it's easy to trick yourself into thinking
that you have something that will scale.
And you're like, oh, yeah, this database scales,
I don't know, whatever.
It's some cloud database and it scales infinitely.
Or I've got some cluster of these things
and I can just scale it out horizontally.
But there's not really any magic there. If you've just got
one table, and you're writing things into the one table, and
you have lots of things reading from the one table, you need to
really understand what that database is intended to do and
what it's capable of doing. And maybe ask the question in that
case, you know, do we need something more like Kafka?
Do we need something more?
Right.
Because you're, I mean, not to throw anything in your way, but no, a good
friend of mine once suggested that using a sequence of numbered files is a
perfectly reasonable way of sending messages between systems.
And that's true as well.
So I don't think you're saying that a database is not a solution to some
problems, but certainly when latency is important, you've got too much non determinism and there's too many
moving parts. So what do you do if you have a latency sensitive application that needs to be
able to react as fast as you possibly can, and you still want it to be a message passing system?
I mean, so, you know, again, we're calling on some of our prior experiences here.
Um, not storing the messages, right?
Like having the sender and the receiver directly sending messages to each other, either over, uh, you know, TCP or some sort of reliable multicast protocol, which,
you know, you can Google various options there and see what you
like.
I was going to say that's the whole episode.
Yeah, right. Is a great way to sort of reduce that latency. It
does put constraints on the consumers, depending on exactly
how you do it, to either not create back pressure or to deal
with that back pressure in some way. Like, you know, the
fundamental question to ask is if the consumer doesn't consume
the data, what happens? Right? Where does it live? Does it get dropped? Does it get stuffed
somewhere else that it reads later? And how would it ever possibly catch up? So there's all sorts of
concerns to think about there. But fundamentally, if you've got something where you've got some
latency constraint, I think attacking that problem as I'm going
to write my messages into some sort of storage thingy and then read them back out again,
you just need to be really careful about what kind of latency that's going to introduce
and maybe just going directly is better.
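(A minimal sketch of the direct-send idea, assuming plain UDP sockets; addresses and ports are placeholders. The point is that nothing stores the message in the path, so the back-pressure question has to be answered explicitly.)

```python
import socket


def send(payload: bytes, addr=("10.0.0.2", 9000)) -> None:
    """Sender: fire the message straight at the consumer, no broker or storage in the path."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, addr)


def receive_forever(port=9000):
    """Receiver: if we can't keep up, the kernel socket buffer overflows and messages
    are simply dropped -- which is exactly the question to answer up front: what
    happens when the consumer doesn't consume?"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        payload, _ = sock.recvfrom(65536)
        yield payload
```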
Right.
And I suppose in the limit, if you can do this, which obviously we've kind of glossed over
already, being on the same physical computer means that you can use shared memory transport type
things and a queue that lives only in memory. So there is a, there's a queue, but like only because
you have to have somewhere to put it, you know, so a double buffer, even in the limit of like,
I'm writing to this thing in process A and process B is just waiting for the OK to read
from it as soon as it's finished being written to. But all the things that I've been thinking
about so far have all been some network traffic has happened between a more distributed system
than something that can be literally co-located. Because of course, an even more limiting case: they're in the same thread and they just literally have the memory mapped, you know, it's just a global variable being set, or whatever; a shared variable, I should say. Yeah, so storing the data is sort of orthogonal to, or sort of, the durability of the data: you don't always need durability. Something like Kafka will always give you durability. And as you say, that's the thing that stores it kind of first.
And then everybody gets a copy of it from the brokers that have already stored it.
There's a quorum basis here, and, you know, we know that if a message has been sent, before anyone sees it, some configurable
amount of durability has taken place such that, you know, that message has not been
lost and will definitely be there again, if you have to go back and get it.
And then there's something on the backend as well, where you can say, I know that this message
definitely got processed by at least one of the people that were supposed to do anything with
this message. And so that's really, really good when you're talking about things like financial
transactions and other things where it absolutely needs to happen. We need to have a journal of record, and that journal is more important than the latency we have. In the case of your event trade, presumably, if you dropped a message, or if there are, again, back-pressure related things here, maybe dropping the message is okay: it's better to not hold up the fast people because of that one slower consumer, and have that message be missed by that consumer, than it is to cause them to potentially fire an order too late, or some other issue there.
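(To tie the durability point to something concrete, here is a hedged sketch assuming the kafka-python client: the producer doesn't consider a send complete until the in-sync replicas have it, and the consumer only commits its offset after processing, giving at-least-once behavior. Topic, group, and broker names are made up.)

```python
from kafka import KafkaConsumer, KafkaProducer  # assumed: kafka-python client


def handle(payload: bytes) -> None:
    ...  # application-specific processing


# Producer: with acks="all", a send isn't acknowledged until the in-sync
# replicas have the message -- durability before anyone downstream sees it.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("trades", b"example payload")
producer.flush()

# Consumer: commit the offset only after the message has been processed, so a
# crash means reprocessing (at-least-once) rather than silent loss.
consumer = KafkaConsumer(
    "trades",
    bootstrap_servers="localhost:9092",
    group_id="settlement",
    enable_auto_commit=False,
)
for record in consumer:
    handle(record.value)
    consumer.commit()
```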
Right. Yeah. And another actually interesting dimension of that particular system,
which I think is worth talking about, is that the messages were not sequenced.
We had lots of different messages coming in from different data centers that were all hitting the same system.
And it didn't really matter what sequence they arrived in, right?
Oh, that is interesting.
But oftentimes it is very useful to be able to sequence a stream of messages.
Yes.
Because that allows you to do things like create a state machine.
And then any consumer of that stream should be able to reproduce the same
state of the state machine from the sequence of events. And obviously, a classic example of this
in finance is building a book, but there are lots of situations in which you want to have a sequenced stream of events that you can use to reproduce state in any consumer that sees that stream.
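(A small sketch of the reproduce-state-from-a-sequenced-stream idea, using a toy order book as the state machine. Event shapes and field names are invented for illustration.)

```python
def apply_event(book: dict, event: dict) -> None:
    """Apply one event to the book. Events must be applied in sequence order."""
    if event["type"] == "add":
        book[event["order_id"]] = (event["side"], event["price"], event["qty"])
    elif event["type"] == "remove":
        book.pop(event["order_id"], None)


def rebuild(events) -> dict:
    """Any consumer that replays the same sequence ends up with the same book."""
    book: dict = {}
    expected_seq = 1
    for event in sorted(events, key=lambda e: e["seq"]):
        assert event["seq"] == expected_seq, "gap or reorder detected"
        apply_event(book, event)
        expected_seq += 1
    return book
```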
Right. This is like log structured journals of information, like databases and things. You just need to be able to process them in strict sequence now.
And again, that's okay. So, like, you mentioned building a book, which in our world is taking
this multicast data that flows from the exchange and applying it, um, as the set
of modifications to an empty state to bring your world up to date with whatever
orders are flying around and
are currently active and you absolutely have to apply them in the right sequence
or else things go horribly wrong.
Um, but in that instance, there is a single producer, at least for any one
book, there is exactly one producer that can give you a sequence number.
And therefore you can see if the messages arrive in order.
And so that's, that's an easy proposition. And again, for those folks who are thinking like TCP, again, if you've got a single
connection, it's TCP one end to the other, then the messages that are being sent aren't going to be reordered anyway; that's a property of the transport. But in general, for the kind of UDP messages that we talk about in finance,
that's not true.
And you need to be able to see if you have either received messages out of order, or you've in fact missed one that you need to go and get from some other place.
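(A sketch of the sequence-number bookkeeping being described for feeds like this: detect duplicates, reordering, and gaps, and ask some recovery mechanism for anything missed. The retransmit callback is a stand-in, not a real exchange API, and a real implementation would also buffer out-of-order messages until the gap is filled.)

```python
class SequenceTracker:
    def __init__(self, request_retransmit):
        self.next_expected = 1
        self.request_retransmit = request_retransmit  # callback for gap fill

    def on_message(self, seq: int, payload: bytes) -> None:
        if seq < self.next_expected:
            return  # duplicate or stale message; already processed
        if seq > self.next_expected:
            # We missed next_expected .. seq-1; fetch them from the recovery feed.
            self.request_retransmit(self.next_expected, seq - 1)
        self.process(seq, payload)
        self.next_expected = seq + 1

    def process(self, seq: int, payload: bytes) -> None:
        ...  # application-specific handling
```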
So that's an interesting property again of messages. So we've already talked about durability as one sort of dimension. Another dimension is: what are the constraints on reproducibility and sequencing, which kind of go hand in hand? So just
to sort of take another point here: with something like Kafka, by putting it through a broker, there's somebody who's responsible, at least for a single stream in Kafka, right? As well as the durability guarantees, you have also got a single place of record where the ordering is kind of set in stone.
And so a subsequent read of that will give you back the things in the same order that everyone
saw it in. And that's a useful property in some cases. But going back to your event trade,
you're saying that that's something that you could actually tolerate. And in fact,
you didn't want to take the hit for receiving from multiple systems.
Right. The sequencing process would just slow that down. So we couldn't do it, right? You have to
just design the system to be tolerant of that. But I think something that's really important
to understand, and this is true of Kafka, this might be just a general CAP theorem thing: if you're going to get a sequenced stream of events, then it can be very difficult to build a system that can
scale horizontally with that constraint. Something has to be the sequencer.
The arbiter of what time things happen, which came first.
Right. Yeah.
In the particular case of Kafka, I forget topic versus stream and exactly how that is,
but it's like the thing that gives you
that ordering guarantee cannot scale horizontally.
That is, yes, the stream within a topic.
So topics can have multiple streams
and those streams are kind of a unit by which they are
given to individual members of the Kafka cluster.
And of course you can have multiple processes
and threads and whatever.
So essentially by sending to a single stream, you're sending to a single source, sorry, single destination. And that's the thing that gets to decide. But there's only one of them. If you need to go faster, you need two of them, and now suddenly you no longer have this nice guarantee of a total ordering. And that's what we're talking about here, a total ordering. Yeah. So there's some important trade-offs to consider.
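(For what it's worth, in Kafka's own terminology the unit being described is the partition: a topic is split into partitions, ordering is only guaranteed within a single partition, and messages with the same key land on the same partition. A hedged sketch assuming the kafka-python client, with made-up topic and key names:)

```python
from kafka import KafkaProducer  # assumed: kafka-python client

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Messages with the same key hash to the same partition, so they keep their
# relative order. Spreading across more partitions buys throughput but gives
# up the total ordering discussed above.
for i in range(10):
    producer.send("orders", key=b"instrument-XYZ", value=f"event {i}".encode())
producer.flush()
```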
Why not just use the time as the total ordering?
Well, how much time do you have? Because
well, you said you said you had an hour. So
well, so to start with, what precision? Because all of the
precision you choose, you're going to get some amount of
collision, right? These two events happened at the same
nanosecond, which comes first? I don't know. Right? Yeah. Yeah.
No, exactly. Right. Like that's not a deterministic sort order.
Right. And if you look... yeah, you think that never happens.
Oh, yeah. You know, the birthday paradox kind of thing means that it happens a little bit more often than you would otherwise naively think. But yeah, still, I'm going to admit here: we did use nanoseconds since 1970 as, like, a global key for packets arriving in one of the products
I worked on a number of companies ago. And the solution there was a post-process that arbitrarily picked one of them if it found two with the same key, and just added one nanosecond until it didn't match anymore. Right. Pragmatically, it mostly never happens, but when it does, it really blows your system up. So yeah. Yes.
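(A sketch of that post-processing fix: nanosecond arrival timestamps used as keys, with collisions broken by bumping the later record forward one nanosecond until the keys are unique again. The details are an assumption about how such a fix might look, not the actual code.)

```python
def deduplicate_timestamps(timestamps_ns: list[int]) -> list[int]:
    """Make nanosecond keys unique by nudging duplicates forward by 1 ns."""
    out = []
    last = None
    for ts in sorted(timestamps_ns):
        if last is not None and ts <= last:
            ts = last + 1  # arbitrary but deterministic tie-break
        out.append(ts)
        last = ts
    return out


assert deduplicate_timestamps([5, 5, 5, 9]) == [5, 6, 7, 9]
```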
And then: so how much precision? Great question. And, you know, you and I have been fortunate enough to work in the
finance industry where we already like to have accurate time. So getting a somewhat accurate to
within low digits of nanoseconds is feasible for us. But for most people, that isn't
an option. You can get milliseconds at best. Right.
NTP will get you within plus or minus 15, maybe 20 milliseconds, you know, better than
two people synchronizing their watch in an old spy movie, but not that much better.
Yeah, yeah, yeah.
And I do think it's sort of that false precision problem that leads you into this trap where
you're just like, well, this nanosecond precision timestamp, what are the odds?
Like they can't even physically arrive at the same time, like the photons don't move like that.
It's like, okay, but then what happens when your clocks are
just off, right? Like you're just they're just not that
precise. And so you get two things that have the same
timestamp because your clocks just aren't that precise.
Right. And you know, as soon as you have more than one cable, the photons don't move that way, but you can have two parallel streams of photons that do arrive at exactly the same time.
So you do.
It can, and does, happen.
Yep. Yeah.
So yeah, you can't just use time.
And anyway, whose time are we talking about?
Because right, right.
You know, we're getting into the whole problem.
This is a whole other category of this, which is clock domains, right?
Like synchronizing time
between multiple computers is hard. It requires thought and
oftentimes specialized equipment. And if you just sort
of take it for granted that all clocks everywhere are the same, you're setting yourself up for a lot of hurt; the hurt is coming for you. So anytime
that you're going to be comparing time, you need to be
thinking about what is the source of those clocks and how
precise are they and how accurate are they? And how are
you going to deal with the differences between them? And
what are those differences? What can they be? And you know, what
are the things there? So it can go all the way from, you know,
we've got a GPS antenna that's sitting
on the top of the building,
and we know the precise geographic coordinates
of that antenna, and we know how long the cable is
from that antenna to all of the various servers
that are using that antenna to synchronize their time.
And from the length of those cables,
we can compute the drift from the received signal
and the antenna to each of the individual computers, right?
And unless you're taking that level of precaution
or something kinda like it,
I would not trust any nanosecond timestamp
to be greater or less than anything else, right?
You've missed out even some bits there.
Like when we were doing stuff at previous companies,
there would be a rubidium oscillator with a very high, you know... there's an oven that's got rubidium at some temperature, and that's the thing that you synchronize with the GPS, and everything synchronizes to that with some complicated protocol. And, well, no, I say complicated, but this is my favorite protocol. I remember one of our network engineers saying to me, yeah, we use PPS to synchronize the master clock with the individual clocks on each of the
machines. And I'm like, PPS? Wow, what's that protocol? You know, because I've heard of NTP and I've heard of PTP, but PPS? And he's like, it stands for pulse per second. It literally goes five volts once a second, on the second. I'm like, oh right, that's the protocol.
Just on and off.
Got it.
It's a good simple protocol.
It's a simple protocol.
But yeah, again, you talk about the lead: you know, the cables were very carefully measured and very carefully designed so that the delays they brought in were understood. So yeah, it's complicated.
Right, right. And reasonable people could disagree, because, yeah, you can have a data center full
of things that uses your discipline for clock synchronization, which you're maybe happy
with. But if you take a message from say, an exchange and the exchange is, hey, this
happened at this point in time, you have to trust their ability to manage that. If you
want to say, well, why don't we use their clocks?
You know, whatever we're doing on our side, forget it, let's just use the clocks from the remote people. We have been through this process. You're like, well, that makes sense. You know,
they surely have done something sane. And then of course, what if they haven't? I mean, not that I would throw aspersions at our friends who have a difficult job maintaining these systems. But like,
yeah, things have gone wrong before. And then suddenly you're thrown into a world of hurt because time went backwards
by tens of nanoseconds. And you're like, no, I always expect time to go forward because,
you know, that's one of the few truths along with taxes and death is like forwards.
Nope, you think it does. But I mean, I think that raises a really good point, which is one way that you can get around
this time synchronization difficulty is to never use the system time of the computers
that are in the messaging system and embed time in the messages, right?
And then the ultimate source of the messages is the thing that has to have a reasonably
accurate time.
But the sense of time for all of the downstream system
just comes from that.
And that is really important if you
want to do what we were kind of talking about earlier, where
you have a sequence of messages and you're
trying to reconstitute state based
on that sequence of messages.
If there's any sort of time processing that has to happen,
then embedding the time in the messages allows you to reconstitute that state retroactively, right?
So you can go back and you can replay the messages from three months ago and reconstitute whatever state that you have,
even if it depends on time, because it doesn't depend on the clock of the computer that's just running the simulation or the reproduction.
It's extracting that time from the messages
itself. So you always get exactly the same result.
Yeah. No, I mean, just to take a temporary diversion here, this is one of the things that, in the code base I was working on, we used different types for the different types of time. So they were literally not comparable or convertible between each other without an explicit thing I could search for in the code, saying: we're crossing clock domains right now. I am trying to look at the current time as measured by whatever process has given me the time on my computer, and I'm comparing it to the message time that was embedded in the message through some mechanism, and I have to know that that comes with this huge bag of caveats. It's sometimes useful to do it, because one thing you might want to do is measure
the skew between the two, just to graph it somewhere or just to keep track of it
or just to alert if it gets more than a few hundred milliseconds or something out.
So you do want to be able to do it, but you definitely don't want to be able to
do it just by saying time t equals clock dot now minus message dot time.
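(A sketch of the distinct-types-for-distinct-clock-domains idea. In a statically typed language the bad subtraction fails to compile; in this Python stand-in it is a runtime type error, and the one sanctioned crossing point is an explicit, greppable function. All names are illustrative.)

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalTime:
    """Time read from this machine's clock, in nanoseconds since the epoch."""
    nanos: int


@dataclass(frozen=True)
class MessageTime:
    """Time stamped into a message by its producer, nanoseconds since the epoch."""
    nanos: int


def skew(local: LocalTime, msg: MessageTime) -> int:
    """The one explicit, searchable place where clock domains are crossed.

    Useful for graphing or alerting on drift, with all the usual caveats.
    """
    return local.nanos - msg.nanos


# LocalTime(...) - MessageTime(...) raises a TypeError here (and would fail to
# compile in a statically typed language), because neither type defines
# arithmetic with the other; you have to go through skew() on purpose.
```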
It should be: no, that's a type error, right? The thing is going to fail to compile there. You have to do some work here. And you know, that's always been a worthwhile thing I've found to do. And even
within a computer, you know, like there are different clocks, you've got monotonic clocks
that are guaranteed to not go backwards. You've got clocks that try and like adjust because of
the NTP drift as they're readjusting themselves. You've got the CPU cycle counter, which is measured in its own
domain. So this is something that's useful to have more generally. Gosh, this is really going off
topic, isn't it? This is great. But no, it's a really important thing to know about. And I mean, I think it's worth saying as well, just because it's cool, that it is possible to get networking
hardware to add a timestamp onto the end of packets that flow through it. So there are
certain switches that you can configure. You can plug them into this PPS and get them to synchronize
with your very accurate timestamp. And then every message that flows through that switch gets a
payload on the back of each packet, tacked on after the end of what would normally be the UDP packet
or the TCP packet or whatever. And you need to use exotic mechanisms to go and actually
pull those bytes out, but they are there. And then you can have like a source of truth
that maybe the edge of your network as things come in from the outside world, you say, well,
this is where we're going to timestamp
it. And that's useful for both reconstituting the sequence in
which they arrived at the edge, which is not necessarily the
order that they arrived at you, because cables can vary within the system and routes within your system can vary. But it gives you something to measure things by, and in particular, when
you're doing some of those more latency sensitive things that
we've been talking about, having a sort of ground truth comparison,
that you can look at that timestamp for the thing that came in and look at the timestamp of your
message that went out of the system, you've got, like, that's literally how long it took, warts and
all every network hop. Anyway, that's one of the many sources of clock domains. And we were talking
about clock domains in the context of ordering. So yeah, yeah, well, and that actually brings up
another topic, which is that time stamping is an example
of something else that is a really good practice, which is tracing, right? As each, as the message
flows through your system, and as it's being processed at each stage, it is quite often useful
to be able to embed in the message or maybe as a wrapper around the message,
depending on how you do it, information about the tracing. And that can be useful for performance.
It can be useful for error.
Debugging.
Debugging. Yeah. Just general observability, figuring out like, hey, this message failed
to process. Why? Where did it stop? What problems did it run into? Or it
was really slow to process. Why? What was the bottleneck? What was the slow part?
And sometimes you'll do things like creating some sort of identifier at the point of ingestion
or message creation, and then you can have an external system that refers to the message
as it flows through using that identifier, or sometimes you're literally just adding information into the message object as it's flowing through.
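(A sketch of wrapping a message with tracing information as it flows through the system: a correlation ID assigned at ingestion plus a hop record appended by each stage. Field names are illustrative.)

```python
import time
import uuid


def ingest(payload: dict) -> dict:
    """Wrap a newly arrived message with a trace ID and its first hop record."""
    return {
        "trace_id": str(uuid.uuid4()),
        "hops": [{"stage": "ingest", "at_ns": time.time_ns()}],
        "payload": payload,
    }


def record_hop(message: dict, stage: str, error=None) -> dict:
    """Each processing stage appends where it touched the message, and any error."""
    hop = {"stage": stage, "at_ns": time.time_ns()}
    if error is not None:
        hop["error"] = str(error)
    message["hops"].append(hop)
    return message
```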
Now, incidentally, that is what we used the nanosecond timestamp for, because obviously the hardware
on the outside would put this nanosecond timestamp on every packet.
We're like, well, that's a unique identifier, except when it isn't.
But most of the time it is.
And then it gives you this sort of unique ID, this sort of trace ID, which carries information in its own right, because it's the time that it arrived as well. But yeah, not always, unfortunately, not always unique.
No, I've variously seen this as, you know, provenance or tracing or causality. And I know the OpenTelemetry project keeps being pointed out to me, and I'm going to start looking at that soon; I keep meaning to. They seem to have a whole bunch of stuff around the telemetry of systems more generally,
but I wonder if they have something that also talks of or can be used to correlate. That's
another one: correlation IDs and things like that, where one event and the causality as it traces through your system let you see all the different events. I mean, even on just a website, seeing that someone clicked a button and caused an error, and being able to say that the backend error was caused by this click over here, is useful.
Anyway, sorry, again, really off base here. But yeah, no, I mean, I think these are all dimensions of consideration you need to be thinking about if you're going to build systems like this, right?
Yeah, we've talked about various dimensions of messages so far: we talked about durability, we talked about sequencing, we've now talked about tracing, and sort of determinism. We opened with, you know, don't put giant blocks of data into your messages. And we said, be very careful about which clocks you use. What other considerations are there? I mean, how did your monitoring system do it? Let's think a little bit about the monitoring system.
So that had a very, very high volume of inputs.
Like essentially it was a centralized monitoring system
for the whole company's services.
All the services could send all the stats they wanted to it
and you had to deal with it.
Yeah, I'll tell you one thing,
one mistake that we made,
and this is, you know,
good judgment comes from experience
and experience comes from bad judgment.
And so listeners, I hope that you get to benefit
from all of the bad judgment of the people on this podcast
and the hard-won experience.
And so when I say like,
you need to be careful about clock domains
and you need to think about like
where your source of time is.
One of the great mistakes that we made very early on
in that project, and it's something that just haunted us
forever, is we allowed people who were sending messages
to the system.
So the idea behind the system is that you'd have,
you know, external clients that could send, you know,
telemetry data or I mean, basically anything like prices,
internal application metrics, whatever they wanted,
they could send data to the system.
It worked a little bit like stats D,
if you've ever used stats D.
Yeah, it's sort of like Prometheus type things that,
but it's a lot more,
it was designed for more real time stuff
rather than like once a minute, once a second.
Yes, yes, yes.
It was very much like that.
The idea behind the system was like, you know, it's cool that Grafana has a chart that updates once a minute, but we need something that can update many times
per second because it's monitoring trading systems. And if something happens, we need to know about
it right now. So like human time. But one of the great mistakes that we made with the system was
allowing people to put their own timestamps on those messages. That was a terrible idea.
An absolutely terrible idea.
Absolutely terrible, but easy to do. I can see why you'd want to be able to do this.
You know, I find this quite often with things like our Prometheus setup. Because, you know, hey, I've got a build.
I want to like measure my build time and I want to post it.
And then sometimes I want to go, actually, I want to go back in time and
like run the last hundred builds one day apart from each other.
And I want to populate some data in the database so that I don't just have now data.
I have historic data once I thought I want it right.
And so how bad would it be to let me post stuff that's in the past to you so that I can write my data like, you know, it's a reasonable thing to want to do.
So what was the drawback?
What was the what what made you rue that decision?
Well, because inevitably people want to be able to say like, oh,
and also give me the list of all the messages that were delivered on this day.
And now that's just wrong.
Because your timestamp and my timestamp don't line up for whatever reason.
Right. It could be that you pre- or post-dated your thing,
but you did the calculation wrong.
It could be that what you actually want, when you say the data that was delivered on that day, is the data that was delivered on that day, and not whatever timestamp it had because that came out of your log file or whatever.
This comes back to almost the bitemporal thing. It's like, you know,
there's the time that I got it.
And that's the kind of knowledge time.
When did I know that you said that you wanted this thing?
That's one timestamp.
And then the other timestamp is what time did you say that you wanted this thing
to be known as of or related to?
Sorry.
Uh, and in almost all situations, those two times are coincident or so
close that nobody cares, but not always.
Right.
And I think that's the harder thing. I don't know, have we ever talked about bitemporality? Maybe we have.
I, I don't know.
We must've done in passing.
I don't care.
That's a whole interesting world as well.
You know, like it's, it's, uh, yeah.
You want to say on this day, what messages did you send me?
And then you want to say on this day, what samples fall in this window, which is different from
when did you tell me about those samples? Right? That's a very
again, they're mostly the same. But yeah, that's okay.
Yeah, if I had it to do over again, what I would have said is
no, you cannot specify the timestamp. But you can and this
was true already, you can put whatever data you want in your
message. And you can query based on any of that data. So if you want to have your own log timestamp or ingestion timestamp or whatever, you can add that as a field to your message. My system will be blissfully ignorant of it, other than it's another field that you can do stuff with, and you can do whatever you want with that timestamp.
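(A sketch of that approach: the system stamps every message with its own arrival time, and any timestamp the client wants to attach is just another queryable field with no special meaning. Names are made up.)

```python
import time


def ingest(client_fields: dict) -> dict:
    """Stamp the arrival time server-side; client timestamps are ordinary fields."""
    return {
        "received_at_ns": time.time_ns(),  # the system's own clock, always
        "fields": dict(client_fields),     # may include a client 'log_ts', etc.
    }


record = ingest({"metric": "build_time_s", "value": 412, "log_ts": 1718000000})
# Queries about "what arrived on this day" use received_at_ns; queries about the
# client's own notion of time use record["fields"]["log_ts"].
```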
Yeah, that is your piece of data to do with as you wish, but we know when it arrived with us. That's all we're going to keep as the primary thing that we can...
Yeah, yeah, yeah.
Yes. Also, speaking of timestamps, please, please, please do not put localized timestamps
in your messages. It's a long: it can be nanosecond precision, it can be millisecond precision, it can be second precision, I don't care. But it's a number; please just put a number in there. Don't put some parsed string with a timezone offset.
No, and store it in UTC. Yes, this kind of thing or some well
defined never changing thing. I think I don't
know to what extent it's an open secret or not, but a very large web search company, to this day,
to the best of my understanding, still logs everything in West Coast time, which means that
its logs and the graphs that go with them have, twice a year, either a big gap or a weird doubling back on themselves type of thing.
Um, and it's just the cost of changing it is so high that it hasn't been done.
But yeah, there are times... there's a time and a place for localized time, and it is in application-level things. If you're trying to talk about what time a trade happened on a particular exchange,
it is useful to specify it in the local time of that exchange, say,
because you know that exchange opens at 8:30 local time on that day and closes at 3:30 local time on that day.
But if you have to sit and try and work out or do anything other than compare
with arithmetic operations,
straightforward arithmetic operations on a 64-bit number, then you're doing something wrong.
If you have to kind of work out what day that was and then, Oh,
is it daylight savings or not? Oh no, that, wait a second.
That was in Europe, wasn't it? And they don't do daylight savings.
Ah! Absolutely, absolutely.
Like religion and politics, time localization
should only be discussed in the home.
Like, the international standard is a 64-bit number
and only when you're displaying it or like viewing it
or making a report, do you ever take that 64-bit number
and turn it into some localized time that is localized for the person who is viewing it?
Right. Yes. The system perhaps.
Whatever it is. Yes. No, then that makes sense. Yeah. I think that is that.
And then, yeah, nanoseconds since 1970 is not a bad thing to fit into 64 bits.
That'll get you to, I can't remember when, but it was, you know, it's far enough in the future.
At least right now, I don't have to worry about it before I retire. Although
that is, you know, I'm an old man. So maybe, maybe the younger folk will have to worry
about it. But there are any number of ways of storing time better than that. Or, you know, you can pick your own epoch, right? You don't have to; 1970 is convenient because then you can use the Unix date command to kind of move back and forth.
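(A back-of-the-envelope check on "far enough in the future": signed 64-bit nanoseconds since 1970 runs out around the year 2262.)

```python
from datetime import datetime, timezone

# 2**63 - 1 nanoseconds is roughly 292 years, so nanoseconds-since-1970 in a
# signed 64-bit integer overflows around April 2262.
max_ns = 2**63 - 1
print(max_ns / 1e9 / 86400 / 365.25)                          # ~292.3 years
print(datetime.fromtimestamp(max_ns / 1e9, tz=timezone.utc))  # 2262-04-11...
```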
In fact, one of the first things I do,
I check in all my dot files.
Sorry, this is another sidetrack,
but one of the like fish functions of the shell that I use
is to convert numbers from an epoch time
to like a displayable time and backwards, right?
So I can do epoch and then just type a number
and then, based on however many digits it's got, it guesses whether it's millis, micros or nanos, and then it prints it out in my current time zone. And it is the single most useful thing. I know people go to epochconverter.com, which drives me bonkers to see. You know, why would you go to a website with all these flashing ads and things on it just to convert some numbers, when it's something the command line can do? But on the other hand, it's
Yeah.
But, or, you can just open up a JavaScript console in your favorite browser and paste the timestamp into new Date().
Yeah, that's true.
I'll also give it to you.
That's a great one.
Yeah, I'm remembering that one.
That was even more portable than mine.
It's super, super convenient most of the time.
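(A sketch of the guess-the-unit-from-the-digit-count trick being described, done in Python rather than a fish function: decide whether a raw epoch number is seconds, millis, micros, or nanos, then print it as a UTC time. The digit thresholds are the assumption that matters here.)

```python
from datetime import datetime, timezone


def epoch_to_datetime(value: int) -> datetime:
    """Guess the unit of an epoch number from its digit count and convert to UTC."""
    digits = len(str(abs(value)))
    if digits <= 10:        # e.g. 1718000000 -> seconds
        seconds = value
    elif digits <= 13:      # milliseconds
        seconds = value / 1_000
    elif digits <= 16:      # microseconds
        seconds = value / 1_000_000
    else:                   # nanoseconds
        seconds = value / 1_000_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(epoch_to_datetime(1718000000))             # interpreted as seconds
print(epoch_to_datetime(1718000000000000000))    # interpreted as nanoseconds
```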
Another thing to think about here, and this is kind of getting back to,
I was saying, don't put a database in the middle of your messaging system, right?
Generally, sometimes it's fine. And as you said before, sometimes it's just a file. But like, okay, if I can't do that, then how am I supposed to bridge the gap? Because there will
almost certainly be a gap between the world of stream processing systems and batch processing systems.
At some point, someone's gonna wanna run a database query
or something on your data.
And how do you handle that?
And also this kind of ties into a durability thing
where it's like, if you don't have a system like Kafka
or some other sort of durable queue
in the middle of your system
to kind of keep track of the history, you know, you just have UDP packets, or you have something else, then what should be responsible for sort of keeping the historical record of everything that has ever happened, right?
Right. Which obviously some people don't need, and that's fine. If you're a video game server and you've got player positions that are being updated, then maybe you don't need a log for all time. But if you're working in
finance, it's generally a good idea to keep everything forever for all time in case somebody
comes and asks you a very awkward question about what happened. Yeah, yeah. And this ties in also
to another thing that we were talking about: reproducing state for state machines. So, you know, the cool idea is like,
all right, I'm gonna take my messages,
I'm gonna pass them into some system that processes them.
There will be no other information
that goes into the state machine
other than the messages itself.
And therefore I can completely reproduce the state
from the sequence of messages.
It's like, yes, that's cool,
but what happens when you have seven years worth of messages
and you have to start at the beginning?
That seems bad.
So one of the things that you typically do is you have something that is consuming the
stream of messages whose purpose is to store them and also potentially snapshot them.
So you have something that is consuming the messages, it's writing them into some persistent
store, maybe it's even like transforming them into like something that can fit into it like
a database table or some other format that is nice for bulk processing.
And another thing that it might be doing is running this sort of state machine and taking
a snapshot at some regular interval and then putting that into the storage as well.
So that when you need to reproduce the state for some particular point in time,
rather than having to play all seven years worth of messages through your system,
you can jump to, you know, a prior,
but recent snapshot and then load that state into your system and then only
replay the messages forward from there.
And that will be much faster and much more efficient.
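(A sketch of the snapshot-plus-replay recovery being described: the consumer periodically persists its state alongside the last sequence number applied, and recovery loads the latest snapshot and replays only what came after it. The store object and its methods are stand-ins, as is the toy state machine.)

```python
SNAPSHOT_EVERY = 10_000  # how often to persist a snapshot, in messages


class SnapshottingConsumer:
    def __init__(self, store):
        self.store = store   # assumed: save_snapshot / load_latest_snapshot / messages_after
        self.state = {}
        self.last_seq = 0

    def apply(self, seq: int, message: dict) -> None:
        # Stand-in state machine: the only input is the message stream itself.
        self.state[message["key"]] = message["value"]
        self.last_seq = seq
        if seq % SNAPSHOT_EVERY == 0:
            self.store.save_snapshot(seq, dict(self.state))

    def recover(self) -> None:
        # Jump to a recent snapshot instead of replaying years of messages.
        seq, state = self.store.load_latest_snapshot()
        self.state, self.last_seq = state, seq
        for seq, message in self.store.messages_after(seq):
            self.apply(seq, message)
```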
Right, right, right.
Provided there exists a sensible snapshot format, which is an interesting...
So I think this has now sort of moved into what I think of as like a log structured journal of like, you know,
like you have some database or in memory representation of the world that you update through seeing these events.
For some things, so for example, to build the set of live orders on an exchange, that is the prices of like Google and all the people that are trying to buy and sell Google.
You can unambiguously snapshot that state and go: okay, at this point in time, at nine in the morning, these are everyone's orders.
And now if you just load up this 9am, you can carry on.
You don't have to load up, you know, the 7am one, or replay the whole day; that's fine.
Right.
But as soon as you start getting to things that have state that is like non-trivial,
now it becomes a function of the processor of that state.
So let me give you an example.
What if you were keeping some kind of exponential moving average of some
of the factors of that, that depends on how long the window of your exponential
is and some other properties of that.
What, what do you count, which kinds of information go into that or don't.
And now you've got a complicated piece of state that is arguably different
for every client. You know, maybe some people care about a 10 day look back
and other people want to, you know, a five minute look back.
And so that gets kind of tricky.
But yeah, I don't know. I don't know where I'm going with this now.
But it's just not as straightforward for application domains if they have any kind of state that requires some history in order to get to that point, other than the pure individual add/remove of, say...
Yeah, yeah, yeah.
...unambiguous stuff. Yeah.
Yeah. And that state can get quite large because of these constraints. And I think
this is something that is really important to think about because this kind
of snapshotting becomes very important when you think about error recovery. And there's two
dimensions of error recovery that I think we can talk about here. One is you've got some consumer
of this stream and it's crashed and now you want to restart it. What state do you need to restart it, right? What state do you need to let it sort of rejoin the stream?
Again, do you have to go back to the beginning of time
and process seven years with the messages
for your system to restart?
That's going to be bad, right?
So-
We'll fix it next year.
Yeah, right.
Yeah, I guess so, yeah.
We've rebooted it and the website will be back online in 2038.
So you have to think about the state
if you want to be able to recover.
And you need to think about how you can reasonably
snapshot that state if you want to be able to spin something
back up and have it sort of rejoin the stream, right?
And so I think you have to consider that
from the very beginning. Like, how big is the state? How often can we snapshot it? What is our sort of acceptable amount of downtime here for these various things? You know, is it like an hour? Is it a minute? Is it, you know, a month? And how are we going to be able to rejoin this processing? Otherwise, we can never turn the software off.
Right, which is an option.
Um, just don't write any bugs.
I don't have any hardware faults.
Um, yes, we'll be golden.
Yeah, yeah, yeah, yeah.
Um, another dimension to think about with fault tolerance with these systems is poisoned messages.
Yeah, so that is a very common situation
where there's a bug in your system or a bug in
a producer system perhaps and you receive a message that you can't process. And redundancy
here will not save you. You can have like 10 redundant systems that are all consuming the stream
and processing the messages so that if one runs out of memory or whatever, the other nine are there.
But if they all have the same bug
and they all get the same message,
the whole point of the distributed state machine
is that they are all going to do the same thing, which is not
process your message.
They'll all crash.
I mean, they might crash.
All kinds of manner of problems can happen here.
So one common approach to dealing with these things
is creating what's called a dead letter queue.
So you have a message that comes in and your system cannot process it, but it's able to detect that it can't process it.
Maybe it raises an error. Maybe there's some validation step, whatever it is.
And it's like, I can't process this message.
So what I'm going to do is I'm going to take it. I'm going to put it into another queue.
Another thing, a stream of messages, called the dead letter queue.
And it's going to sit there until somebody does something with it.
Now the first thing that you want to do with it is send some kind of notification or alert
or something.
Someone's phone goes off.
Yes. Like, you know, somebody's getting paged. It's like, ah, we just got a message we don't know how to process, right?
But if you do that, then depending on the state machine
that you're trying to reproduce, if you have one,
or just the message processing that you're doing,
it can sometimes be OK to say, OK, I'm going to take this message.
I'm going to put it in the dead letter queue.
And then I'm just going to keep going, right?
I'm going to pretend like I never even got this message, because it's malformed, or there's some other problematic thing with it, and I'm just going to keep going.
You can obviously run into situations where there's just a bug in your code.
And this is a message that you need to process and you didn't process it
correctly and now you're, now you're doing wrong.
Yeah.
Right.
Um, but there are also situations in which you have one of these messages and
it is truly something that is malformed and can be ignored, was never supposed to be created in the first place, and now you can just continue on. Having this in the dead letter queue, a common pattern that I have used with great effect is being able to basically redrive those dead letter queue messages back into the main queue.
Oh, if sequencing doesn't
matter.
Right.
If sequencing matters, then you can't do this.
Right?
But if you have a system where there's no sequencer or there is a sequencer, but it
doesn't really matter all that much, then you can take these messages and be like, all
right, we got this message we don't know how to handle.
It went into the dead letter queue.
We're now going to change the code so that it can handle this message in
some way, redeploy that, and then redrive the message back
into the queue so that it can be correctly processed and flow all
the way through. Right. And that is a really nice way to handle
it if you can.
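(A sketch of the dead letter queue pattern as described: failures are parked on a separate queue with an alert, processing continues, and after a fix the parked messages can be redriven onto the main queue. The queue objects are stand-ins with put/get/empty methods, and redriving is only safe if ordering doesn't matter.)

```python
import json


def consume(main_queue, dead_letter_queue, process, alert) -> None:
    """Process messages, parking anything unprocessable on the DLQ and moving on."""
    while True:
        raw = main_queue.get()
        try:
            process(json.loads(raw))
        except Exception as exc:
            dead_letter_queue.put(raw)
            alert(f"unprocessable message parked on DLQ: {exc}")
            # Keep going: acceptable only if skipping a message doesn't corrupt
            # whatever state machine you're maintaining downstream.


def redrive(dead_letter_queue, main_queue) -> None:
    """After deploying a fix, push parked messages back through the main queue."""
    while not dead_letter_queue.empty():
        main_queue.put(dead_letter_queue.get())
```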
If you're able to do that, then yes, that's really nice. So particularly, for example, if this was some, you know, holiday booking stream of information, like a centralized holiday booking thing. And then someone comes in and they've just booked some suite, and the price is higher than you've ever hit before, and some internal issue happens, and you're like, oh, damn, you know, we can't book this for them because it's legitimately a $100,000-a-night thing, and that just overflows something, and we're done. But you're like, this is really valuable business.
Ben, can you hot fix that very, very quickly, write a test, fix the test, deploy the thing, and
then we're going to put it back in again. And then the booking goes through, albeit 30 minutes, an
hour late, at least it gets done and you capture the revenue and everyone's happy.
And you know, yeah, that seems like a really nice way to heal the system in that instance. But obviously sometimes it can be a legitimate bug or a malformed message or something like that, and you have to be able to deal with it. Yeah. Because, as you say, fault tolerance was a dimension that you talked about. So another
dimension for message processing systems is that like things go
wrong, computers go wrong, and it's entirely reasonable to have
more than one person, well, not one person more than one system,
listening to this stream of messages, and independently
processing them and updating them. And then if the machine
breaks, well, you've got two more of them, and that's okay. And then you have to have a
system behind that system that determines what the actual
outcome of any particular update was. But you've got fault
tolerance by scaling through a messaging system. And that's
that that's a really interesting solution. And part of the
solution that we put together at the aforementioned cryptocurrency
trading place, which was a really interesting solution for a
number of things that we were doing, wasn't it? It allowed us to do rolling updates of the code
because we could have a quorum of five machines doing the same processing and then take two of
them out of the system, upgrade them and then put them back in again and then run them in silent
mode and check
that everyone's still agreed on everything that was happening. And then only when we were confident
that we hadn't introduced a new bug, we could add them back into the pool and then start rolling
over the other three. And there you go. Now you can do rolling upgrades and you're never down.
Hooray. It let us do things like have different configurations of those computers, be it through different JVM settings or different hardware or whatever, such that if one of them processed the message faster than the others, or one had to GC, say, or one was doing some JIT work or whatever, we could make sure that as long as two or three came up with a good answer that we were happy with, the other two could be slower, and that's fine.
And that meant that we could hide some of our tail latencies in the quorum, which was,
you know, so we got all these wonderful, and obviously, yeah, if we had an equipment failure,
then you know, two or three of those machines could die and the machine and the site would
stay up and we'd be able to process transactions and everything.
And that was super cool. And was, was definitely
eye opening to me working there in terms of like, Hey, you get a lot of benefits from
doing it this way. That's great.
Yeah. Yeah. I had that, that same experience. And we had done some things like that at,
uh, my, my previous company when we were basically intentionally creating races between systems
because we were trying to get them to run as fast as possible.
And it created an opportunity to make the system more
fault tolerant, where you'd have multiple parallel things that
are all processing the same stuff.
And the first one to finish wins.
And so if there's some variation in the latency
because of some operating system level thing or garbage collection, because some of this was Java
or something else had happened, right?
Or one of them was just offline and was losing every race
because it just wasn't processing anything.
It was all fine, right?
I think one of the more interesting things from that
is, if you want to be tolerant of certain types of failures, you know, like gamma-ray-burst type stuff where bits just get flipped, then the number of systems that you need to do this is three.
The number of the counting is three, because you need to have three of them: not one, not two. Five is right out. And five is actually kind of fine in this case, but you need at least three, right? Because if you have two, and one says the answer is A and the other says the answer is B, you don't know which is right. You need three so that you can compare: okay, two of them say it's A, and one of them says it's B, so B is suspect. And if you have five, and four of them say it's A and one of them says it's B, then that's even better, right?
Yeah. So if they're running the same version of the code and they still disagree, it's time to start looking through your radiation hardening protocol for what on earth happened, or check the dmesg for any kind of uncorrectable memory errors and things of that nature.
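(A sketch of the "number of the counting is three" point: with at least three independent replicas, a single corrupted answer can be outvoted; with only two, you can't tell which one is wrong.)

```python
from collections import Counter


def agreed_result(results: list):
    """Return the majority answer from independent replicas, or raise if there isn't one."""
    if len(results) < 3:
        raise ValueError("need at least three replicas to outvote a bad one")
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority; check all replicas")
    return value


assert agreed_result(["A", "A", "B"]) == "A"
assert agreed_result(["A", "A", "A", "A", "B"]) == "A"
```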
But yeah. I think I've just looked at the time, and, you know, given that we hadn't really got a plan, which, as our regular listener will know, is how we do this, we've covered quite a lot of ground,
although I don't know that we covered our intended topic exactly as I would have
done if we'd have written out something before,
because we went on so many tangents, but in a good way, like we talked about
time, we talked about durability, we talked about scalability, um, and all
these things come out of a message-based system, or can come out of a message basis, especially if you have this sort of journal-based thing where you say the sequence of messages is the only input into my state machine, and I can trivially start from the beginning of time and get to exactly the same state. Or we can snapshot, if we know which internal state is important, at different points along that time, and have the best of all worlds, which is super cool. Yeah. So, by way of saying maybe we should stop here, that's why I bring all that up. This has been super cool. And definitely some deep memories there
from previous companies coming up there.
Bringing back the hard lessons of systems past.
We make mistakes so you don't have to.
It's our new tagline on this.
It's our new tagline, okay.
We've certainly made, I've definitely made plenty of mistakes
as well you know.
As I shared on social media, a picture of me driving a car through a place where cars shouldn't go, and, well, I got the car wedged. Yeah, it's the gigabytes' worth of car in your messaging system. Yeah, it didn't work very well. Anyway.
Yeah. Well, I think we should leave it there, my friend. Thank you as ever for joining me in this endeavor of trying to... I don't know what we're doing. Trying to be entertaining and enjoy ourselves, and hopefully be interesting and useful to other people too.
Yeah, yeah, this is a good one.
Alright, friend, until next time.
You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt. Theme music by Inverse Phase. Find out more at inversephase.com