Signals and Threads - Clock synchronization with Chris Perl

Episode Date: October 14, 2020

Clock synchronization, keeping all of the clocks on your network set to the “correct” time, sounds straightforward: our smartphones sure don’t seem to have trouble with it. Next, keep them all accurate to within 100 microseconds, and prove that you did -- now things start to get tricky. In this episode, Ron talks with Chris Perl, a systems engineer at Jane Street, about the fundamental difficulty of solving this problem at scale and how we solved it. You can find the transcript for this episode along with links to things we discussed on our website.

Transcript
Starting point is 00:00:00 Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky. Today we're going to talk about a deceptively simple topic, clock synchronization. I think there's nothing like trying to write computer programs to manipulate time to convince you that time is an incredibly complicated thing. And it's complicated in like 16 different ways, from time zones to leap seconds to all sorts of other crazy things. But one of the really interesting corners of this world is how do you get all of the clocks on your big computer network to roughly agree with each other? In other words, clock synchronization. So we're going to talk about that with Chris Perl, who's a sysadmin who's worked at Jane Street since 2012.
Starting point is 00:00:46 Chris is better than anyone I have ever worked with at diving into the horrible details of complex systems and understanding how they work and how they can be made to work better. And he's done a lot of work here, specifically on clock synchronization, and has, in the course of that, redone our entire system for doing clock synchronization. So he's had an opportunity to really learn a lot about the topic. So Chris, to get started, can you give us just a quick overview of how computer clocks work in the first place? So I guess the rough gist is something like you have some oscillator, a little crystal
Starting point is 00:01:19 effectively that's inside the computer that is oscillating at some frequency, and that's driving an interrupt that the operating system is going to handle in some level. Like, you know, there's probably lots of details here that I'm just skipping over, but that's driving an interrupt that's going to happen in the operating system. Then the operating system is using that to derive its notion of time. And so if you have like a really high quality oscillator and like those time interrupts happen at the right rate so that you're tracking real time, that might just happen.
Starting point is 00:01:42 And if your oscillator is very good and very stable, it could actually just be pretty close to the correct time just by virtue of that. But the truth is that most computers come with fairly bad oscillators and they change their frequencies for various reasons like heat. So if you are using your computer to compile the Linux kernel or something like that, that could change the heat profile, change the frequency of the oscillator, and actually change how well you're doing of keeping real time. When we naively think of clock synchronization as people, we think of it as, I'm going to go set my clock. I'm going to look at what time it is and adjust my clock to match whatever real time is. But you're actually talking about a different thing here. You're talking
Starting point is 00:02:16 not just about setting what the right time is right now, but keeping that time correct, essentially keeping the rate at which time is going forward in sync. Correct. You'd love it if you could get like a really, really high quality oscillator for super cheap in all your computers, and then you wouldn't need a lot of adjustment to get them to keep the correct time, but that would be really expensive. You can buy such things, they just cost a lot of money. So you say that heat and various other things that are going on the computer will cause this rate at which time is appearing to march forward inside of your computer to drift around. How accurate are these? Can you give me a numerical sense of how far these things drift away? The stuff that we run, we capture some of these statistics. We see machines that have a
Starting point is 00:02:57 frequency correction applied to them of say 50 parts per million, which is like microseconds per second. So that works out to roughly a couple seconds per day is like how you would wind up drifting off. But I'm sure that if you had like a super old desktop under your desk that you stole from your parents or something, and you were trying to rebuild into a Linux box, you might have worse numbers than that. A sort of relatively current generation server from a well-known vendor, you're talking somewhere around 50 to 100 microseconds per second that they can sort of walk off out of alignment. Okay. So clock synchronization is the process of trying to get all of those clocks that you have across your whole data center and across multiple data centers to be in sync with each other. Is that the right way of thinking about it?
Starting point is 00:03:39 I think so. In sync is like an interesting thing to say, right? You would like that if you were able to instantaneously ask two random servers on your network, what time it was at the same exact point in time, if you could somehow magically do that, that they would agree to some relatively small margin of error. And I think that's kind of what we mean by clock synchronization, that if you could somehow magically freeze time and go ask every single computer on your network, what time do you think it is that they would all roughly agree to within some error bound that you can define? Right. And this basic model actually assumes that there is a well-defined notion of what it means to be instantaneously at the same time, which isn't exactly true because of relativity
Starting point is 00:04:15 and stuff like that, but we're going to mostly ignore that. So I guess one property that you're highlighting here is having the clocks agree with each other. And that's part of it. But there's another piece, right? Which is having the clocks agree with each other. And that's part of it. But there's another piece, right? Which is having the clocks agree with some external reference. There's some notion of like, what does the world think the time is? So where does that external reference come from? I'm not an expert on this stuff, but I'll give you this sort of 10,000 foot view. You have various physics laboratories all over the world, like NPL in the UK and other places across the world, they all have measurements of what they think time is using things like hydrogen measures and sort of very like accurate atomic methods. They contribute all of that stuff to a single source who kind of averages it or does some sort of weighting to come up with like what the correct time is. And then you kind of magic that over to the Air Force, who then sends it up to the GPS constellation. And GPS has a mechanism for getting time from
Starting point is 00:05:10 the GPS satellites down to GPS receivers. And so if you're a person who runs a computer network, and you're interested in synchronizing your clocks to a relatively high degree of accuracy with something like UTC, which is effectively Greenwich Mean Time. It is just sort of like the current time without time zones applied. If you're interested in doing that, what you can do is you can just go out to a vendor and you can buy a thing called a GPS appliance, which can hook up to a little antenna that goes onto the roof. It can receive this signal from the GPS constellation and basically gives you out time. And the accuracy there, it's something like, you know, maybe a 100 nanoseconds or so
Starting point is 00:05:45 is like how accurate you're going to get from GPS. So you've got this sort of atomic measurements being fed up to a GPS constellation down to GPS receivers that you as a operator of a computer network can buy. And for the purposes of this conversation, we're going to treat those GPS receivers as the received wisdom as to what time it is.
Starting point is 00:06:02 And our job is to figure out how inside of a computer network, you make all of the different devices agree with each other and agree with that external reference. Correct. Why is it important? What does having synchronized clocks help you do?
Starting point is 00:06:15 If you put yourself in the shoes of financial regulatory authority and you have all these different participants out there doing stuff with computer systems and something weird happens and you'd like to come up with a total ordering of events of what led to this crazy thing or what led to this good thing, who knows, but you want to have a total ordering of events. If people don't have good clock synchronization to some external source,
Starting point is 00:06:37 you can't compare the timestamp from participant A to the timestamp from participant B. So if you were to decree, everybody must have time that is within some error bound. If these timestamps are within that error bound, well, then I can't be sure about the ordering. But if they're farther away than that, then I can be sure about the ordering. I can know which one came first and which one came second. And that can be very useful. So that's a motivation that's very specific to our industry. But don't people in other industries care a lot about clock synchronization too? I would have thought that there are other reasons that would drive you to want to synchronize the machines on a network. Oh, sure. There's lots of different things. I mean, just like a general sysadmin-y topic,
Starting point is 00:07:10 a lot of times you want to gather logs from all the systems on your computer network and you want to analyze them for various reasons. Maybe it's because you're concerned about intruders, or maybe it's because you're just trying to understand the way things are functioning. And if your clocks aren't synchronized, it's very hard to kind of understand things that might've happened on system B and how they relate to system A because the two timestamps are just not, you just can't compare them if they're not synchronized. And I suppose there are also some distributed systems, algorithmic reasons to want clocks. Certainly some kind of distributed algorithms end up using clocks as ways of breaking ties
Starting point is 00:07:40 between systems. And so that requires at least some reasonable level of synchronization. For sure. There's also other network protocols that are widely used that require clock synchronization, but much less precise levels of clock synchronization, right? Like Kerberos is a widely used authentication protocol. And that requires that the clocks be synchronized to like within five minutes. And the idea there is to like thwart replay attacks and stuff like that. So making sure that somebody can obtain your credentials from a couple of days ago and use them again kind of a thing. So there it's like the error bars are very wide,
Starting point is 00:08:09 but there's still some sort of synchronization necessary. Right, and I guess that's a general theme when thinking about synchronization is different applications require different levels of synchronization, but more closely synchronized never hurts. There's definitely trade-offs as you start to get into the lower bounds,
Starting point is 00:08:24 but like, yeah, if it were all free, sure. I'd like to have them exactly the same. How do people normally approach this problem of clock synchronization? What are the standard tools out there? Most people, you just kind of run whatever your distribution shipped as an NTP daemon. So NTP stands for the network time protocol. And it is a protocol that up until not that long ago, I just kind of assumed used some kind of magic. It knows how to talk to some servers on the internet or some local servers
Starting point is 00:08:48 that you probably then having talking to servers on the internet. And it synchronizes your clocks with those servers. It's exchanged some packets. Maybe it takes a little while, maybe a few minutes, maybe longer. You probably don't understand exactly why, but eventually your clocks are like relatively in sync to within maybe, you know, tens or so of milliseconds. Can you give us a tour of how NTP actually works? Like I said, for a long time, I just kind of assumed it was magic and didn't really think too hard about it. And then at some point I got tasked within Jane Street to actually look at some of this
Starting point is 00:09:16 stuff and try and meet some requirements that were a little bit harder than the sort of standard, you know, tens of milliseconds synchronization. So I actually went and just like, was like, okay, well, how does NTP do this from first principles, right? Like, let's go read some of the papers from David Mills. Let's just go see if we can actually reason this out ourselves. At the end of the day, it's just four timestamps. There's a lot more complicated stuff around it, but like the sort of core of it is these
Starting point is 00:09:38 four timestamps. Let's say I'm the client and you're the server. First, I send you a packet, but before I do, I write down a timestamp. When you receive that packet, you write down a timestamp. Then you send me a reply, and before you do, you write down a timestamp. Finally, when I receive that reply, I write down a timestamp. It may not seem that groundbreaking, but with just those four timestamps, I can compute two important numbers, the offset and the delay. The offset is how far my clock is off from yours. So if you think it's 12 PM and I think it's 12.05 p.m., then the offset would be five minutes. The delay is how long it took those packets to traverse the
Starting point is 00:10:08 network. To compute those numbers, you basically take a system of equations. And for me, an important aspect was actually writing down with a piece of paper and a pencil and solving these equations myself. It was understanding that there's a sort of huge assumption in this protocol, that the delay for the first packet, where I timestamped and then you did, and the delay for the second packet, where you timestamped and then I did, the assumption is that those times are the same. And if they're not the same, they introduce what's called error. And that is a sort of very important aspect. That is an assumption that is made such that you can actually solve those equations to get the offset and the delay. Can you maybe explain what it is about the calculation that depends on the symmetry packet to get from you to me.
Starting point is 00:11:08 And you're like, well, what do I do with this information? And you say, well, what if I just assume that those two delays are equal? And if I assume that those two delays are equal, well, then I can start rearranging the various pieces of the equation. And then that's how you can actually solve for the delay in the offset. What's the role of the two timestamps on the server side? So if you ask me what time it is, I write down when I receive it, and then I write down the time where I send it back. You could imagine a simpler protocol with just three timestamps. Then you just assume that that time that I wrote down happened in the middle
Starting point is 00:11:40 of that interval, the interval between the time you sent the first message and received the second message. How do you know when in the middle is, right? the interval between the time you sent the first message and received the second message. How do you know when in the middle is, right? There's lots of vagaries that happen with operating systems. Like if you sort of timestamp it on either end, like as soon as you receive the packet, you timestamp it, and then maybe you have to do some work. And then right before you send it back, you timestamp it. And that's sort of how you get closest to those differences I mentioned representing
Starting point is 00:12:01 the actual network delay from one to the other. And I guess an extra assumption that you're making here is that in that period between the first timestamp and the second timestamp, you had better assumed that the rate at which the clock is going forward is about right. I think that throws another error term into the equation. It's, I think, typically extremely small, right? It certainly seems like something you can in practice ignore, because if you just look at the number of parts per million or whatever that you were talking about in terms of how much drift there is in a real computer clock, I think that is in fact pretty tiny. Right. But you've got the correction being applied by the time daemon that's running on the computer, which is keeping the clock in sync. side of this communication is also taking time from somewhere, either a reference clock or some sort of higher upstream stratum in NTP, like clocks that are better than it, something like a GPS receiver. And it has applied a sort of correction to the operating system to say, hey,
Starting point is 00:12:54 I currently believe that the frequency is off by this much. Please correct that when you hand me back time. So I feel like your biggest, I guess, to your point of being able to ignore it in practice, your biggest concern would be if in between those two timestamps, something massive changed, like the temperature rose or dropped by like many, many degrees or something such that that frequency correction was now just wildly incorrect. Okay. So we have now a network protocol, put a timestamp, send a message, another timestamp, another time. So you get it back. Now the computer that started all this has some estimate for how much its clock is off. What does it do then? In the simple world, you could just set your time. You could just sort of say, and the time should be X. But that's not generally how most network
Starting point is 00:13:34 time protocol daemons work. What they'll do is they'll take a number of samples from a single server, but many times you have multiple servers configured. So you'll take many samples from multiple servers, and you'll sort of apply different criteria to these things to decide if you should even consider them. I think the reference implementation of NTPD has this notion of a popcorn spike, where if you're offset, you know, if you've gotten back 30 samples and they all kind of look about the same, but then you get one that's wildly off, you just kind of say like, I'm going to ignore that one because likely that was just due to some crazy queuing somewhere in the network or something like that. You can sort of think of this as a kind of voting algorithm.
Starting point is 00:14:08 Like you have a bunch of different oracles that are telling you things about what time is. You kind of bring them all together and have some way of computing an aggregate assumption about what the time currently is that tries to be robust to errors and drop outliers. Yeah, I think that's right. You try to pick out the people who are lying to you, right? Some of those servers you might be talking to upstream might just be telling you incorrect things. They're generally referred to in sort of NTP parlance as false tickers. And the ones who are not false tickers are true chimers. I'm not sure why exactly these are the names, but these are some of the names you might see if you're looking around the internet. So you try and pick out the ones that are telling you the truth. You apply some other
Starting point is 00:14:44 various heuristics to them to try and figure out which ones you think are the best, right? Which ones maybe have the sort of smallest error bars, even though you might think that these are decent sources to use. Some of them might have wider error bars than others, right? Your samples may represent a sort of wider range than the other ones. So you try and figure out which ones are the best. And then you use that to sort of tell your operating system to effectively speed up or slow down its frequency correction for like how off it is and try and sort of remove that error over time you don't just like abruptly adjust the time that the system thinks it is most time daemons will not aggressively step the clock the reason for that is that most applications do not enjoy when the time just changes drastically especially not when it changes drastically in the negative direction. This highlights another property you want
Starting point is 00:15:28 out of your clocks, which we haven't touched on yet, which is we said we want our clocks to be accurate. Your criterion for what it means for them to be right is you go to them and ask them what time it is, and they give numbers pretty close to each other. But there's another property you want, which is you want the clocks to, in a micro sense, advance about a second per second. And you especially want it to never go backwards because there are lots of algorithms on a computer which are making the assumption implicitly and naively, reasonably that the clocks only go forward. And lots of things can get screwed up when you allow clocks to jump backwards. Right. So a way that you can maintain that property that you just mentioned
Starting point is 00:16:03 while still correcting is simply effectively tell the operating system like, hey, slow down. I want to have time slow down effectively such that this error gets removed, but I don't have to actually step time backwards and make applications sad. in a serious way was I was asked to look at Jane Street's clock synchronization on its Windows machines. I wrote a small program that sent little NTP packets to the Windows machines, which knew the NTP protocol and responded appropriately. And they had the four timestamps on them. And instead of trying to compute the average, I could compute an upper and lower bounds on how far the clock sync was off and generate a little graph to see what was going on. And I remember being quite surprised to discover that if you graphed how far off the clocks were, you would see this weird sawtooth behavior where the clocks would go out of sync and then bang, they would be in sync again, and then slowly out of sync and then in sync again. And so that's because the Windows NTP daemon we were running was configured
Starting point is 00:17:01 to just smash the clock to the correct value and not do any of the kind of adjusting of the rate, which I think if I remember correctly, that's called slewing, which is I think a term I've heard in no other context. Yep, that is correct. Okay. So NTP lets you in fact do the slewing in an appropriate way. So you can keep the rates pretty close to the real-time rates, but still over time, slowly converge things back if they are far apart. In practice, how quickly can NTP bring clocks that are desynchronized back together? At least with some of the newer time daemons. So I don't know what the default is for the reference implementation.
Starting point is 00:17:35 I know with some of the newer daemons like Krony, the default, it takes 12 seconds to remove one second of error. So depending on how far away you were, you can sort of like work it out, right? 12 seconds to remove one second of error. So depending on how far away you were, you can sort of like work it out, right? 12 seconds to remove one second. So if you were a day, it'd be 86,400 times 12, which is a lot of seconds. So that's actually quite fast, which means the rate at which the clock moves forward can be off by on the order of 10%, which is a pretty aggressively pushing on the gas there. And these knobs are adjustable. If you really want to, you can sort of change the max rate at which they will attempt to make these adjustments.
Starting point is 00:18:06 So we had clock synchronization working on our systems before you started in 2012, and yet you needed to redo it. What were we doing, and why did we have to redo it? So we did what I'm sure lots of people do. We discussed GPS appliances before, so we have some GPS appliances which are bringing us accurate-ish time. And then we pointed a bunch of time servers at those GPS appliances using NTP. And we pointed a bunch of clients at those time servers. And we sort of dusted our hands off and said, ah, done. There was no real requirements around what is the maximum error? You know, are we trying to maintain anything? You know, if you look at any given time, can you tell us how far off any given system is from, say, UTC?
Starting point is 00:18:42 And so that served us fine for a while. The main motivation for some of the work that was done was a bunch of different financial regulations that happened in Europe, but one of them specifically had to do with time synchronization. And what it said was that you have to be able to show that your clocks are in sync with UTC to within 100 microseconds. So the 100 microsecond number was the big change. At first, when we first heard this requirement, it's like, well, maybe we're good. We don't actually know at the moment. Maybe we're just fine. Okay. And so we looked at it and were we just fine? No, definitely not. So it was, I think I said it before, but most systems were a couple hundred
Starting point is 00:19:18 microseconds out. But the real problem or one of the real problems was that they would bounce all over the place. Sometimes they could be relatively tight, say 150 microseconds, but various things would happen on the system that could disturb them and knock them, say, 400 or 500 microseconds out of alignment. If a bunch of systems all start on a given computer at the same time and they all start using the processors very aggressively, that'll fundamentally change the heat profile of the system. As the heat profile changes, the frequency will change.
Starting point is 00:19:46 And then the time daemon might have a harder time keeping the correct time because the frequency is no longer what it was before. And it has to sort of figure it out as it goes. So I started sort of investigating, okay, how can we solve this problem? Like, what do we have to do? And sort of just started looking into various different things. So I didn't know at the beginning of all of this, like, can we solve this problem with NTP? Is NTP capable of solving this problem? Or do we have to use some different, newer, better protocol, right? Because NTP has been around for
Starting point is 00:20:14 a long time. What did you find? I definitely did the dumb thing, right? And I went to Google and I said, how do you meet MIFID 2 time compliance regulations or, you know, something along those lines and probably many different sort of combinations of those words to try and find all the good articles. If you just do that, what you find out is that everybody tells you you should be using PTP, which is the precision time protocol. It's a completely different protocol. And if you go read on the internet, you'll see that it is capable of doing much better, quote unquote, better time synchronization than NTP. But nobody really tends to give you reasons. Lots of people will say things like NTP is good to milliseconds. PTP is good to microseconds, but without any sort of information backing that. So if you just do that, you're like, well,
Starting point is 00:20:54 we should clearly just run PTP. No problem. Let's just do that. So I did a bunch of research trying to figure out, is that a good idea? So the first thing I also want to understand was what is magic about PTP? What makes it so much better than NTP such that you can say these things like NTP is good to milliseconds, PTP is good to microseconds. Where does the precision of the precision time protocol come from? Exactly. And what I found actually surprised me to some extent. The protocol is a little different. The sort of like who sends what messages when is a little bit different. It involves multicast, which is different. But at the end of the day, it's those same four timestamps that are being used to do the calculation, which I found a bit surprising. I was sort of like, huh, if it's the same four timestamps, more or less, what is it about PTP
Starting point is 00:21:38 that makes it much more accurate? And what I was able to find is I think it's basically three things. It's that many, many hardware vendors support hardware timestamping with PTP, right? So your actual network cards, we sort of talked about the package showing up at the network, it having to raise an interrupt, the CPU having to get involved, schedule the application, right? You do all this stuff and then eventually get a timestamp taken. PTP with hardware timestamping, as soon as that packet arrives at the network interface card, it can record a timestamp for it.
Starting point is 00:22:04 And then when it hands the packet up to the application, it says like, here's your packet. Oh, and by the way, here's the timestamp that came with it. And so we were talking before about trying to move those timestamps as close as you could, such that they actually, the difference of them represented the delay from the client to the server and from the server to the client. So if you push them sort of down the stack to the hardware, it means that you're going to have much more accurate timestamps and you have a much better chance that those things are actually symmetric, meaning you're taking good time. And it also removes a lot of the other uncertainty in taking those timestamps, such as scheduling delay, interrupt delay,
Starting point is 00:22:38 other processes competing for CPU on the box, stuff like that. So that's one thing. So you have hardware timestamping is like a PTP thing. The other thing is the frequency of updates. So I think by default, PTP sends its sync messages to clients once every second, whereas at least for the reference implementation of NTPD, I believe the lowest you can turn that knob for how often should I query my server is once every eight seconds. So you have the hardware timestamping, you have the frequency of updates. And then the other bit of it is the fact that lots of switches. So I think PTP was basically designed with the idea that you'd have all of the sort of network contributing to your time distribution. So all of your switches can also get involved and help you move time across a network while understanding their own delays that they're adding to it. Right. And so they can kind of remove their own delays and help you move time accurately across a network. At least that's kind of the intent of PTP. The idea is, I guess, you can do in some sense the moral equivalent of what NTP does with like the two middle timestamps, right?
Starting point is 00:23:33 There are two timestamps in NTP that come from the server that's reporting time. It's like when it receives and then when it sends out, and you get to subtract out from the added noise that gap between those two timestamps. And then the idea is you can do this over and over again across the network. And so delays and noise are introduced by, for example, queuing on the switch would go away. Like you would essentially know how much those delays were. And as a result, you could potentially be much more accurate. Yep. I think that's roughly the conclusion I came to that that's what makes PTP more accurate than NTP, which was surprising to me. And then I did a bunch of research and was
Starting point is 00:24:08 talking to various people in the industry at various conferences and stuff. And there was some agreement that you can make NTP also very accurate. You just have to sort of control some of these things, right? So there are, in addition to being able to do hardware timestamping with PTP packets, some cards these days support the ability to hardware timestamp all the packets. And if your machine is just acting as an NTP server and most of the packets it receives are NTP packets, well then you're effectively timestamping NTP packets. Some cards also will timestamp just NTP packets. They can sort of recognize them and timestamp only those. But it was sort of like, okay, if we have the right hardware, we can get the timestamping bit of it. That's kind of an interesting thing. With the
Starting point is 00:24:48 different NTPD implementation, so crony being the other implementation I'm talking about, as opposed to the reference one, you can tune that knob for how frequently you should pull your server. I think as much as like 16 times a second, there's a bit of like diminishing returns there. It's not always better to go lower. Point being, you can tune it to at least match sort of what PTP's default is of once a second. And the more I dug and the more I talked to people, the more people told me, hey, you definitely do not want to involve your switches in your time distribution. If you can figure out a way to leave them out of it, you should do so. I was happy to hear that in some ways because because right now,
Starting point is 00:25:25 the reliability or the sort of the responsibility of the time distribution kind of lies with one group, and that's fine. When you then have this responsibility shared across multiple groups, right, it becomes a lot more complicated. Every switch upgrade, suddenly you're concerned, well, is it possible that this new version of the firmware you're putting on that version of that particular switch has a bug related to this PTP stuff and is causing problems. So given all of that, I started to believe that it was possible that we could solve this problem of getting within 100 microseconds using NTP. And I sort of set out to try and see if I could actually do that. It seems like in some sense, the design of PTP where it relies for its extra accuracy on these switches violates this old end-to-end property
Starting point is 00:26:06 that people talk about as the design of the internet of trying to get as much of the functionality as you can around the edge of the system. And I think that is motivated by a lot of the kind of concerns you're talking about of you have more control over the endpoints and you want to rely on the fabric for as little as possible.
Starting point is 00:26:22 I guess the other reason you want to rely on the fabric, it's not just that there are different groups and like, oh, it's hard to work with people in different areas and coordinate. It's also in various ways, fundamentally harder to manage networks than it is to manage various other kinds of software. But the reality is in many organizations, in many contexts, a lot of getting the network right involves an extremely good, extremely well-trained, extraordinarily careful human being just going in and changing the configs and not getting it wrong. It's kind of a terrifying system. And the less complexity you can ask the people who have to go in and do this terrifying
Starting point is 00:26:57 thing of modifying the network, the less complexity you can ask them to take care of, the better. I mean, that's a very true point. And another aspect of it is having less black box systems involved. So Krony is an open source project. We can sort of inspect the code and see what it's doing and understand how it behaves. The GPS appliances are not, but the idea of having less black box systems
Starting point is 00:27:14 where, hey, that's really strange. Why did we see this spike? We have absolutely no idea. The amount we could minimize that kind of stuff, the better. Right. The primary currency of the sysadmin is inspectability. Yes.
Starting point is 00:27:25 You want to be able to go in and figure out what the hell is happening. Yes. Huge proponent of things that you can inspect and debug. You talked a bunch about hardware timestamping, and I have a kind of dumb mechanical question about all this, which is you talked about essentially software processes keeping clocks up to date. You have this NTP daemon that knows how to adjust rates of things and stuff. But then you talked about the NIC going in timestamping packets. So does the NIC have a hardware clock and the motherboard have a hardware clock of the CPU? How are these clocks related to each other? What's actually going on when you're doing hardware timestamping?
Starting point is 00:28:00 Yeah, the NIC also has a hardware clock. Is that a different time? Do you have to run NTP between the NIC and the host system? I think that would be challenging, but like, yes, you can use a thing from the PTP, the Linux PTP project to move time from a network card to the system. It's called PHC to Sys. That's just a thing you can do, right? Like you have time on your network cards. You can move that time to the system. You can move it from the system to another network card. You can kind of ship this time around in various ways. But yes, the cards themselves do also have a clock on them that you're also keeping in sync. So another thing you mentioned about PTP
Starting point is 00:28:32 is that it uses multicast. So I've had the chance to sit down and talk at length with Brian Nogito in a previous episode of this podcast about multicast. And I'm curious what role multicast plays here in the context of PTP. The whole idea is that at the root of this time distribution tree, you have what's known as a grandmaster, and you can have multiple grandmasters, and a grandmaster is just something that doesn't need to be getting PTP time from PTP. It's like, you know, the GPS reference or something else. You have this grandmaster, you can have multiple ones.
Starting point is 00:28:58 There's a thing called the best master clock algorithm for the participants of PTP to determine which of them is the best one to act as the grandmaster at any given time. And then the idea is that you multicast out these packets to say, here's my first timestamp. And it just makes it easier on the network. You know, as a PTP client, you just have to come on and start listening for these multicast messages. And then you can start receiving time as opposed to you having to actually reach out and be configured to go talk to the server. You can just sort of have less configuration and sort of start receiving these time updates
Starting point is 00:29:26 from the Grandmaster. Got it. So you think it's mostly a kind of zero configuration kind of story. It also makes it easier for the Grandmaster. You don't have to maintain these connections. You don't have to have all these sockets open. You just sort of have like one socket that you're kind of multicasting out. It's not 100% true because there's a delay request and a delay response message that's
Starting point is 00:29:43 involved in all this too. And it's also actually kind of strange. I think this was changed in the most recent version of PTP, but technically the way it works is the grandmaster sends this multicast message that is a synchronization message, which contains one of those timestamps. When the client receives it, it actually sends a multicast message back that says, hey, I'd like a delay request. And then when the grandmaster receives that, it sends out another multicast message that says like, here's the delay response, which is like kind of insane when you think about it, right? Because you're kind of involving all of these other potential peers that are listening in
Starting point is 00:30:15 on the network with like your exchange. And you can configure some of these open source projects like the Linux PTP project, which uses a daemon called PTP4L. You can configure it to do a hybrid model where it receives the sync message as a multicast message. But then since it knows where that message came from, it just does a unicast delay request and then delay response, which makes a lot more sense. Yeah. The base behavior you're describing sounds pathological, right? Essentially quadratic. You get everyone sends a message to everyone. That is not usually a recipe for good algorithmic complexity. I'm not sure why it was designed
Starting point is 00:30:48 that way. It could be that the original people were sort of thinking that you'd have these smaller domains where you have these boundary clocks where sort of you're multicasting, you're sort of limited to how many people you're talking to. But I kind of agree. The sort of default behavior seems a little crazy to me. And that's why in our case where we are using PTP, we're using it in a small area of the world. We have it configured to do that hybrid thing where the actual sync message comes in multicast, but the delay request and the delay response wind up being unicast. There's a major thing that I haven't touched on here, which is that like NTP, I mentioned it before, you have multiple servers and you kind of have this built-in notion of redundancy, right?
Starting point is 00:31:21 Where you're sort of comparing all the times from the different servers and you're trying to figure out which of them are false tickers, right? And so any of them misbehave, the protocol kind of has this built-in notion of trying to call them out and ignore them. With PTP, if we're talking about the single grandmaster, that would be a GPS appliance. And unfortunately, we have found black box GPS appliances to be less than ideal. It would be fine if you're just talking about a straight failure scenario, right? We have a GPS appliance. Maybe we have two of them. They have agreed amongst themselves
Starting point is 00:31:48 who is the grandmaster. One of them goes offline. The other one notices that. It picks up, starts broadcasting. Like that would be a perfectly fine world and I wouldn't be too concerned about it. But the thing that we've seen happen
Starting point is 00:31:57 is like we want to perform maintenance on a GPS appliance because its compact flash card is running out of life and we need to actually replace that. When you go to shut it down, it happens to send out a PTP packet that is like complete crazy pants, like just absolutely bonkers, makes no sense whatsoever. The timestamp is off the charts.
Starting point is 00:32:15 And we've had GPS appliances do things like that. And so part of my thinking through this was, you know, geez, at the end of the day, I really don't want to be pulled back to a single GPS appliance that is providing time to potentially large swaths of the network. Because if it goes crazy, there's no real provisions in PTP for finding the crazy person. Everybody will just follow those crazy timestamps wherever they lead. At least for a while. It sounds like there's a way of eventually deciding someone else is the right one to pay attention to, but it means for short periods of time,
Starting point is 00:32:47 you may end up just like listening to the wrong guy. I'm not an expert in exactly what's involved in the best master clock algorithm, but I thought what was in the best master clock algorithm was simply about how good your clock was. And so if you were sitting there saying, I have the best clock, it's fantastic. But then you were telling people the completely wrong time because you had some kind of a bug or misconfiguration. You would continue to operate in that mode indefinitely. That's fascinating. What it sounds like is, despite the fact that PTP is newer and in some ways shinier and in some ways having fundamentally better capabilities for some aspects of what it's doing, it also just threw out a bunch of the careful engineering
Starting point is 00:33:21 that had gone into NTP over a long period. Because NTP has significantly more robust ways of combining clocks than what you're describing for PTP. Yes, that was my interpretation of looking at all this stuff. It feels like we threw out a lot of the safety, and that makes me super nervous based on my experience with these devices. So here we are. We have an NTP solution that's not working well enough, and a PTP solution that's kind of terrifying. So where'd you go from there? So we're trying to build a proof of concept. At the end of the day, we sort of figured, all right, we have these GPS appliances. We talked about hardware timestamping before on the GPS appliances and how they can't hardware timestamp the NTP packets. So that's problematic. We thought,
Starting point is 00:34:01 how can we move time from the GPS appliances off into the rest of the network? And so we decided that we could use PTP to move time from the GPS appliances to a set of Linux machines. And then on those Linux machines, we could leverage things like hardware timestamping and the NTP interleaved mode to move the time from those machines onto machines further downstream. The NTP interleaved mode, just to give a short overview of what that means, when you send a packet, if you get it hardware timestamped on transmission, the way you use that hardware timestamp is you get it kind of looped back to you as an application. So I transmit a packet, I get the hardware timestamp after the packet's already gone out the door. That's not super useful from an NTP point of view, because really you wanted the other side to receive that hardware timestamp. And so the
Starting point is 00:34:49 interleaved mode is sort of a special way in which you can run NTP, in that when you transmit your next NTP packet, you send that hardware timestamp that you got for the previous transmission. And then the other side, each side can use those. I don't want to get into too much of the details of how that works, but it allows you to get more accuracy and to leverage those hardware timestamps on transmission. I see. And this was already built into existing NTP clients, this ability to take advantage of hardware timestamps by moving them to the next interaction.
Starting point is 00:35:16 That's not a thing you had to invent. Nope. It's existed for a while. And I think even with the reference NTP implementation, it can leverage timestamps taken at the driver level to do something similar. But Krony adds the ability to actually leverage hardware timestamps in the same fashion, sort of sending them in the next message so that they can calculate a more accurate difference. Because hardware timestamps are a relatively new invention of all of this, right? When NTP was designed, I don't think there was any devices that did hardware timestamping.
Starting point is 00:35:41 I think that is true. And as I was saying before, when this all first came to fruition, the things that supported hardware timestamping were PTP specific. Okay. So now you have an infrastructure where there's a bunch of GPS devices, a layer of Linux servers, which get time via PTP from those GPS devices, and then act as NTP servers for the rest of the network. So maybe I missed this before, but why does that first layer have to use PTP rather than NTP? The major reason is that the GPS appliances, apropos of what we were just talking about, the GPS appliances will hardware timestamp their PTP because they have dedicated cards for it,
Starting point is 00:36:23 but they don't hardware timestamp their NTP. So the quality of time that you're getting off of the GPSs, if you're talking NTP to them, like if you just remove the time servers and you have the clients talk directly to the GPS appliances, for example, it's just going to be a lot lower quality. And to be honest, I don't know if they support the interleaved mode of NTP. Like it's not something I ever really dug into. It sort of goes back to that black box thing of like, well, we can configure this thing in such a way that it spits out hardware timestamp PTP and be relatively confident that it's doing that job. But anything more esoteric gets a little dicey. Got it. And you solve the false ticker problem by basically having each Linux server that's acting as a kind of NTP, marrying each one of those to an individual GPS device. So if that GPS device is crazy, then the Linux server will say crazy things,
Starting point is 00:37:04 but then things internally on the network are going to, in a fault-tolerant way, combine together multiple of these Linux servers and be able to use the high-accuracy way of getting data from those servers. That's exactly right. We constrain the servers that any given client can pick using various sort of knobs within Krony
Starting point is 00:37:21 because we want to meet certain requirements. And so we would like to ensure that any given client is going to talk to its local NTP server, as opposed to one that is say, you know, 600 mics away somewhere. Because as soon as you go to talk to that one that's 600 mics away, you introduce a lot of potential error. And so what we do is we force the NTP clients to talk to their local servers. And then we also configure them to talk to a bunch of other servers, which are sort of too far away to get very high accurate time. But we use them just as you described to sort of call out and figure out if either of the two local ones
Starting point is 00:37:51 have gone crazy. If both of the two local ones have gone crazy, well, we're kind of out of luck. How well did this all work in practice? It worked surprisingly well. So sort of designing the system, coming up with a system that can do this stuff and remain fault tolerant and all that is one thing. But then there's also the other thing of like, show me that you're within 100 microseconds of UTC. So that required understanding, well, what are the errors? And that comes back to the sort of asymmetry question and understanding things like if the NTP daemon is not accepting updates from its server for whatever reason, because maybe it thinks the server is crazy or because it thinks that the samples it just took are incorrect. Like maybe you had a popcorn spike or something like that. It'll do a thing where it'll basically increment a number that represents its uncertainty about how much it might be drifting while not getting good updates from its server. And so you kind of have to add together all these different errors. You have that one, you have the known error introduced by the actual time daemon, the time daemon, what it knows how far off it is.
Starting point is 00:38:47 And then you have that round trip time divided by two that I mentioned. So if you take all that, add it together, you have to do a similar thing for the PTP segment I mentioned. And then you have to add on the 100 nanoseconds for the GPS that I mentioned. You can add all that together. Most of our servers, we can show that we are absolutely no worse error than about 35 microseconds. Most of the time, assuming not some extenuating circumstances.
Starting point is 00:39:05 So a design choice that we made in this whole thing was your best bet for getting good time to clients is to have a dedicated network for it, right? Dedicated NIC, dedicated network, have it be quiet, nice, no interference. But that's expensive and annoying and nobody really wants to do that. It's expensive in a few ways. It's expensive in the physical hardware you have to provision, but it's also just expensive in the net complexity of the network, right? I think there's a lot of reasons why we want to try and keep the network simple. And having a whole
Starting point is 00:39:30 separate network just sounds like a disaster from a management perspective. Agree, right. So I was like, I really don't want to go down that road. So we sort of said like, well, let's see what happens. And so I was just saying, you know, most of the time we can attest that we are better than 35 microseconds, you know, 35 mics of error, worst case. But there are situations where you can cause that to fall over. You can, for example, we have some clients that don't support hardware timestamping. They're just older generation servers. They don't have NICs with hardware timestamping. If on those things, you just saturate the NIC for five minutes solid, you're probably going to get your error bounds outside of a hundred mics, just going to happen. But on newer machines that do support hardware timestamping, you can do things like that
Starting point is 00:40:07 and still stay within 50 mics of UTC, which is pretty cool. Some of this is built upon the fact that we know we have a very smart networking team. We're confident in the ship that they run and the way our networks are built and that kind of stuff. And that lends something to the not wanting to build a dedicated time network. And we think we can get by without it. So that's sort of where we ended up. Around 35 mics.
Starting point is 00:40:28 I want to say it's 35 to 40 mics for systems that don't have hardware timestamping on the client side, and closer to 20 mics for systems that do have hardware timestamping on the client side. And as I mentioned, the systems that do have hardware timestamping on the client side are kind of more robust to problems, you know, to just things that people might do. You know, maybe somebody's debugging something and they want to pull a 10 gig core dump off of a machine.
Starting point is 00:40:49 They're not thinking about the timestamping on the machine right now. Like they're focused on their job to try and actually figure out what happened with that system. So the other aspect of all of this was reporting on it and showing it, right? How do we surveil to show that we are actually in compliance? And so for that, we took what we think is a relatively reasonable approach, which is we sample. So there's kind of no interval at which you can sample, which is small enough. If you want to be like absolutely sure that all the time you were never out of compliance, right? You could say, well, what's a reasonable sample every five
Starting point is 00:41:16 seconds? Nah, that's definitely too much. Okay. Every one second, maybe that's fine. Every hundred milliseconds, right? Like, so where do you stop? So we sort of decided that for machines that go out of compliance, it is likely that if we sampled every 10 seconds, we would pick them up because it's not like there's these crazy perverse spikes that sort of jump in there and then disappear. It is more like somebody's SCPing a large file across the network
Starting point is 00:41:37 or something is misconfigured somewhere. And then therefore it is a sort of persistent state that sticks around for a while. So we sample every 10 seconds, pulling out these sort of various stats I mentioned about what represents the total error. And then we pump that into effectively a big database all day long. And then at various times throughout the day, we surveil that database and look for any systems that are not sort of meeting their time obligations.
Starting point is 00:41:58 We hold different systems at different levels of accuracy. So after all of this, not that I want to call this into existence, but imagine that there's a new version of European regulations, MIFID 3 comes out and says, now you have to be within 10 microseconds. Assuming that technology is still as it is now, what would you have to do to get the next order of magnitude in precision? Not this. So I think probably you'd want to go to something like PTP, but probably not just PTP directly. There's a thing called White Rabbit, which is kind of like some PTP extensions, basically. I think it might actually be completely formalized in the most recent PTP specification. But that is a combination of roughly PTP with synchronous Ethernet. So
Starting point is 00:42:42 synchronous Ethernet allows you to get syntonization across the network. So you can sort of make sure that the frequencies are the same. Can I ask you what the word syntonization means? Just basically means that those two, the frequencies are in sync. So it doesn't mean that we have the same time, but it means that we are sort of advancing at the same rate. I see. So there are techniques essentially that let you get the rates the same without necessarily getting the clocks in sync first. Correct. And it's my understanding that White Rabbit sort of uses this idea that you can have the rates the same with PTP to work out some additional constraints that they can solve and get sub
Starting point is 00:43:13 nanosecond time synchronization. I think we would have to put a lot more thought into the reliability and redundancy story. I sort of discounted PTP because it didn't necessarily have the best reliability redundancy story. I sort of discounted PTP because it didn't necessarily have the best reliability redundancy story. It's not to say we couldn't have figured out a way to make it work for us. We almost certainly could have. You could have two grandmasters, one sitting there as the primary doing its normal operation, one sitting there as a standby. And if the primary one goes crazy for some reason, you could have some automated tooling or something that an operator could use to take it out of service and bring the secondary into service and only have
Starting point is 00:43:47 maybe a minor service disruption. I can imagine us doing that work, but given the problem we were trying to solve, it seemed like not necessary. We can solve this problem using this existing technology, but I do think if we had to go much lower, like you said, an order of magnitude lower, we'd have to start looking at something else. Well, thank you so much. This has been really fun. I really enjoyed learning about how the whole wild world of clock synchronization is knit together. Well, thank you very much. It was a pleasure being here. It's a pleasure talking about these things. It's fun to try and solve these interesting, challenging problems. You can find links related to the topics that Chris and I discussed,
Starting point is 00:44:24 as well as a full transcript and glossary at signalsandthreads.com thanks for joining us and see you next week
