Signals and Threads - Clock synchronization with Chris Perl
Episode Date: October 14, 2020
Clock synchronization, keeping all of the clocks on your network set to the “correct” time, sounds straightforward: our smartphones sure don’t seem to have trouble with it. Next, keep them all accurate to within 100 microseconds, and prove that you did -- now things start to get tricky. In this episode, Ron talks with Chris Perl, a systems engineer at Jane Street, about the fundamental difficulty of solving this problem at scale and how we solved it. You can find the transcript for this episode along with links to things we discussed on our website.
Transcript
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack, from Jane Street.
I'm Ron Minsky.
Today we're going to talk about a deceptively simple topic, clock synchronization.
I think there's nothing like trying to write computer programs to manipulate time to convince you that time is an incredibly complicated thing.
And it's complicated in like 16 different ways, from time zones to leap seconds to all sorts of other crazy things. But one of
the really interesting corners of this world is how do you get all of the clocks on your big
computer network to roughly agree with each other? In other words, clock synchronization.
So we're going to talk about that with Chris Perl, who's a sysadmin who's worked at Jane Street since 2012.
Chris is better than anyone I have ever worked with at diving into the horrible details of
complex systems and understanding how they work and how they can be made to work better.
And he's done a lot of work here, specifically on clock synchronization, and has, in the
course of that, redone our entire system for doing clock synchronization.
So he's had an opportunity to really learn a lot about the topic.
So Chris, to get started, can you give us just a quick overview of how computer clocks
work in the first place?
So I guess the rough gist is something like you have some oscillator, a little crystal
effectively that's inside the computer that is oscillating at some frequency, and that's
driving an interrupt that the operating
system is going to handle in some level.
Like, you know, there's probably lots of details here that I'm just skipping over, but that's
driving an interrupt that's going to happen in the operating system.
Then the operating system is using that to derive its notion of time.
And so if you have like a really high quality oscillator and like those time interrupts
happen at the right rate so that you're tracking real time, that might just happen.
And if your oscillator is very good and very stable, it could actually just be pretty close to the correct time just by virtue of that.
But the truth is that most computers come with fairly bad oscillators
and they change their frequencies for various reasons like heat.
So if you are using your computer to compile the Linux kernel or something like that,
that could change the heat profile, change the frequency of the oscillator,
and actually change how well you're doing at keeping real time. When we naively think of clock synchronization as people, we think of it as,
I'm going to go set my clock. I'm going to look at what time it is and adjust my clock to match
whatever real time is. But you're actually talking about a different thing here. You're talking
not just about setting what the right time is right now, but keeping that time correct,
essentially keeping the rate at which time is going forward in sync. Correct. You'd love it if you could get like a really, really high quality oscillator for super
cheap in all your computers, and then you wouldn't need a lot of adjustment to get them to keep the
correct time, but that would be really expensive. You can buy such things, they just cost a lot of
money. So you say that heat and various other things that are going on the computer will cause
this rate at which time is appearing to march forward inside of your computer to drift
around. How accurate are these? Can you give me a numerical sense of how far these things drift
away? The stuff that we run, we capture some of these statistics. We see machines that have a
frequency correction applied to them of, say, 50 parts per million, which is like 50 microseconds per second. So that works out to roughly a few seconds per day of drift. But I'm sure that if you had like
a super old desktop under your desk that you stole from your parents or something, and you were trying
to rebuild into a Linux box, you might have worse numbers than that. A sort of relatively current
generation server from a well-known vendor, you're talking somewhere around 50 to 100
microseconds per second that they can sort of walk off out of alignment. Okay. So clock synchronization is the process of trying
to get all of those clocks that you have across your whole data center and across multiple data
centers to be in sync with each other. Is that the right way of thinking about it?
I think so. In sync is like an interesting thing to say, right? You would like that if you were
able to instantaneously ask two random servers on your network, what time it was at the same exact point
in time, if you could somehow magically do that, that they would agree to some relatively small
margin of error. And I think that's kind of what we mean by clock synchronization,
that if you could somehow magically freeze time and go ask every single computer on your network,
what time do you think it is that they would all roughly agree to within some error bound that you can define?
Right. And this basic model actually assumes that there is a well-defined notion of what it
means to be instantaneously at the same time, which isn't exactly true because of relativity
and stuff like that, but we're going to mostly ignore that. So I guess one property that you're
highlighting here is having the clocks agree with each other. And that's part of it. But there's another piece, right? Which is having the clocks agree with some external reference. There's some notion of like,
what does the world think the time is? So where does that external reference come from?
I'm not an expert on this stuff, but I'll give you this sort of 10,000 foot view. You have
various physics laboratories all over the world, like NPL in the UK and other places across the world, they all have measurements of what they think time is using things like hydrogen masers and sort of very accurate atomic methods.
They contribute all of that stuff to a single source who kind of averages it or does some sort of weighting to come up with like what the correct time is.
And then you kind of magic that over to the Air Force, who then sends it up to the GPS constellation. And GPS has a mechanism for getting time from
the GPS satellites down to GPS receivers. And so if you're a person who runs a computer network,
and you're interested in synchronizing your clocks to a relatively high degree of accuracy
with something like UTC, which is effectively Greenwich Mean Time. It is just sort of like
the current time without time zones applied. If you're interested in doing that, what you can do
is you can just go out to a vendor and you can buy a thing called a GPS appliance, which can
hook up to a little antenna that goes onto the roof. It can receive this signal from the GPS
constellation and basically gives you out time. And the accuracy there, it's something like,
you know, maybe 100 nanoseconds or so
is like how accurate you're going to get from GPS.
So you've got this sort of atomic measurements
being fed up to a GPS constellation
down to GPS receivers
that you as an operator of a computer network can buy.
And for the purposes of this conversation,
we're going to treat those GPS receivers
as the received wisdom as to what time it is.
And our job is to figure out
how inside of a computer network,
you make all of the different devices
agree with each other
and agree with that external reference.
Correct.
Why is it important?
What does having synchronized clocks help you do?
If you put yourself in the shoes
of a financial regulatory authority
and you have all these different participants out there
doing stuff with computer systems
and something
weird happens and you'd like to come up with a total ordering of events of what led to this
crazy thing or what led to this good thing, who knows, but you want to have a total ordering of
events. If people don't have good clock synchronization to some external source,
you can't compare the timestamp from participant A to the timestamp from participant B. So if you
were to decree, everybody must have time that is within some error bound. If these timestamps are within that error bound, well, then I can't be
sure about the ordering. But if they're farther away than that, then I can be sure about the
ordering. I can know which one came first and which one came second. And that can be very useful.
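To make that error-bound reasoning concrete, here is a tiny sketch; the function name and the example numbers are purely illustrative, not anything from the regulation itself.

```python
# Toy illustration of the ordering argument above: if each participant's
# timestamp is only guaranteed to be within `bound` of true time, two events
# can only be ordered with certainty when their timestamps differ by more
# than the combined uncertainty.

def definitely_before(t_a: float, t_b: float, bound: float) -> bool:
    """True only if event A certainly happened before event B, where t_a and
    t_b are reported timestamps (seconds) and each clock may be off from
    true time by at most `bound` seconds."""
    # Worst case: A's clock was reading `bound` slow and B's was `bound` fast.
    return t_a + bound < t_b - bound

# With a 100-microsecond bound on each clock, timestamps 250 microseconds
# apart are conclusively ordered; timestamps 50 microseconds apart are not.
assert definitely_before(0.0, 250e-6, 100e-6)
assert not definitely_before(0.0, 50e-6, 100e-6)
```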
So that's a motivation that's very specific to our industry. But don't people in other industries
care a lot about clock synchronization too? I would have thought that there are other
reasons that would drive you to want to synchronize the machines on a network.
Oh, sure. There's lots of different things. I mean, just like a general sysadmin-y topic,
a lot of times you want to gather logs from all the systems on your computer network and you want
to analyze them for various reasons. Maybe it's because you're concerned about intruders,
or maybe it's because you're just trying to understand the way things are functioning.
And if your clocks aren't synchronized, it's very hard to kind of understand
things that might've happened on system B and how they relate to system A because the two
timestamps are just not, you just can't compare them if they're not synchronized.
And I suppose there are also some distributed systems, algorithmic reasons to want clocks.
Certainly some kind of distributed algorithms end up using clocks as ways of breaking ties
between systems. And so that requires at least some reasonable level of synchronization.
For sure. There's also other network protocols that are widely used that require clock
synchronization, but much less precise levels of clock synchronization, right? Like Kerberos is a
widely used authentication protocol. And that requires that the clocks be synchronized to like
within five minutes. And the idea there is to like thwart replay attacks and stuff like that.
So making sure that somebody can't obtain your credentials from a couple of days ago
and use them again kind of a thing.
So there it's like the error bars are very wide,
but there's still some sort of synchronization necessary.
Right, and I guess that's a general theme
when thinking about synchronization
is different applications
require different levels of synchronization,
but more closely synchronized never hurts.
There's definitely trade-offs
as you start to get into the lower bounds,
but like, yeah,
if it were all free, sure. I'd like to have them exactly the same.
How do people normally approach this problem of clock synchronization? What are the standard
tools out there?
Most people, you just kind of run whatever your distribution shipped as an NTP daemon. So NTP
stands for the network time protocol. And it is a protocol that up until not that long ago,
I just kind of assumed used some
kind of magic. It knows how to talk to some servers on the internet or some local servers
that you probably then have talking to servers on the internet. And it synchronizes your clocks
with those servers. It's exchanged some packets. Maybe it takes a little while, maybe a few minutes,
maybe longer. You probably don't understand exactly why, but eventually your clocks are
like relatively in sync to within maybe, you know, tens or so of milliseconds.
Can you give us a tour of how NTP actually works?
Like I said, for a long time, I just kind of assumed it was magic and didn't really
think too hard about it.
And then at some point I got tasked within Jane Street to actually look at some of this
stuff and try and meet some requirements that were a little bit harder than the sort of
standard, you know, tens of milliseconds synchronization.
So I actually went and just like, was like, okay, well, how does NTP do this from first
principles, right?
Like, let's go read some of the papers from David Mills.
Let's just go see if we can actually reason this out ourselves.
At the end of the day, it's just four timestamps.
There's a lot more complicated stuff around it, but like the sort of core of it is these
four timestamps.
Let's say I'm the client and you're the server.
First, I send you a packet, but before I do, I write down a timestamp.
When you receive that packet, you write down a timestamp. Then you send me a reply,
and before you do, you write down a timestamp. Finally, when I receive that reply, I write down
a timestamp. It may not seem that groundbreaking, but with just those four timestamps, I can compute
two important numbers, the offset and the delay. The offset is how far my clock is off from yours.
So if you think it's 12:00 p.m. and I think it's 12:05 p.m., then the offset would be five minutes. The delay is how long it took those packets to traverse the
network. To compute those numbers, you basically take a system of equations. And for me, an
important aspect was actually writing down with a piece of paper and a pencil and solving these
equations myself. It was understanding that there's a sort of huge assumption in this protocol,
that the delay for the first packet, where I timestamped and then you did, and the delay for the second packet, where you timestamped
and then I did, the assumption is that those times are the same. And if they're not the same,
they introduce what's called error. And that is a sort of very important aspect. That is an
assumption that is made such that you can actually solve those equations to get the offset and the
delay. Can you maybe explain what it is about the calculation that depends on the symmetry assumption? Sure. From those four timestamps, you know the total round trip time minus the time the packet spent on the server, and that is the sum of two delays: the time it took the packet to get from me to you, and the time it took the packet to get from you to me.
And you're like, well, what do I do with this information?
And you say, well, what if I just assume that those two delays are equal?
And if I assume that those two delays are equal, well, then I can start rearranging the various pieces of the equation.
And then that's how you can actually solve for the delay and the offset.
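For readers who want the arithmetic spelled out, here is a minimal sketch of the standard four-timestamp calculation; the variable names are mine, and the symmetry assumption is exactly the one Chris describes.

```python
# Minimal sketch of the four-timestamp NTP arithmetic.
#   t1: client timestamp just before sending the request
#   t2: server timestamp on receiving the request
#   t3: server timestamp just before sending the reply
#   t4: client timestamp on receiving the reply

def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float):
    # Round-trip network delay: total elapsed time on the client,
    # minus the time the server spent holding the packet.
    delay = (t4 - t1) - (t3 - t2)
    # How far the client's clock trails the server's, *assuming* the outbound
    # and return trips took equally long. If they did not, the asymmetry
    # shows up as error: the true offset lies within offset +/- delay / 2.
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    return offset, delay

# Example: server clock 5 minutes (300 s) ahead, 10 ms each way on the wire,
# 1 ms of processing time on the server.
offset, delay = ntp_offset_and_delay(0.000, 300.010, 300.011, 0.021)
# offset == 300.0 seconds, delay == 0.020 seconds
```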
What's the role of the two timestamps on the
server side? So if you ask me what time it is, I write down when I receive it,
and then I write down the time where I send it back. You could imagine a simpler protocol with
just three timestamps. Then you just assume that that time that I wrote down happened in the middle
of that interval, the interval between the time you sent the first message and received the second
message. How do you know when in the middle is, right?
There's lots of vagaries that happen with operating systems.
Like if you sort of timestamp it on either end, like as soon as you receive the packet,
you timestamp it, and then maybe you have to do some work.
And then right before you send it back, you timestamp it.
And that's sort of how you get closest to those differences I mentioned representing
the actual network delay from one to the other.
And I guess an extra assumption that you're making here is that in that period between the first
timestamp and the second timestamp, you had better assume that the rate at which the clock is going
forward is about right. I think that throws another error term into the equation. It's,
I think, typically extremely small, right? It certainly seems like something you can in
practice ignore, because if you just look at the number of parts per million or whatever that you were talking about in terms of how much drift there is in a real computer clock, I think that is in fact pretty tiny.
Right. But you've got the correction being applied by the time daemon that's running on the computer, which is keeping the clock in sync. The server on the other side of this communication is also taking time from somewhere, either a reference clock or some sort of higher upstream stratum in NTP, like clocks that are better than it, something like
a GPS receiver. And it has applied a sort of correction to the operating system to say, hey,
I currently believe that the frequency is off by this much. Please correct that when you hand me
back time. So I feel like your biggest, I guess, to your point of being able to ignore it in
practice, your biggest concern would be if in between those two timestamps,
something massive changed, like the temperature rose or dropped by like many, many degrees or
something such that that frequency correction was now just wildly incorrect. Okay. So we have now
a network protocol: put down a timestamp, send a message, two more timestamps on the server, and another when
you get it back. Now the computer that started all this has some estimate for how much its clock is off. What does it do then? In the simple world, you could just set your time.
You could just sort of say, and the time should be X. But that's not generally how most network
time protocol daemons work. What they'll do is they'll take a number of samples from a single
server, but many times you have multiple servers configured. So you'll take many samples from
multiple servers, and you'll sort of apply different criteria to these things to decide if you should even consider them. I think the
reference implementation of NTPD has this notion of a popcorn spike, where if you're offset,
you know, if you've gotten back 30 samples and they all kind of look about the same,
but then you get one that's wildly off, you just kind of say like, I'm going to ignore that one
because likely that was just due to some crazy queuing somewhere in the network or something like that.
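To make that sample-filtering idea concrete, here is a toy sketch; the median-based rule and the numbers below are purely illustrative, not how the reference NTPD or chrony actually weigh samples.

```python
import statistics

# Toy illustration of sample filtering: throw away samples that sit far from
# the pack (the "popcorn spike" idea), then combine what remains. Real time
# daemons use much more careful statistics than this.

def combine_offsets(offsets: list[float], spike_factor: float = 4.0) -> float:
    median = statistics.median(offsets)
    spread = statistics.median(abs(o - median) for o in offsets) or 1e-9
    kept = [o for o in offsets if abs(o - median) <= spike_factor * spread]
    return statistics.mean(kept)

# Thirty samples clustered around 120 microseconds plus one wild 5 ms outlier
# from queuing delay: the outlier gets dropped, the rest are averaged.
samples = [120e-6 + i * 1e-7 for i in range(30)] + [5e-3]
print(combine_offsets(samples))  # ~120 microseconds
```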
You can sort of think of this as a kind of voting algorithm.
Like you have a bunch of different oracles that are telling you things about what time is.
You kind of bring them all together and have some way of computing an aggregate assumption about what the time currently is that tries to be robust to errors and drop outliers.
Yeah, I think that's right.
You try to pick out the people who are lying to you, right? Some of those servers you might be talking to upstream might
just be telling you incorrect things. They're generally referred to in sort of NTP parlance
as false tickers. And the ones who are not false tickers are true chimers. I'm not sure why exactly
these are the names, but these are some of the names you might see if you're looking around the
internet. So you try and pick out the ones that are telling you the truth. You apply some other
various heuristics to them to try and figure out which ones you think are
the best, right? Which ones maybe have the sort of smallest error bars, even though you might think
that these are decent sources to use. Some of them might have wider error bars than others, right?
Your samples may represent a sort of wider range than the other ones. So you try and figure out
which ones are the best. And then you use that to sort of tell your operating system to effectively speed up or slow down its frequency correction for like how off it is and try and sort of remove that error
over time. You don't just abruptly adjust the time that the system thinks it is. Most time daemons will not aggressively step the clock. The reason for that is that most applications do not enjoy when the time just changes drastically, especially not when it changes drastically in the negative direction. This highlights another property you want
out of your clocks, which we haven't touched on yet, which is we said we want our clocks to be
accurate. Your criterion for what it means for them to be right is you go to them and ask them
what time it is, and they give numbers pretty close to each other. But there's another property
you want, which is you want the clocks to, in a micro sense, advance about a second per second.
And you especially want it to never go backwards because there are lots of algorithms on a computer
which are making the assumption implicitly and naively, reasonably that the clocks only go
forward. And lots of things can get screwed up when you allow clocks to jump backwards.
Right. So a way that you can maintain that property that you just mentioned
while still correcting is simply effectively tell the operating system like, hey, slow down. I want to have time slow down effectively such that this error gets removed, but I don't have to actually step time backwards and make applications sad. The first time I dealt with this problem in a serious way was I was asked to look at Jane Street's clock synchronization on its Windows
machines. I wrote a small program that sent little NTP packets to the Windows machines,
which knew the NTP protocol and responded appropriately. And they had the four timestamps
on them. And instead of trying to compute the average, I could compute an upper and lower
bounds on how far the clock sync was off and generate a little graph to see what was going on.
And I remember being quite surprised to discover that if you graphed how far off the clocks were, you would see this weird sawtooth behavior where the clocks
would go out of sync and then bang, they would be in sync again, and then slowly out of sync and
then in sync again. And so that's because the Windows NTP daemon we were running was configured
to just smash the clock to the correct value and not do any of the kind of
adjusting of the rate, which I think if I remember correctly, that's called slewing,
which is I think a term I've heard in no other context.
Yep, that is correct.
Okay. So NTP lets you in fact do the slewing in an appropriate way. So you can keep the rates
pretty close to the real-time rates, but still over time, slowly converge things back if they are far apart. In practice, how quickly can NTP bring clocks that are desynchronized back together?
At least with some of the newer time daemons.
So I don't know what the default is for the reference implementation.
I know with some of the newer daemons like chrony,
the default, it takes 12 seconds to remove one second of error.
So depending on how far away you were, you can sort of work it out, right? 12 seconds to remove one second. So if you were off by a day, it'd be 86,400
times 12, which is a lot of seconds. So that's actually quite fast, which means the rate at
which the clock moves forward can be off by on the order of 10%, which is pretty aggressively
pushing on the gas there. And these knobs are adjustable. If you really want to, you can sort
of change the max rate at which they will attempt to make these adjustments.
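As a quick back-of-the-envelope on those numbers (12 elapsed seconds per corrected second is the chrony default Chris mentions; the rest is just arithmetic):

```python
# Back-of-the-envelope for the slewing numbers above: at 12 seconds of real
# time to remove 1 second of error, the clock runs fast or slow by about
# 1/12 while slewing, which is the "order of 10%" figure.

error_seconds = 86_400        # a full day of accumulated error
ratio = 12                    # chrony default: 12 s elapsed per 1 s removed
print(error_seconds * ratio)          # 1036800 seconds, roughly 12 days
print(f"{1 / ratio:.1%}")             # ~8.3% rate adjustment while slewing
```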
So we had clock synchronization working on our systems before you started in 2012, and yet you needed to redo it. What were we doing, and why did we have to redo it?
So we did what I'm sure lots of people do. We discussed GPS appliances before, so we have some GPS appliances which are bringing us accurate-ish time. And then we pointed a bunch of time servers at those GPS appliances using NTP.
And we pointed a bunch of clients at those time servers.
And we sort of dusted our hands off and said, ah, done.
There were no real requirements around what is the maximum error?
You know, are we trying to maintain anything?
You know, if you look at any given time, can you tell us how far off any given system is
from, say, UTC?
And so that served us fine for a while. The main motivation
for some of the work that was done was a bunch of different financial regulations that happened in
Europe, but one of them specifically had to do with time synchronization. And what it said was
that you have to be able to show that your clocks are in sync with UTC to within 100 microseconds.
So the 100 microsecond number was the big change. At first,
when we first heard this requirement, it's like, well, maybe we're good. We don't actually know at
the moment. Maybe we're just fine. Okay. And so we looked at it and were we just fine?
No, definitely not. So it was, I think I said it before, but most systems were a couple hundred
microseconds out. But the real problem or one of the real problems was that they would bounce all
over the place. Sometimes they could be relatively tight, say 150 microseconds,
but various things would happen on the system that could disturb them
and knock them, say, 400 or 500 microseconds out of alignment.
If a bunch of systems all start on a given computer at the same time
and they all start using the processors very aggressively,
that'll fundamentally change the heat profile of the system.
As the heat profile changes, the frequency will change.
And then the time daemon might have a harder time keeping the correct time because the
frequency is no longer what it was before.
And it has to sort of figure it out as it goes.
So I started sort of investigating, okay, how can we solve this problem?
Like, what do we have to do?
And sort of just started looking into various different things.
So I didn't know at the beginning of all of this, like, can we solve this problem with NTP? Is NTP capable of solving this problem?
Or do we have to use some different, newer, better protocol, right? Because NTP has been around for
a long time. What did you find? I definitely did the dumb thing, right? And I went to Google and I
said, how do you meet MiFID II time compliance regulations or, you know, something along those
lines and probably many different sort of combinations of those words to try and find all the good articles.
If you just do that, what you find out is that everybody tells you you should be using PTP,
which is the precision time protocol. It's a completely different protocol. And if you go
read on the internet, you'll see that it is capable of doing much better, quote unquote,
better time synchronization than NTP. But nobody really tends to give you reasons. Lots of people will say things like NTP is good to milliseconds. PTP is good to microseconds,
but without any sort of information backing that. So if you just do that, you're like, well,
we should clearly just run PTP. No problem. Let's just do that. So I did a bunch of research trying
to figure out, is that a good idea? So the first thing I wanted to understand was what is magic about PTP? What makes it so much better than NTP such that you can say these
things like NTP is good to milliseconds, PTP is good to microseconds.
Where does the precision of the precision time protocol come from?
Exactly. And what I found actually surprised me to some extent. The protocol is a little
different. The sort of like who sends what messages when is a little bit different. It involves multicast, which is different. But at the end of the day, it's those
same four timestamps that are being used to do the calculation, which I found a bit surprising.
I was sort of like, huh, if it's the same four timestamps, more or less, what is it about PTP
that makes it much more accurate? And what I was able to find is I think it's basically three
things. It's that many, many hardware vendors support hardware timestamping with PTP, right?
So your actual network cards, we sort of talked about the packet showing up at the network card,
it having to raise an interrupt, the CPU having to get involved, schedule the application,
right?
You do all this stuff and then eventually get a timestamp taken.
PTP with hardware timestamping, as soon as that packet arrives at the network interface
card, it can record a timestamp for it.
And then when it hands the packet up to the application, it says like,
here's your packet. Oh, and by the way, here's the timestamp that came with it.
And so we were talking before about trying to move those timestamps as close as you could,
such that they actually, the difference of them represented the delay from the client to the
server and from the server to the client. So if you push them sort of down the stack to the hardware,
it means that you're going to have much more accurate timestamps and you have a much better chance that those things
are actually symmetric, meaning you're taking good time. And it also removes a lot of the other
uncertainty in taking those timestamps, such as scheduling delay, interrupt delay,
other processes competing for CPU on the box, stuff like that. So that's one thing. So you
have hardware timestamping is like a PTP thing. The other thing is the frequency of updates. So I think by default, PTP sends its sync messages
to clients once every second, whereas at least for the reference implementation of NTPD,
I believe the lowest you can turn that knob for how often should I query my server is once every
eight seconds. So you have the hardware timestamping, you have the frequency of updates.
And then the other bit of it is the fact that lots of switches support it. So I think PTP was basically designed with the idea that you'd have all of the sort of network contributing to your time distribution. So all of your switches can also get involved and help you move time across a network while understanding their own delays that they're adding to it. Right. And so they can kind of remove their own delays and help you move time accurately across a network. At least that's kind of the intent of PTP.
The idea is, I guess, you can do in some sense the moral equivalent of what NTP does
with like the two middle timestamps, right?
There are two timestamps in NTP that come from the server that's reporting time.
It's like when it receives and then when it sends out, and you get to subtract out
from the added noise that gap between those two timestamps.
And then the idea is you can
do this over and over again across the network. And so delays and noise are introduced by, for
example, queuing on the switch would go away. Like you would essentially know how much those delays
were. And as a result, you could potentially be much more accurate. Yep. I think that's roughly
the conclusion I came to that that's what makes PTP more accurate than NTP, which was surprising to me. And then I did a bunch of research and was
talking to various people in the industry at various conferences and stuff. And there was some
agreement that you can make NTP also very accurate. You just have to sort of control
some of these things, right? So there are, in addition to being able to do hardware timestamping
with PTP packets, some cards these days support the ability to hardware timestamp all the packets. And if your machine is just acting as
an NTP server and most of the packets it receives are NTP packets, well then you're effectively
timestamping NTP packets. Some cards also will timestamp just NTP packets. They can sort of
recognize them and timestamp only those. But it was sort of like, okay, if we have the right
hardware, we can get the timestamping bit of it. That's kind of an interesting thing. With the
different NTPD implementation, so chrony being the other implementation I'm talking about, as opposed to the reference one, you can tune that knob for how frequently you should poll your
server. I think as much as like 16 times a second, there's a bit of like diminishing returns there.
It's not always better to go lower. Point being, you can tune it to at least match sort of what PTP's default is
of once a second. And the more I dug and the more I talked to people, the more people told me,
hey, you definitely do not want to involve your switches in your time distribution.
If you can figure out a way to leave them out of it, you should do so. I was happy to hear that
in some ways because right now,
the reliability or the sort of the responsibility of the time distribution kind of lies with one
group, and that's fine. When you then have this responsibility shared across multiple groups,
right, it becomes a lot more complicated. Every switch upgrade, suddenly you're concerned, well,
is it possible that this new version of the firmware you're putting on that version of that
particular switch has a bug related to this PTP stuff and is causing problems? So given all of that, I started to believe that it was possible
that we could solve this problem of getting within 100 microseconds using NTP. And I sort
of set out to try and see if I could actually do that. It seems like in some sense, the design of
PTP where it relies for its extra accuracy on these switches violates this old end-to-end property
that people talk about as the design of the internet
of trying to get as much of the functionality as you can
around the edge of the system.
And I think that is motivated
by a lot of the kind of concerns you're talking about
of you have more control over the endpoints
and you want to rely on the fabric
for as little as possible.
I guess the other reason you want to rely on the fabric,
it's not just that there are different groups and like, oh, it's hard to work with people in
different areas and coordinate. It's also in various ways, fundamentally harder to manage
networks than it is to manage various other kinds of software. But the reality is in many
organizations, in many contexts, a lot of getting the network right involves an extremely good,
extremely well-trained, extraordinarily careful
human being just going in and changing the configs and not getting it wrong. It's kind of a terrifying
system. And the less complexity you can ask the people who have to go in and do this terrifying
thing of modifying the network, the less complexity you can ask them to take care of, the better.
I mean, that's a very true point. And another aspect of it is having less black box systems involved.
So chrony is an open source project.
We can sort of inspect the code
and see what it's doing
and understand how it behaves.
The GPS appliances are not,
but the idea of having less black box systems
where, hey, that's really strange.
Why did we see this spike?
We have absolutely no idea.
The amount we could minimize
that kind of stuff, the better.
Right. The primary currency of the sysadmin
is inspectability.
Yes.
You want to be able to go in and figure out what the hell is happening.
Yes. Huge proponent of things that you can inspect and debug.
You talked a bunch about hardware timestamping, and I have a kind of dumb mechanical question
about all this, which is you talked about essentially software processes keeping clocks
up to date. You have this NTP daemon that knows how to adjust rates
of things and stuff. But then you talked about the NIC going in timestamping packets. So does
the NIC have a hardware clock and the motherboard have a hardware clock of the CPU? How are these
clocks related to each other? What's actually going on when you're doing hardware timestamping?
Yeah, the NIC also has a hardware clock.
Is that a different time? Do you have
to run NTP between the NIC and the host system? I think that would be challenging, but like, yes,
you can use a thing from the linuxptp project to move time from a network card to the system. It's called phc2sys. That's just a thing you can do, right? Like you have time on
your network cards. You can move that time to the system. You can move it from the system to another
network card. You can kind of ship this time around in various ways. But yes, the cards themselves do
also have a clock on them that you're also keeping in sync. So another thing you mentioned about PTP
is that it uses multicast. So I've had the chance to sit down and talk at length with Brian Nigito
in a previous episode of this podcast about multicast. And I'm curious what role multicast
plays here in the context of PTP. The whole idea is that at the root of this time distribution tree,
you have what's known as a grandmaster,
and you can have multiple grandmasters,
and a grandmaster is just something that doesn't need to be getting its time from PTP.
It's like, you know, the GPS reference or something else.
You have this grandmaster, you can have multiple ones.
There's a thing called the best master clock algorithm
for the participants of PTP to determine which of them is the best one
to act as the grandmaster at any given time. And then the idea is that you multicast out
these packets to say, here's my first timestamp. And it just makes it easier on the network.
You know, as a PTP client, you just have to come on and start listening for these multicast
messages. And then you can start receiving time as opposed to you having to actually reach out
and be configured to go talk to the server. You can just sort of have less configuration and sort
of start receiving these time updates
from the Grandmaster.
Got it.
So you think it's mostly a kind of zero configuration kind of story.
It also makes it easier for the Grandmaster.
You don't have to maintain these connections.
You don't have to have all these sockets open.
You just sort of have like one socket that you're kind of multicasting out.
It's not 100% true because there's a delay request and a delay response message that's
involved in all this too.
And it's also actually kind of strange.
I think this was changed in the most recent version of PTP, but technically the way it works is the grandmaster sends this multicast message that is a synchronization message, which contains one of those timestamps.
When the client receives it, it actually sends a multicast message back that says, hey, I'd like a delay request.
And then when the grandmaster receives that, it sends out another multicast message that
says like, here's the delay response, which is like kind of insane when you think about
it, right?
Because you're kind of involving all of these other potential peers that are listening in
on the network with like your exchange.
And you can configure some of these open source projects like the Linux PTP project, which
uses a daemon called ptp4l.
You can configure it to do a hybrid model where it receives the sync message as a multicast message.
But then since it knows where that message came from, it just does a unicast delay request
and then delay response, which makes a lot more sense.
Yeah. The base behavior you're describing sounds pathological, right? Essentially quadratic.
You get everyone sending a message to everyone. That is not usually a recipe for good algorithmic complexity. I'm not sure why it was designed
that way. It could be that the original people were sort of thinking that you'd have these smaller
domains where you have these boundary clocks where sort of you're multicasting, you're sort
of limited to how many people you're talking to. But I kind of agree. The sort of default behavior
seems a little crazy to me. And that's why in our case where we are using PTP, we're using it in a
small area of the world. We have it configured to do that hybrid thing where the actual sync
message comes in multicast, but the delay request and the delay response wind up being unicast.
There's a major thing that I haven't touched on here, which is that like NTP, I mentioned it
before, you have multiple servers and you kind of have this built-in notion of redundancy, right?
Where you're sort of comparing all the times from the different servers and you're trying to figure
out which of them are false tickers, right? And so any of them misbehave,
the protocol kind of has this built-in notion of trying to call them out and ignore them.
With PTP, if we're talking about the single grandmaster, that would be a GPS appliance.
And unfortunately, we have found black box GPS appliances to be less than ideal. It would be
fine if you're just talking about a straight failure scenario, right? We have a GPS appliance.
Maybe we have two of them.
They have agreed amongst themselves
who is the grandmaster.
One of them goes offline.
The other one notices that.
It picks up, starts broadcasting.
Like that would be
a perfectly fine world
and I wouldn't be too concerned about it.
But the thing that we've seen happen
is like we want to perform maintenance
on a GPS appliance
because its compact flash card
is running out of life
and we need to actually replace that.
When you go to shut it down, it happens to send out a PTP packet that is like complete
crazy pants, like just absolutely bonkers, makes no sense whatsoever.
The timestamp is off the charts.
And we've had GPS appliances do things like that.
And so part of my thinking through this was, you know, geez, at the end of the day, I really
don't want to be pulled back to a single GPS appliance that is providing time to potentially
large swaths of the network.
Because if it goes crazy, there's no real provisions in PTP for finding the crazy person.
Everybody will just follow those crazy timestamps wherever they lead.
At least for a while.
It sounds like there's a way of eventually deciding someone else is the right one to pay attention to, but it means for short periods of time,
you may end up just like listening to the wrong guy. I'm not an expert in exactly what's involved
in the best master clock algorithm, but I thought what was in the best master clock algorithm was
simply about how good your clock was. And so if you were sitting there saying, I have the best
clock, it's fantastic. But then you were telling people the completely wrong time because you had
some kind of a bug or misconfiguration. You would continue to operate
in that mode indefinitely. That's fascinating. What it sounds like is, despite the fact that
PTP is newer and in some ways shinier and in some ways having fundamentally better capabilities for
some aspects of what it's doing, it also just threw out a bunch of the careful engineering
that had gone into NTP over a long period. Because NTP has significantly more robust ways of combining clocks than what you're describing for PTP.
Yes, that was my interpretation of looking at all this stuff. It feels like we threw out a lot of
the safety, and that makes me super nervous based on my experience with these devices.
So here we are. We have an NTP solution that's not working well enough,
and a PTP solution that's kind of terrifying. So where'd you go from
there? So we're trying to build a proof of concept. At the end of the day, we sort of figured, all
right, we have these GPS appliances. We talked about hardware timestamping before on the GPS
appliances and how they can't hardware timestamp the NTP packets. So that's problematic. We thought,
how can we move time from the GPS appliances off into the rest of the
network? And so we decided that we could use PTP to move time from the GPS appliances to a set of
Linux machines. And then on those Linux machines, we could leverage things like hardware timestamping
and the NTP interleaved mode to move the time from those machines onto machines further downstream. The NTP interleaved mode, just to give a short overview of what that means,
when you send a packet, if you get it hardware timestamped on transmission,
the way you use that hardware timestamp is you get it kind of looped back to you as an application.
So I transmit a packet, I get the hardware timestamp after the packet's already gone out the door.
That's not super useful from an NTP point of view, because really you wanted the other side to receive that hardware timestamp. And so the
interleaved mode is sort of a special way in which you can run NTP, in that when you transmit your
next NTP packet, you send that hardware timestamp that you got for the previous transmission.
And then the other side, each side can use those. I don't want to get into too much of the details
of how that works, but it allows you to get
more accuracy and to leverage those hardware timestamps on transmission.
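Here is a toy sketch of the bookkeeping behind that idea; it is not the real NTP interleaved-mode wire format, just the notion that the hardware transmit timestamp for one packet rides along in the next one.

```python
from dataclasses import dataclass

# Toy illustration of the interleaved idea: the NIC only hands the
# application the hardware transmit timestamp *after* the packet has left,
# so the sender stashes it and ships it to the peer inside the next packet,
# where the peer can pair it with the receive timestamp it already recorded.

@dataclass
class InterleavedSender:
    pending_hw_tx: float | None = None  # hw timestamp of the previous send

    def build_packet(self, payload: str) -> dict:
        return {"payload": payload, "prev_tx_hw_timestamp": self.pending_hw_tx}

    def on_hw_tx_timestamp(self, ts: float) -> None:
        # Looped back by the NIC/driver once the packet is on the wire.
        self.pending_hw_tx = ts
```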
I see.
And this was already built into existing NTP clients, this ability to take advantage of
hardware timestamps by moving them to the next interaction.
That's not a thing you had to invent.
Nope.
It's existed for a while.
And I think even with the reference NTP implementation, it can leverage timestamps taken at the driver
level to do something similar.
But chrony adds the ability to actually leverage hardware timestamps in the same fashion, sort of sending them in the next message so that they can calculate a more accurate difference.
Because hardware timestamps are a relatively new invention of all of this, right?
When NTP was designed, I don't think there was any devices that did hardware timestamping.
I think that is true.
And as I was saying before, when this all first came to fruition, the things that supported hardware timestamping were PTP
specific. Okay. So now you have an infrastructure where there's a bunch of GPS devices, a layer of
Linux servers, which get time via PTP from those GPS devices, and then act as NTP servers for the
rest of the network.
So maybe I missed this before, but why does that first layer have to use PTP rather than NTP?
The major reason is that the GPS appliances, apropos of what we were just talking about,
the GPS appliances will hardware timestamp their PTP because they have dedicated cards for it,
but they don't hardware timestamp their NTP. So the quality of time that you're getting off of the GPSs, if you're talking NTP to them, like if you just remove the time servers and you have the clients
talk directly to the GPS appliances, for example, it's just going to be a lot lower quality.
And to be honest, I don't know if they support the interleaved mode of NTP. Like it's not
something I ever really dug into. It sort of goes back to that black box thing of like, well,
we can configure this thing in such a way that it spits out hardware timestamp PTP and be relatively confident that it's doing that job. But anything more esoteric
gets a little dicey. Got it. And you solve the false ticker problem by basically having each
Linux server that's acting as a kind of NTP server, marrying each one of those to an individual
GPS device. So if that GPS device is crazy, then the Linux server will say crazy things,
but then things internally on the network
are going to, in a fault-tolerant way,
combine together multiple of these Linux servers
and be able to use the high-accuracy way
of getting data from those servers.
That's exactly right.
We constrain the servers that any given client can pick
using various sorts of knobs within chrony
because we want to meet certain requirements.
And so we would like
to ensure that any given client is going to talk to its local NTP server, as opposed to one that
is say, you know, 600 mics away somewhere. Because as soon as you go to talk to that one that's 600
mics away, you introduce a lot of potential error. And so what we do is we force the NTP clients to
talk to their local servers. And then we also configure them to talk to a bunch of other
servers, which are sort of too far away to get very accurate time. But
we use them just as you described to sort of call out and figure out if either of the two local ones
have gone crazy. If both of the two local ones have gone crazy, well, we're kind of out of luck.
How well did this all work in practice?
It worked surprisingly well. So sort of designing the system, coming up with a system that can do
this stuff and remain fault tolerant and all that is one thing. But then there's also the other thing of like, show me that you're within 100 microseconds of UTC. So that required understanding, well, what are the errors? And that comes back to the sort of asymmetry question and understanding things like if the NTP daemon is not accepting updates from its server for whatever reason, because maybe it thinks the server is crazy or because it thinks that the samples it just took are incorrect. Like maybe you had a popcorn spike or something like that. It'll do
a thing where it'll basically increment a number that represents its uncertainty about how much it
might be drifting while not getting good updates from its server. And so you kind of have to add
together all these different errors. You have that one, you have the known error introduced by
the actual time daemon, what it knows about how far off it is.
And then you have that round trip time divided by two that I mentioned.
So if you take all that, add it together,
you have to do a similar thing for the PTP segment I mentioned.
And then you have to add on the 100 nanoseconds for the GPS that I mentioned.
You can add all that together.
Most of our servers, we can show that our error is absolutely no worse than about 35 microseconds.
Most of the time, assuming not some extenuating
circumstances.
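Here is a sketch of the kind of worst-case bookkeeping being described; the term names and the example figures are illustrative placeholders, not Jane Street's actual numbers.

```python
# Sketch of the worst-case error bookkeeping: every hop in the chain
# contributes a term, and the attestable bound is their sum. All figures
# below are made-up placeholders in microseconds.

def worst_case_error_us(round_trip_us: float,
                        daemon_reported_error_us: float,
                        drift_allowance_us: float,
                        ptp_segment_us: float,
                        gps_receiver_us: float = 0.1) -> float:
    asymmetry_bound = round_trip_us / 2   # unknowable path asymmetry
    return (asymmetry_bound
            + daemon_reported_error_us    # what the daemon thinks it is off by
            + drift_allowance_us          # growth while updates were rejected
            + ptp_segment_us              # GPS appliance -> time server hop
            + gps_receiver_us)            # ~100 ns from GPS itself

print(worst_case_error_us(round_trip_us=30, daemon_reported_error_us=5,
                          drift_allowance_us=2, ptp_segment_us=3))
# -> 25.1 microseconds for these placeholder inputs
```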
So a design choice that we made in this whole thing was your best bet for getting good time
to clients is to have a dedicated network for it, right?
Dedicated NIC, dedicated network, have it be quiet, nice, no interference.
But that's expensive and annoying and nobody really wants to do that.
It's expensive in a few ways.
It's expensive in the physical hardware you have to provision, but it's also just expensive
in the net complexity of the network, right? I think
there's a lot of reasons why we want to try and keep the network simple. And having a whole
separate network just sounds like a disaster from a management perspective.
Agree, right. So I was like, I really don't want to go down that road. So we sort of said like,
well, let's see what happens. And so I was just saying, you know, most of the time we can attest
that we are better than 35 microseconds, you know, 35 mics of error, worst case. But there are situations where you can cause that to fall
over. You can, for example, we have some clients that don't support hardware timestamping. They're
just older generation servers. They don't have NICs with hardware timestamping. If on those
things, you just saturate the NIC for five minutes solid, you're probably going to get your error
bounds outside of a hundred mics, just going to happen. But on newer machines that do support hardware timestamping, you can do things like that
and still stay within 50 mics of UTC, which is pretty cool.
Some of this is built upon the fact that we know we have a very smart networking team.
We're confident in the ship that they run and the way our networks are built and that
kind of stuff.
And that lends something to the not wanting to build a dedicated time network.
And we think we can get by without it.
So that's sort of where we ended up.
Around 35 mics.
I want to say it's 35 to 40 mics for systems that don't have hardware timestamping on the
client side, and closer to 20 mics for systems that do have hardware timestamping on the
client side.
And as I mentioned, the systems that do have hardware timestamping on the client side are
kind of more robust to problems, you know, to just things that
people might do.
You know, maybe somebody's debugging something and they want to pull a 10 gig core dump off
of a machine.
They're not thinking about the timestamping on the machine right now.
Like they're focused on their job to try and actually figure out what happened with that
system.
So the other aspect of all of this was reporting on it and showing it, right?
How do we surveil to show that we are actually in compliance?
And so for that, we took what we think is a relatively reasonable approach, which is we sample. So there's kind of no interval at which you can
sample which is small enough if you want to be like absolutely sure that all the time you were never out of compliance, right? You could say, well, what's a reasonable sample? Every five seconds? Nah, that's definitely too much. Okay. Every one second, maybe that's fine. Every hundred
milliseconds, right? Like, so where do you stop? So we sort of decided that for machines that go out of compliance,
it is likely that if we sampled every 10 seconds,
we would pick them up
because it's not like there's these crazy perverse spikes
that sort of jump in there and then disappear.
It is more like somebody's SCPing a large file
across the network
or something is misconfigured somewhere.
And then therefore it is a sort of persistent state
that sticks around for a while.
So we sample every 10 seconds,
pulling out these sort of various stats I mentioned about what represents the total error.
And then we pump that into effectively a big database all day long.
And then at various times throughout the day, we surveil that database and look for any
systems that are not sort of meeting their time obligations.
We hold different systems at different levels of accuracy.
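A rough sketch of what that kind of sampling loop could look like; `chronyc tracking` is a real chrony command, but the parsing, the max-error formula (|system offset| + root dispersion + root delay / 2), and the `record_sample` sink are assumptions for illustration rather than the actual production tooling.

```python
import re
import subprocess
import time

# Sketch of a ten-second sampling loop: ask chrony for its current tracking
# statistics, fold them into a single worst-case error figure, and hand that
# to some database for later surveillance.

def chrony_max_error_seconds() -> float:
    out = subprocess.run(["chronyc", "tracking"], capture_output=True,
                         text=True, check=True).stdout
    def grab(label: str) -> float:
        # Lines look like "Root delay      : 0.000123456 seconds"
        return float(re.search(rf"{label}\s*:\s*([-\d.]+)", out).group(1))
    return abs(grab("System time")) + grab("Root dispersion") + grab("Root delay") / 2

def record_sample(error_s: float) -> None:
    # Hypothetical stand-in for writing into the monitoring database.
    print(f"{time.time():.0f} max_error_us={error_s * 1e6:.1f}")

while True:
    record_sample(chrony_max_error_seconds())
    time.sleep(10)
```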
So after all of this, not that I want to call this into existence, but imagine
that there's a new version of European regulations, MiFID III comes out and says, now you have to be
within 10 microseconds. Assuming that technology is still as it is now, what would you have to do
to get the next order of magnitude in precision? Not this. So I think probably you'd want to go to something like PTP,
but probably not just PTP directly. There's a thing called White Rabbit, which is kind of like
some PTP extensions, basically. I think it might actually be completely formalized in the most
recent PTP specification. But that is a combination of roughly PTP with synchronous Ethernet. So
synchronous Ethernet allows you to get syntonization across the network. So you can sort of make sure that the frequencies are the same.
Can I ask you what the word syntonization means?
Just basically means that those two, the frequencies are in sync. So it doesn't
mean that we have the same time, but it means that we are sort of advancing at the same rate.
I see. So there are techniques essentially that let you get
the rates the same without necessarily getting the clocks in sync first.
Correct. And it's my understanding that White Rabbit sort of uses this idea that you can have the
rates the same with PTP to work out some additional constraints that they can solve and get sub
nanosecond time synchronization.
I think we would have to put a lot more thought into the reliability and redundancy story.
I sort of discounted PTP because it didn't necessarily have the best reliability redundancy story. It's not to say we couldn't have figured out a way to make it work for us. We almost
certainly could have. You could have two grandmasters, one sitting there as the primary
doing its normal operation, one sitting there as a standby. And if the primary one goes crazy for
some reason, you could have some automated tooling or something that an operator could
use to take it out of service and bring the secondary into service and only have
maybe a minor service disruption. I can imagine us doing that work, but given the problem we were
trying to solve, it seemed like not necessary. We can solve this problem using this existing
technology, but I do think if we had to go much lower, like you said, an order of magnitude lower,
we'd have to start looking at something else. Well, thank you so much. This has been really fun. I really enjoyed
learning about how the whole wild world of clock synchronization is knit together.
Well, thank you very much. It was a pleasure being here. It's a pleasure talking about these
things. It's fun to try and solve these interesting, challenging problems.
You can find links related to the topics that Chris and I discussed,
as well as a full transcript
and glossary at signalsandthreads.com. Thanks for joining us, and see you next week.