Two's Complement - Sequence Locks
Episode Date: October 26, 2024
Matt talks about a work thing, called a sequence lock. Ben suggests some dumb ideas about that work thing. Then our hosts discuss how to starve a reader, anger the Gods of Volatility, and invoke Sylvester Stallone.
Transcript
I'm Matt Godbolt.
And I'm Ben Rady.
And this is Two's Complement, a programming podcast.
Hi, Ben.
Hey, Matt.
How are things?
Not too bad.
Cool.
Well, unusually for us, we have a topic before we started today, because yesterday you and
I got coffee, and I got excited, and then we both were like, I wish we'd recorded this,
so we're going to try and have that same conversation again, the same level of excitement and enthusiasm,
which is a bit of a struggle for us both.
We're not very passionate about what we do. We're not normally enthusiastic people, no, so we'll have to try.
Every moment of this podcast has been pure pain, I'm just going to admit that. Up until now I've been keeping it hidden, up until this very moment, this very second, and now it's revealed to the public.
Yeah, I mean, we've already told everyone that this is an extremely low-effort podcast.
We go out of our way to make it as simple as possible for ourselves.
I still like the fact that we do transcripts and that we do edit them a little bit.
I try and make my dog not make a noise in the background.
Good luck.
I know.
He's just started moving.
I can hear him.
We'll see how this goes.
But today, I wanted to talk to you about some
cool programming stuff that I've actually been doing. You know, after our last conversation where we said, hey, it's kind of cool to be a programmer and not be a people manager, that has come to pass in the last few days. So let me phrase it as a puzzle to you. Okay, so these are the constraints that you have. You're writing a piece of code where you have many threads, possibly many processes. There are some important, critical pieces of information that you need to share between these various programs, and it's imperative that at no point is the producer of this information stopped or blocked by anything.
Okay.
Yeah.
You only need the most recent output.
So imagine it is, let's say it's the current S&P 500, whatever the S&P 500 is,
something's happening,
you're measuring it,
and then you're writing the value to somewhere
that many readers are then going to read from.
But obviously, this information is big enough
that you can't just write it
in the hope that everyone sees the output.
Right.
Exactly as you wrote it.
And it's many bytes.
It's many bytes, right, exactly, yeah. So how do you design a system where you can share small, but not trivially small, pieces of information between threads, where you never block the writer? The readers need fast access to it, but you don't mind if they have to do a bit of extra work. So
there are actually many solutions to this. But what comes to mind?
Well, I mean, you haven't mentioned anything about the latency constraints.
Correct. Yeah.
So given the fact that there are currently no latency constraints, I can think of many
solutions. Well, actually go on then. So let's think of a latency constraint where we don't
block the writer. At no point do we block the writer. So think of, well, what can we do?
So going back to one of our previous episodes on building boring things, files, you could use a fan out queue. You could use something that if you don't care
about sort of like, if a receiver of this information is expected to handle a situation
in which it's falling behind by just losing information.
Right, which is absolutely the case.
And we only care about the most recent value, for the readers.
You could use some sort of network broadcast to just broadcast it out.
And it's like,
if you're listening and you get this,
that's cool.
And if you don't,
you don't.
Which has some,
some of the same properties as a fan out queue,
but isn't,
isn't exactly the same thing.
You know, those are probably the top three, again modulo no latency constraints. That's where I would probably start.
If you drop RAM constraints and things like that, you could just have an append-only queue, and then the readers only look at the most recently written value.
That's what I was saying, my first one was files. Yeah, like, here's a file, and the messages are fixed length, so start at byte zero and read, and when you hit the end of file, try again.
Right. And if you want to read the latest, and you don't care about the intermediates, which is perhaps something I didn't state very well, then you seek to the end of the file. And if it's not a multiple of the object size, then you know that you've kind of caught the writer while it's still writing. So you mod it down, something like that.
And that gives you-
That'd be really dumb and would work.
Exactly, yeah.
And all of those things, I think you could make work.
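(A rough sketch of that file-based idea, for the curious. The record layout and names here are invented for illustration, and this ignores the caching and durability questions that make it dumb-but-workable rather than a real design.)

```cpp
#include <cstdint>
#include <fstream>
#include <optional>

// Hypothetical fixed-length record appended to the file by the writer.
struct Record {
    std::uint64_t timestampNs;
    double        price;
};

// Read the most recent *complete* record: seek to the end, round the file size
// down to a whole number of records ("mod it down", as discussed), and read the
// last one. A trailing partial record means we caught the writer mid-append, so
// we simply ignore it.
std::optional<Record> readLatest(const char* path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return std::nullopt;
    const auto size  = static_cast<std::uint64_t>(in.tellg());
    const auto whole = size - size % sizeof(Record);
    if (whole == 0) return std::nullopt;  // nothing complete written yet
    Record r;
    in.seekg(static_cast<std::streamoff>(whole - sizeof(Record)));
    in.read(reinterpret_cast<char*>(&r), sizeof r);
    return in ? std::optional<Record>{r} : std::nullopt;
}
```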
But yeah, to get back to the obvious part here: we don't have infinite RAM, files are probably not a good choice, and we're looking for the lowest possible latency. What is the fastest possible way that you could share this information? Again, the writer probably is going to be updating this very frequently, it's, you know, some derived financial thing, and then
the readers from time to time need to be able to look at it and kind of say, what is the current price? Maybe it's even like a web view of like, what is the current S&P
500? And you're like, okay, I don't care how long it takes my web service to read the current value
out, but it is a consistent value. It's not like half of-
Yeah. You don't get half the bytes.
That's right.
Wow, market looks cheap today.
Yeah, that's a very bad mistake that you make usually only once in a career.
Yeah.
So yeah, one of the solutions to this problem is to use a relatively unusual and fun construct
called a sequence lock.
It's used in a bunch of places
in the Linux kernel. So it's kind of well-worn and well-trusted, but it has some interesting characteristics. So let me explain to you what a sequence lock is, and then we can talk about
why it's kind of difficult to get this right and why I've spent a week trying to write one
and I'm not sure whether or not I have actually got it right. So the trick is this. If you are a writer, well, first of all,
the piece of information is in a shared piece of memory, and it's the same piece of memory for everyone. Like, we're not using it like a queue. It's just one place to look for a particular value. So, this is the S&P 500, and it lives at this address in some piece of shared memory.
Okay.
So the readers and the writers in this context have this shared memory.
So they both have access to this memory,
which means they're at least running on the same machine.
Correct.
Yes, that is right.
They may not necessarily be in the same process,
but they're running on the same physical computer.
That is correct.
For this, yes.
The same, well, yeah, virtual computer or whatever. I mean,
modulo, there are like, you know, tricks to do RDMA and things like that across networks and
stuff, but no, what we're talking about here really genuinely is a same computer,
shared RAM chips somewhere in the, in the system. Right. And it's the same address.
So, and as well as storing the piece of information that we want to protect,
which, as I say, is... well, if it's 8 bytes, we probably wouldn't worry about this.
If it's 16 to 32-ish bytes, that kind of size.
We also store a sequence number.
Every time the writer goes to write, it increments the sequence number.
It updates the data inside the block that's being protected.
And then it increments the sequence number again. We'll get to why that's important. Anyone who's listening who knows roughly what I'm talking about here has already seen all the problems with the things we've just talked about. But let's talk about what the reader does. The reader reads the sequence number. If it's odd, the reader knows that the writer is in the process of updating the data, so the reader says, I can't look right now, and it just waits a bit and tries again. So what we've got is a very optimistic kind of attempt here. If it's even, we take a copy of the protected data, but that doesn't mean to say that it wasn't being updated by the writer while we were copying it.
Right.
So we've just read a bunch of data.
We can't look at that data yet.
What we're going to do is we're going to look at the sequence lock again.
And only if the sequence lock, sorry, the sequence number matches the first sequence number,
when we read it, then we know that we got a good copy of the data that was being shared. And now we can
look at it. It's safe to look at it. We know that the writer wasn't in the process of updating it,
nor did the writer get in and do an entire update while we were taking a copy of the information.
Obviously, if the sequence number changes, we have to retry.
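(A minimal sketch of the protocol as just described, with invented names and using std::atomic with its default, conservative settings. The memory-ordering refinements and the by-the-book data-race caveat on the payload come up later in the episode, so treat this as the shape of the thing rather than a finished implementation.)

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

template <typename T>
class SeqLockedValue {
    static_assert(std::is_trivially_copyable_v<T>);
public:
    // Single writer only. An odd sequence number means "update in progress".
    void write(const T& value) {
        seq_.fetch_add(1);   // now odd: readers know to back off
        data_ = value;       // rummage around updating the payload
        seq_.fetch_add(1);   // even again, and two higher than when we started
    }

    // Readers: optimistically copy, then check nothing changed underneath us.
    T read() const {
        for (;;) {
            const std::uint64_t before = seq_.load();
            if (before & 1) continue;               // writer is mid-update: retry
            const T copy = data_;                   // can't trust this yet
            std::atomic_thread_fence(std::memory_order_acquire);  // finish the copy before re-checking
            if (seq_.load() == before) return copy; // unchanged: the copy is consistent
        }
    }

private:
    std::atomic<std::uint64_t> seq_{0};
    T data_{};  // the plain copy of this is formally a data race; see the discussion later
};
```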
Right. So what happens if you have the slowest reader in the world, it's like a dude with an abacus. And every single time
you go to read the sequence number a second time, it's different. Then you are in trouble, right?
You're just never getting this data. You're just not fast enough. You need to be faster.
That is correct. Yeah. So in terms of testing, on the machine that I've got, I can get it to the state where, if I have a writer that literally goes back to back doing modifications, then it is possible to completely starve out the reader. Occasionally the reader was able to sneak in, but this is with a pathologically fast writer, right? And on purpose, what we're talking about is a tiny data structure, where you're saying
that these are bytes that I'm going to like
mem copy out of shared memory.
So that like, you're kind of hoping that like,
even if you're going to do something slow with it,
you're going to do something slow with the copy
that you got, the good copy that you got of the data.
And so it shouldn't matter too much,
but it is possible.
That is absolutely a trade-off you're making here.
The writer can
starve out the readers, but computers are super fast. And provided your writer isn't doing
something pathological, like sitting in a tight loop doing an update, even the loop overhead
is enough to give the reader a decent chance of sneaking in from time to time,
at least on a computer timescale. But it's absolutely possible. Yeah. And if you care about these things, you probably need to make your API not retry or have
a maximum number of retries and sort of fail for the reader and say like, hey, this didn't,
I tried a thousand times, it didn't happen. And then the reader could make a choice about it.
So it's an interesting thing here. So again, the writer never has to block. The writer is able to just
write unconditionally. So for example, if the writer is processing a stream, a torrent of
incoming UDP market data, as we do in our world, then it's imperative that you never block because
if you block for any length of time, you might miss some information on the network and then
you have to go through an expensive recovery process. So it has a great property for the writer, and for the reader it has a bit of a tax: it's not free to do this check. But if the proportion of times the reader and writer are both looking at the same thing at the same time is effectively de minimis, which hopefully it would be, then essentially it's free on both sides. It's a cool data structure. My understanding is it's used in the Linux kernel to protect some of the data around the system time. So the kernel has some configuration information about, you know, how fast the clock's running and all that kind of good stuff. It's bigger than the CPU can atomically switch out.
And it is mapped into every process
so that when you do get time of day,
you don't actually have to do a system call.
You can just read these numbers out
directly of a piece of memory
that's read only in your process.
Yeah.
But the kernel can update it
when NTP is fiddling with the times
or when there's overflows in some of the counters,
the hardware counters that you need to use
to actually work out what the current time is. But yeah, for the most part,
the kernel doesn't touch that information, but if it does, it needs to do it in a way that's safe
for all the processes to read from, and they don't mind doing a retry. So I may be butchering the
memory of exactly where that's used, but it's something, something like that. It's a, it's a
cool thing to do. So. Hmm. So this reminds me, there's a Sylvester Stallone movie.
I don't know if it's Judge Dredd.
Whoa, I was not expecting it to go this way.
Yeah, I think it's Judge Dredd.
It might be Demolition Man.
I don't remember which.
I've only seen one of those.
I think it's Judge Dredd.
But there's a robot walking around selling food and
it's recycled food, and there's this jingle playing, it's like, recycled food is good for the environment and okay for you. And so this data structure to me sounds like: this data structure is good for the writers and okay for the readers.
That's basically it. Again, though, it does depend on the duty cycle of the writers and readers. Even at the scale that we run at in finance, there are many microseconds in between packet updates, often single-digit microseconds between packets, which is more than enough for even the slowest reader to get in and get a chance, even if they're delayed for a microsecond or so. So it's usually a win-win situation. It's fun, it's fascinating, it's interesting. But the real trick comes from the fact that everything that both the compiler and the CPU want to do on your behalf to optimize your code is at loggerheads with this very strict ordering of
events. So as I described to you, the writer absolutely has to bump the sequence number first.
Then it needs to change all of the data that it's doing. And it's got a certain amount of time to
rummage around changing the data. Again, it's blocking the reader all the time it's doing this.
Yeah. And the sequence number is odd while this is happening, which is a signal to the reader that it shouldn't be reading.
That's correct, yes.
And this only allows for a single writer.
There are extensions that relatively trivially let you have multiple writers
for what it's worth.
And then at the end, once you finish rummaging around
and modifying and mutating your data, you need to bump it again
to both make it even again to signify that it's okay to start reading it
and to be two higher than when you started, so that anybody who got a copy in between you modifying it also notices and knows to retry. Right. But those steps absolutely have to happen in that order. And for folks who are used to regular programming languages like, you know, Python and the like, you write a sequence of statements.
They absolutely do happen in that order.
There's not anything particularly exciting.
If I said A equals 1, B equals 2, A equals 3,
at no point could you pause the CPU and observe that A was three and B was two, or whatever. Sorry, that's a really bad example, and it needs a picture for me to even get it right in my head, but there's nothing funky going on there, it just does things one after another. But for an optimizing compiler, it's very convenient to be able to play fast and loose with all of the operations that you're doing. If you can pull
forward something that's expensive that you know you're going to do later on, then you can start
like a big divider or multiplier or whatever while you're doing some other setup work. And then when
you go to look for the result of the divide, it's ready for you. Hooray. Compilers want to do this
all the time. So they want to be able to move the instructions around. And you need to be able to
tell the compiler, no, you can't move these
otherwise unnecessary looking memory operations. Right, right, right. Is there anything that you
can use to indicate that you're writing into memory that is shared, potentially shared with
another process so that the compiler knows it's like, I'm doing this for the side effects on
purpose, in a way?
There is. I mean, specifically in terms of C++, which is where I'm strongest in this regard, one way to model this is to use the volatile keyword. That is not a good way, but it is a way. Volatile is on the path to being deprecated, because nobody can really describe what it's for or how it works. But what it essentially says is: don't make any assumptions about this memory location. When you're writing a device driver, if you're writing some very low-level stuff, then having a volatile pointer to something that maps to, I don't know, a temperature sensor, so that when you read from it, you're reading the temperature.
It's not real memory.
It's just a thing, and that's a great candidate for something to be volatile.
It says like, hey, this can change outside.
Every time I say to read or write to this,
you can't optimize about it.
There's nothing you can do.
Just do it.
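(A tiny, entirely made-up illustration of that device-driver use of volatile; the address is hypothetical and would really come from the hardware's documentation.)

```cpp
#include <cstdint>

// Hypothetical memory-mapped temperature sensor register.
volatile std::uint32_t* const TEMP_SENSOR =
    reinterpret_cast<volatile std::uint32_t*>(0x40001000);

std::uint32_t readTemperatureTwice() {
    // Because the pointer is volatile, the compiler must perform both loads;
    // it cannot assume the value is unchanged and fold them into one.
    const std::uint32_t first  = *TEMP_SENSOR;
    const std::uint32_t second = *TEMP_SENSOR;
    return (first + second) / 2;  // crude two-sample average
}
```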
So that is a way to do it.
But most compilers will see the keyword volatile
and then start slowly backing away from your code.
They homer into the bushes.
They homer into the bushes.
Well, certainly the optimizer does.
The optimizer says, oh, there's a volatile here.
So as not to anger the gods of volatility, we're going to not do anything here. Right. So in your very critical code, you might... arguments obviously exist for not using volatile for this, for other reasons, which we'll get to in a second. But specifically for shared memory, where another process outside of the program you're compiling could be changing it, there's an argument that says volatile is the right way to model it. So the short answer is, for shared memory like this, there isn't really a good way for C++ to say, hey, this bit of RAM outside of this program's remit will change under your feet. So the best we can do is model it with atomics. And atomics have a whole bunch of things that come with them. One of them is atomicity at the hardware level: if I read and write to an atomic variable, then it either happened or it didn't happen, you'll never see half of it written. You know, if you've got an eight-byte or four-byte sequence number, there's no world in which I write that and somehow only the first two bytes of that four-byte value have been written.
That's what, yeah, at the basic level, an atomic operation is.
And the compiler will collaborate and generate the correct CPU sequences, or it will make it an error: hey, you've got a 128-byte structure, we can't do this.
Yeah, you can't guarantee that.
So, that being said, actually what it will do is use a spin lock under the hood for bigger things. But there's a way of saying please... you can assert statically, at compile time, hey, can I do this without some kind of lock being taken out? Because obviously the whole point of this is not to have a lock, so it'd be really tragic to have one sneak in. But all modern CPUs will let you write four bytes, and probably eight bytes, and, with some very big caveats and footnotes, 16 bytes using some tricks. You can do those atomically provided they're aligned, provided they're inside a cache line and don't straddle two cache lines, all those kinds of things. So we can assume we can do that, but it's only a small amount of data that we can do it with.
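(The compile-time check alluded to here exists in standard C++ as is_always_lock_free. A small sketch; the Price struct is invented, and whether the 16-byte case is lock-free genuinely depends on the target and compiler flags.)

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>

struct Price { std::uint64_t bid, ask; };  // hypothetical 16-byte payload

int main() {
    // An 8-byte counter should be lock-free on any modern 64-bit target;
    // if it weren't, the "never block the writer" plan would be off.
    static_assert(std::atomic<std::uint64_t>::is_always_lock_free,
                  "sequence counter would need a hidden lock");

    // A 16-byte atomic may or may not be lock-free (on x86-64 it typically needs
    // cmpxchg16b); if it isn't, the library quietly falls back to an internal lock.
    std::cout << "atomic<Price> lock-free: "
              << std::atomic<Price>::is_always_lock_free << '\n';
}
```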
The other thing about the C++ memory model is that atomics also model the fact that this could be being written or read from other threads. Now, obviously, if we're talking about multiple processes, it's a bit of a stretch to call them threads, but it's the best we've got right now; if you want to share between processes, it's the only tool I have to work with here. Now, atomics are funny, because reading or writing atomically to memory doesn't necessarily mean the compiler isn't going to reorder things. It just means that you're literally writing to this memory location. So back in the example of the A equals one, B equals two, sorry, A equals three thing,
you could make those all atomic.
And I'm saying this is not C++ atomic.
I mean, atomic in terms of like the CPU
can read and write these memory locations
and it either happens or it doesn't happen.
And without telling the compiler,
oh, but you couldn't like elide the fact
that A equals one ever happened.
You can't throw the A equals away
because you're about to overwrite it
with a three later on.
That's a separable thing.
But in C++, they're sort of bound together.
And so when you say something's atomic,
you're also saying something about the operations
that can happen with the memory system,
where the memory system encompasses
both the compiler's ability to
move things around and, more critically, the CPU's ability, which we'll talk about in a sec. So in the example of setting A to one, setting B to two, and then setting A to three, if they were all atomic,
the compiler wouldn't throw away the a equals one at the top. Right. Because it's like, no, this had some semantic meaning
to something I can't see.
Right. Yeah.
At least by default, it won't do that.
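(A tiny illustration of that point, with invented names: with plain ints the compiler may delete the first store as dead; once the variables are atomic, it has to assume another thread could legitimately observe the intermediate value.)

```cpp
#include <atomic>

int plainA, plainB;
std::atomic<int> atomicA{0}, atomicB{0};

void plainWriter() {
    plainA = 1;  // dead store: the optimizer is free to remove it entirely
    plainB = 2;
    plainA = 3;
}

void atomicWriter() {
    atomicA.store(1);  // must be emitted: another thread may see atomicA == 1
    atomicB.store(2);
    atomicA.store(3);
}
```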
The default in C++ is to use an incredibly expensive,
sequentially consistent view for atomic operations,
which means that the atomic operations and the code around them
happens in the one true order,
even if you have multiple
threads doing it. Which is expensive: on some hardware that makes it an expensive operation to serialize. You could imagine that, in order to make sure that everybody sees A equals one before anyone sees B equals two, you have to add in instructions, or some other sort of handshake, a way of saying all threads need to make sure the caches are all consistent across all the threads, and now I can write to B, and then we go again: hey, has everyone got the B equals two, if you care about it, and so on. That can be a very expensive operation. And so there are these different ways of interacting
with atomics where you can give sort of a secondary semantic meaning that's less strict
than this sequentially consistent view. Like for example, you can say this is an acquire.
So when you read, when you store a number in, you could say this is an acquire operation. I am
acquiring a lock is the way I like to think about it.
And then when you finish with it, you release it.
And again, it's not like this has any meaning.
We're not actually acquiring any locks or anything like this, but it has the semantics of a release operation and an acquire operation.
And what do I mean by that?
Well, if you're taking out a lock, it's a pretty big hint that anything after that
is under the purview of that lock.
Right. So if you've got ten lines of code and line five takes out a lock, lines six, seven, eight, and nine can't be shuffled above the lock, because otherwise you'd be doing things that you were trying to protect outside of the locked area. Yeah. And then similarly, the release says nothing above the release can come south of the release operation itself, because again you want to protect things that are bounded by an acquire and release pair. Yeah. There are some other aspects to do with different CPUs and different threads that are much more subtle in this respect, but that is sort of the gist of it. And then you can also have a
relaxed operation, which says, I need this to be an atomic operation, but I don't really make any guarantees about which CPUs might see which values, as long as they're still atomic. That's more like the traditional, original atomic
that we were talking about, where it's just make sure this happens atomically at the hardware level,
but imposes no other restrictions on the order of things. So it's easy to think about it from
where I'm sitting in terms of acquire and release. And so you could imagine that in our case of our writer here,
we write the first incremented sequence number with an acquire. We do all of our rummaging around, and then we write the second increment with a release. And it kind of tells me, and my processor, that everything I did between adding one and adding two to the original number, the first increment and the second increment, is, for all intents and purposes, a lock. Yeah.
And that makes a lot of sense, right? Yeah. Unfortunately, C++ has some awkward, very specific-to-C++ things that make that not quite true, but that's fine; logically, that seems to make sense. And on the reader's side, it's a little bit more difficult.
You're definitely reading to acquire both times.
You're acquiring a lock, lock, inverted commas.
Then you're reading out stuff.
And then you're reading it again.
But again, it's a little bit more subtle than that because the second thing is also a release in a way.
Because if you did get a good copy, you need to make sure that nothing of the copy leaks to the other side of the acquire. It's easy with a picture, but it's subtle, and it's difficult to get right.
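(To make that concrete, here is how the writer and reader of the earlier sketch might look with the orderings spelled out. This follows the common open-source seqlock formulation, relaxed and release stores plus explicit fences, rather than Matt's actual code, and it still carries the by-the-book data race on the payload that gets discussed a little later.)

```cpp
// Writer (single writer only): the release fence keeps the payload stores from
// drifting above the first bump; the release store on the second bump keeps them
// from drifting below it.
void write(const T& value) {
    const std::uint64_t s = seq_.load(std::memory_order_relaxed);
    seq_.store(s + 1, std::memory_order_relaxed);   // odd: "update in progress"
    std::atomic_thread_fence(std::memory_order_release);
    data_ = value;                                  // the rummaging around
    seq_.store(s + 2, std::memory_order_release);   // even again, two higher
}

// Reader: acquire on the first load so the copy can't start early; an acquire
// fence after the copy so it must finish before the re-check that validates it.
T read() const {
    for (;;) {
        const std::uint64_t before = seq_.load(std::memory_order_acquire);
        if (before & 1) continue;                   // writer busy
        const T copy = data_;
        std::atomic_thread_fence(std::memory_order_acquire);
        if (seq_.load(std::memory_order_relaxed) == before) return copy;
    }
}
```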
That's why it took two solid days and lots of testing, and that brings us to the next bit here. It's somewhat straightforward to explain it without a picture and without a whiteboard, in terms of this sort of hand-waving about the things compilers are allowed to reorder.
It's much, much harder to prove that you got it right.
Yeah, right.
The output of the compiler may or may not encode this optimization barrier that you've put in. All you've said to the compiler is: hey, by putting this acquire in here, I'm saying that no read that happens after this acquire can be reordered by the compiler north of this read, or whatever the acquire operation is. But I can only get sort of circumstantial evidence by looking at the disassembly of it and going, it doesn't seem to have done anything wrong. I can't tell that, in general, it is not able to do that, because it depends which program I put it in. If I put it in a big program with lots of inlining happening and all that kind of stuff, I don't know if I have hinted to the compiler correctly that this is in fact the right thing to do.
So that was a real challenge.
And we talked a little bit about testing these kinds of things before.
Yeah.
Testing multi-threaded code in general is not easy.
No, exactly.
Yeah.
It's very difficult.
And all you can do is kind of develop a certain amount of confidence that you haven't got something egregiously wrong.
Right. I mean, I think it's the same thing with looking at your compiler output, right?
The best that you can do in that case is see that you were wrong. But if you fail to see that you
were wrong, you don't know that you're right.
That's exactly it. Yeah. So you tend to do, and I don't know if this is true of the sort of decompiled analysis,
but you can do a lot of things where you give yourselves lots of opportunities to see that
you're wrong.
Run the code on different architectures, run it with different sets of data, run it multiple
times, over and over again.
Yeah, different compilers, different compiler settings, I'm sure. Yeah. And you sort of look for all of the possible opportunities to prove yourself wrong, and then you fail to prove yourself wrong, and then you just kind of give up and admit that you might be right.
That is exactly the approach that I've been taking, which is why it's taken quite so long, apart from just reasoning about it in the first place.
And it's interesting you mention architectures there, because one of the gifts to programmers like myself of the Intel domination, in server architecture at least, is a very strong memory ordering. I'm sure we've talked about it before, and certainly anyone who's seen any of the nonsense that I've put on the internet knows that I love the kind of amazing tricks the CPU is pulling off under the hood. And one of its real trump cards, which has caused problems along the way, is its ability to reorder and speculate instructions out of the sequence that you gave it.
Very much like the compiler itself does, the CPU is able to find flows and sequences of instructions and reorder them to take better advantage of the fact, hey, I have a spare multiplier going now.
But look, I can find a multiplier that's like 300 instructions in the future. And I know what
inputs are going to go into it. So I might as well start it now. All those kinds of tricks.
It's amazing. But when it comes to loads and stores to memory, it doesn't reorder those.
Or if it does, it does them in a way where it can track if it got it wrong and it can redo them
later on. It takes a bonkers amount of silicon to do, but it does mean that, in general, if I go load, load, store on one thread, and on another thread I go... no, it's so difficult to explain this without a picture, but the sequence of operations will make sense. They're effectively, for most intents and purposes, with some exceptions, sequential, even in the presence of these kinds of atomic-like operations where you've got thread A doing one thing and thread B doing another thing. It's very hard to catch it out doing the wrong thing. So for the most part, even the naive code that's got it wrong, in air quotes, won't be wrong on an x86,
because the compiler, short of it doing its own optimizations,
the code that it generates will run in the boring, dull order
of loads and stores and increments and things that you gave it.
That is not true in general of every other CPU on the planet because Intel
love to be backwards compatible.
And presumably the very first time they went multi-threaded,
they went,
Oh,
this seems like a useful property to have unaware that it was going to
hamstring them for like the next 25 years.
But ARM CPUs, RISC-V CPUs, MIPS, and anything else, nothing else that I'm aware of, has such a strong memory ordering.
That is to say, if you do a bunch of stores and loads on one CPU,
they could be reordered if you were able to observe them on another CPU.
Yeah, yeah, yeah.
Right.
Obviously, they're not willy-nilly reordered
to make your program wrong on a single-threaded case, right?
That's not what I'm saying here. I'm saying, you know, hey, if you do load, load, store, the CPU won't reorder loads and stores in a way that, from where you're sitting, gives you the wrong answer, but you might see something weird and wacky on another CPU. I think the cases are sort of: if you remember the value of A on one thread and then write B to be one, and then on the other one you write A to be one and then remember the value of B, you can see a case where they both observed A and B to be zero, and then they both wrote one to the other. And there shouldn't be a way that they could do that, right? Again, this needs paper, this is terrible podcast material, I'm sorry. So anyway, on those architectures, the atomic operations, and this acquire and release, or acquire-release, type semantic thing that you give to them, is not just a hint to the compiler to say, compiler, you're not allowed to move these things around.
It tells the compiler to put instructions or flags on the instructions that say, hey, CPU, you're not allowed to do reordering.
And so one of the other things that we were doing while we were working on this was we were deliberately,
although we don't run on any architecture other than x86,
we were using the presence
and the particular semantic interpretation of the presence
of these serializing barriers or fences
or loading instructions that have like particular properties
about reordering that can happen.
As a litmus paper, a litmus test for not only do we observe the code to be correct on x86,
but we observe the kinds of loads and stores to be the right kinds of loads and stores
on ARM that says, hey, the compiler knows that it absolutely has to have let this increment
happen before any of the following code completes. But again, it's not-
This is sort of like telling somebody something and then being like, okay,
now explain it back to me. I suppose it is.
Right. And to making sure that they understood what you said. It's like,
you're going to put these things in there and see if the compiler reorders things in a way that doesn't make sense? And you're like, well, okay, maybe we don't, we're not on
the same page here about what instructions need to be ordered. Well, that's absolutely it. Now,
obviously there's still no guarantee that that's right. And it's still no guarantee that while the
compiler has emitted those correct instructions, that it still wouldn't move things around above and below
those instructions because it feels it has that degree of freedom. I can't introspect into the
compiler and look at each instruction and say, given free rein, would you move this up further?
Right. Yeah.
But that would be an interesting thing to do. So it has been an interesting few days. I mean,
and the performance you get out of this is pretty astronomical. It's like three on X86, it's, you know, single clock cycles to read and write to something that is,
you know,
16 bytes long.
Now for something that's eight bytes long,
you can just atomically write the whole thing and be done with it.
And you don't need any kind of sequence lock,
but if you've got 16,
24,
32,
even up to like,
you know,
48 bytes,
single cache lines worth of work,
this is still essentially free,
free.
It's all relative in our world. So it's been a super fun thing, to the point where, actually, you know,
someone was reviewing the code and saying, you know, well, you say it's a single-threaded writer; can you detect the case where there are two writers and, you know, throw an exception or blow up or something like that? And the answer would be yes, I can, but it would add something in the region of twice the overhead, because even the check and the read and the out-of-band jump is actually noticeable at this level. It's pretty minimal, don't get me wrong, but also, what are you going to do at that point? You know, it's a bit late if you've got two writers. But it's nice to have as a debug check, so that's what we've done: we put it in as a debug check. But it's been a really fun and interesting journey, and in doing so we've
discovered a whole bunch of interesting other techniques that you can use, that have different trade-offs between back pressure on the writer versus readers having free rein. You know, the classic example of something that's a little bit
more fair for the readers is a reader's writer lock,
like an actual lock in this instance, where the lock is an atomic value and the bottom 16 bits or
the bottom 32 bits are how many readers are currently reading. And the top bit or however
many bits is, is there a writer? And then the readers, in order to acquire the lock, have to do some work here: they basically try to atomically replace the value with one higher. And if it fails, as part of the atomic exchange operation that CPUs can do, you get back what the actual number was, and you can see whether or not the writer was there. And if the writer was there, you know, all bets are off. But if you were able to increment it, okay, then you've now let the writer know that there's at least one reader.
And so that's what all the readers, and you can have like 10 readers all reading from the same
thing and that's fine. They can all be reading from it. And then the writer has to do the other
thing where it says, I need to be able to replace a zero. Like literally there are no readers and
no writers with a top bit set value. And if I'm able to atomically do that operation, then I've
locked out all of the readers and I've got the lock, and I can spend my time writing to it.
Yeah. So just to clarify here:
you say, you know, you've got these two halves of this: the upper bits are the number of writers and the lower bits the readers. You're saying these are the number of things that are currently reading or currently writing?
Correct. Currently reading. Not the logical number of readers and writers in the system. No, no, no. This is a count of what's happening right now, which is what allows this to work. And so, yes.
Yeah, yeah. So at that point, what you're doing is you're actually
preventing the data race. So that's an aspect we glossed over earlier is that strictly by the book,
taking that copy of the data while it was unprotected by any lock is not allowed by the
C++ standard. It's an undefined behavior to read a value that's being updated.
Even though we don't look at it,
unless we know that we got a good copy,
that's not okay by the C++ standard.
And there's some papers out there that are trying to canonify it
and come up with ways to make it okay to do it.
But we're ignoring that for now.
But yeah, coming back to this thing,
in this particular instance, no,
we are actually preventing any kind of data race because the readers do take out the lock for the tiny amount of time that
they're taking their copy or working on it, and then they decrement the lock. So it's a number that... no, it's not the number of readers, it's the number of readers that are currently reading and holding the lock. But obviously you can have as many readers as you like reading from it. And that's potentially useful, again, if it's some kind of shared piece of data. The Linux kernel kind of example would be,
hey, this is configuration information about which time zone everyone's in. You know, there are many processes that are getting the time of day all the damn time. We want to allow them to make progress; most of the time they don't have to lock each other out. Right. But there's only one writer, and it's the kernel, and every now and then it needs to be able to go in and say, hey, someone just did a syscall and changed the time zone, or whatever kind of shared information it is. Of course, the problem with that is that the readers, if there are enough of them, can easily starve out the writer. You can get a situation where the writer can never actually get in.
Oh, right, yes.
And so then it gets more sophisticated, where you start using more bits to say, hey, there's a pending writer, and then the readers aren't allowed to take the lock if there is a pending writer, even though...
And so on and so forth.
So these things are complicated, but they're fun.
They're fun and interesting.
But at least that case doesn't suffer from the data race,
the sort of strict data race by the book version there.
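(A sketch of that counter-in-one-word readers-writer lock, including the pending-writer bit that stops readers starving the writer. The bit layout and names are invented, and it assumes a single writer, as in the kernel example.)

```cpp
#include <atomic>
#include <cstdint>

class RwSpinLock {
    static constexpr std::uint32_t WRITER  = 1u << 31;  // a writer holds the lock
    static constexpr std::uint32_t PENDING = 1u << 30;  // a writer is waiting
    std::atomic<std::uint32_t> state_{0};                // low bits: readers reading right now

public:
    void lockShared() {
        for (;;) {
            std::uint32_t cur = state_.load(std::memory_order_relaxed);
            if (cur & (WRITER | PENDING)) continue;  // writer active or waiting: stay out
            // Try to bump the reader count; failure means the word changed, so retry.
            if (state_.compare_exchange_weak(cur, cur + 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
                return;
        }
    }

    void unlockShared() { state_.fetch_sub(1, std::memory_order_release); }

    void lockExclusive() {
        state_.fetch_or(PENDING, std::memory_order_relaxed);   // warn off new readers
        std::uint32_t expected = PENDING;                       // i.e. no readers left
        while (!state_.compare_exchange_weak(expected, WRITER,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
            expected = PENDING;                                 // wait for readers to drain
    }

    void unlockExclusive() { state_.store(0, std::memory_order_release); }
};
```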
But yeah, testing this has been essentially an exercise
in all the things we talked about,
looking at the disassembly, writing some simple-ish tests.
And I think in one of our earlier episodes,
you made a comment about these kinds of tests
have to be something like,
you have to have seen it work once.
Oh, yeah.
Yeah, yeah, yeah.
Right.
So I had made the argument in an earlier episode that, you know, write all the unit tests you want, and I do. But if you've never actually seen this thing do the thing we're talking about, you're passing up an opportunity to prove yourself wrong.
And moreover, it's one thing to say, okay, I wrote this, you know, blocking queue, and I pushed a bunch of data into it, and I pulled a bunch of data back out.
And I've tried that a number of different ways with different data sets and different
architectures and all these things trying to prove that it doesn't work.
But if you never even tried the base case of like, did it ever work at all once?
Right.
And that's just, you're just being silly.
That's true.
You're being silly on purpose. So, "have you seen it work once", I think, is more applicable to things that are more mundane, that are, you know, pretty easy to get confidence that they work with
just unit tests. And the only question is, do you have enough unit tests? Right. It's like,
if you've never seen it work once, then, you know, you don't really, there's no reason to
believe that you have enough unit tests essentially. Right. Yeah.
But this is a whole other thing,
right?
This is like,
you can see it work a thousand times and still not be confident that it's going to work the thousandth and first.
Right.
That is,
I think the material difference.
Yeah.
And you got to get much more creative about how to create that confidence.
Because unit tests alone and a little bit of exploratory testing,
you know,
watching it work one time or two times or four or five times is not going to be enough.
Yeah.
No, it's really interesting.
So when I put it out, we had four different engineers review the...
Yeah.
And this definitely passes the test of...
In general, I don't write comments in my code.
I think I've said this before.
I prefer to have explanatory names for things, where it makes a lot of sense. And so, you know, rather than manifest constants, I mean, a lot of people feel this way, you put sensibly named intermediate values or variable names, and then it explains itself, because essentially the last line is a piece of English prose. Right. Yeah, it's great: return interest rate times whatever, you know, there you go.
They don't have to explain it.
But this code has somewhere in the region of 50 lines of non-comment code, and it's 200 or 300 lines of header, right? The ratio is obscenely the other way, because there's a huge setup explaining how the general thing works. Then above every single line, there's my reasoning: why it's okay to have it in this particular sequence and order, why these flags are set, which other lines of the code it interlocks with, in terms of what the other thread might be doing. But yeah, so anyway, it went to review, four engineers looked at it, and all of them so far have said, yes, looks good to me. But one of them brought up a really interesting point, which was, you know, there are formal verification systems
for multi-threaded programming that you can use
where you describe in a sort of DSL
the operations that you're doing
and then a sort of theorem prover runs
and shows that there isn't a window of opportunity
where things are wrong.
And so, for example, you know, if you're designing,
I don't know, the Paxos protocol
or the Raft protocol or whatever, then this is the kind of thing, if you can explain your
protocol algorithm to the system, it can say, yes, I can prove that there exists
no opportunity for this to give you the wrong answer. And that was a really interesting case,
interesting observation, but unfortunately they don't run on C++; they run on the description of the system. Right, right. And that's where I fell over, in terms of, I don't know that this is going to help me here, because I, quote, know that sequence locks work in theory, because they're in the Linux kernel, and there are five different open source implementations that have wildly different atomic operations in them.
And so I was like, no, let me try and reason this from first principle so that I can defend why we're doing it this way.
But I would have no confidence that any explanation of what I had done to a theorem prover was a good representation of what the C++ semantics were.
Right, right.
Yeah.
It's the difference between the algorithm and the implementation.
You can prove that the algorithm is correct using the theorem prover. Proving the implementation is correct is much more work. Yeah. And the other thing is, and we mentioned this at the start of the podcast, we were like, well, assume no latency constraints, and this problem gets a lot simpler. When you add those in... you know, you were talking about the sort of ratio of code to comments.
Yeah, yeah, yeah.
It's like you have to it's like the square of the comments now because you have the dimension of thread safety and the dimension of latency.
And you're doing both of those things at the same time in the same code.
Right.
And that's going to make it much, much, much harder.
Yeah. Right. And that's going to make it much, much, much harder.
Yeah.
Because,
you know,
the, the cycle time,
at least it's like,
okay,
we figured out some way to improve the performance by another 30%.
And it's like,
okay.
And how do you know that this hasn't completely broken your threading
model?
Well,
we don't.
So we get to do all of that.
That's the thing.
I think,
you know,
like one loses the,
the warm, fuzzy feeling
that like you can touch this code in any way
and not go through some like non-automated handoff.
You know, obviously I will check in the test
and the smoke test that runs for five seconds
and make sure that like nothing overtly stupid happens
will remain and I'll leave that in.
But I can't, or I don't want to, countenance having, like, oh yeah, this is it: we run it through, you know, Compiler Explorer effectively, I've actually been using the site to do something for a change, through, you know, here's six different architectures, and then there should be no LDR of this variable before the LDAR, that's the serializing load, blah, blah, blah, you know, that kind of stuff.
That one could go as far as that,
but I don't know that that wouldn't be so brittle
that it would, yeah, I don't know, actually.
Maybe that is the way to do it.
Well, the golden rule of all of these things,
and especially if you have something
that is this like thread safety and latency sensitive,
is design it in a way where you won't have to change it.
Right?
Like you're going to do all this upfront costs
and all this really hard work to get this code
just exactly the way that you want.
Make sure that you encapsulate it really well
so that the chances of you having to change it
are as minimized as you can make it
so that you can just kind of leave it there.
And then, you know, go into Git five years later and be like, yep, this code hasn't changed in five years, and it's been working for those five years. That's as confident as I'm ever going to get in any piece of code like this, that it
actually works. So please don't change it. Right. That is a very, very good observation. In fact,
one of the, one of the review comments was I had had an interface that was slightly more
open-ended, and you could do some cool things with it, because, hey, you know, rather than just replacing the value, I was like, well, the writing thread can at any time read it; it knows no one else is changing it, because it is the only writer. So you could actually update it any time you like, by looking at the current value and maybe incrementing it rather than replacing it. And then one of the reviewers was like, yeah, I'll be a user of this, and I will never want to do that. And so I took out the interface, and
just like, okay, you've got get and you've got set.
And maybe,
just maybe I will have get, but I promise
Scouts Honor that I am the
writing thread and as
such, I'm allowed to get it without any kind of
lock promise, promise, promise function
that will allow
the other access type but
Yeah, it's been an interesting few days, and I figured it would be a fun thing to talk about. I said to you this was probably going to be a short one; I've just looked at the timer, and if we haven't lost absolutely everybody with my attempt to describe multi-threading memory ordering problems, then I don't know what will get rid of you, listeners.
We're trying our best here.
Yeah.
No, no, this has been fun.
Thank you for letting me talk about my tech project.
Oh, no, this is super cool.
I love this stuff.
I'm a big fan.
Well, maybe next time we'll talk about
something a little bit more human.
I don't know, we can talk about more of this.
I'm all right.
I'm done with that.
That's cool.
Actually.
Yeah,
we have,
we have done a lot of humaning recently.
Yeah.
Right.
Probably do a bit more tech content,
although,
you know,
to the extent that folks have contacted us and we encourage you,
you know,
you can contact us at the,
the,
the Mastodon, hachyderm.io, or whatever it is.
And,
or email,
you can find our emails.
We always love to hear from
our listener and you know folks were very you know positive about some of our more
human focused and less tech focused stuff but you know we can get back to tech as well
You know, I take the philosophy that the human elements of software are inseparable from the technological elements. Yeah, they are sort of one and the same thing.
It's hard to see that, but it really is true.
And if you want a great example of this,
take a code base that was written by other human beings
and give it to a different set of human beings
and then watch them.
Oh my gosh.
And if that doesn't convince you
that technology and people are intimately intertwined,
I don't think anything will.
We really do need to stop,
but Kate Gregory has an amazing keynote speech
where she talks about the empathetic programmer
or something like this.
And it's like all about that.
It's like, you know, you should have empathy
because you know who's going to be reading this code.
It's going to be your friends, your colleagues,
and also probably yourself.
Yes.
Future you who has slept since then
and has no idea what
any of these things do.
Exactly. All right, we should stop now, but this has been fun. Thank you very much, and I guess we'll see you next time.
Yep, until next time.
You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt. Find the show transcripts and notes at www.twoscomplement.org.
Contact us on Mastodon. We are @twoscomplement@hachyderm.io.