CppCast - Meltdown and Spectre
Episode Date: January 11, 2018

Rob and Jason are joined by Matt Godbolt to talk about the Meltdown and Spectre vulnerabilities and how they affect C++ programmers. Matt is a developer at trading firm DRW. Before that, he's worked at Google, run a C++ tools company, and spent over a decade in the games industry making PC and console games. He is fascinated by performance and created Compiler Explorer to help understand how C++ code ends up looking to the processor. When not performance tuning C++ code, he enjoys writing emulators for 8-bit computers in JavaScript.

News:
More C++ Idioms
C++ Tips of the Week (Abseil)
Retpoline: a software construct for preventing branch-target-injection
GCC 8.0 supports std::filesystem now

Matt Godbolt:
@mattgodbolt
Matt Godbolt's blog

Links:
Compiler Explorer
CppCon 2017: Matt Godbolt "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid"
GOTO 2016: Matt Godbolt "Emulating a 6502 system in JavaScript"
GOTO 2014: Matt Godbolt "x86 Internals for Fun & Profit"
Patreon: Matt Godbolt is creating Compiler Explorer
Finding a CPU Design Bug in the Xbox 360
Meltdown and Spectre
Vulnerability Note VU#584653

Sponsors:
Backtrace
Embo++

Hosts:
@robwirving
@lefticus
Transcript
Discussion (0)
Episode 133 of CppCast with guest Matt Godbolt, recorded January 9th, 2018.
This episode of CppCast is sponsored by Backtrace, the turnkey debugging platform that helps you spend less time debugging and more time building.
Get to the root cause quickly with detailed information at your fingertips.
Start your free trial at backtrace.io slash cppcast.
CppCast is also sponsored by Embo++.
The upcoming conference will be held in Bochum, Germany
from March 9th to 11th.
Meet other embedded systems developers
working on microcontrollers, alternative kernels,
and highly customizable zero-cost library designs.
Get your ticket today at embo.io. In this episode, we talk about C++ tips and file system support in GCC.
Then we talk to Matt Godbolt, creator of Compiler Explorer.
Matt talks to us about the Meltdown and Spectre attacks. Welcome to episode 133 of CppCast, the only podcast for C++ developers by C++ developers.
I am your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
Doing okay, Rob. How are you doing?
Doing okay. Starting to settle more into the new year. We had that crazy blizzard scenario on the East Coast last week. It didn't hit North Carolina too much; we got like an inch of snow, but of course in North Carolina just that was pretty paralyzing for days. And North Florida, Tallahassee, had snow for the first time in 27 years or something ridiculous like that.
Yeah, pretty nuts.
Yeah.
And meanwhile, I have a balmy 60-degree day today coming up in Denver.
That's unexpected.
Yeah.
I guess we took all the cold weather for you.
Okay, well, at the top of our episode, I'd like to read a piece of feedback.
This week, we got an email from Dimitri.
And Dimitri writes, thank you for the podcast.
Last time, you mentioned patterns as a potential topic.
And Nicole mentioned some idioms when you asked her about patterns.
As a coincidence, I was looking at More C++ Idioms, which is an online book on wikibooks.org,
and found it a very nice initiative to describe idioms as a fully available book,
especially taking into account that nowadays knowledge about C++ is often transferred by naming some of them.
So it might be worth mentioning. And I wasn't really familiar with Wikibooks, but it does look like this is a pretty good free resource available,
and we'll put that in the show notes.
Interesting.
Yeah, I'm familiar with Wikibooks.
I was not familiar with this particular one.
The list of idioms is pretty long.
They have a total of 91 C++ idioms,
and a lot of these don't immediately look familiar to me.
So, yeah, there's a lot there.
Yeah.
Some of them do have to-dos listed on them, though.
So I guess this book is still somewhat a work in progress.
Yeah, and I think this kind of thing accepts user contributions.
Right, right.
Well, we'd love to hear your thoughts about the show as well.
You can always reach out to us on Facebook, Twitter, or email us at feedback at cppcast.com.
And don't forget to leave us a review
on iTunes. Joining us again
today is Matt Godbolt.
Matt is a developer at trading firm
DRW. Before that, he's worked at Google,
run a C++ tools company, and spent
over a decade in the games industry making
PC and console games. He's
fascinated by performance and created Compiler
Explorer to help understand how C++
code ends up looking to the processor. When not performance tuning C++ code, he enjoys writing emulators for 8-bit computers in JavaScript. Matt, welcome back to the show.
Hi, thanks for having me back again.
So, have you done any other 8-bit emulators in JavaScript other than jsbeeb?
Yes, my first emulator actually was a Master System, the Sega Master System. I don't know if you guys had the 8-bit version, you know, sort of around the same time as the NES. Yes, that was my first console, and I wrote the emulator. I actually originally wrote an emulator in ARM assembly for the Archimedes, which was a popular home computer back in the '90s. And I wrote it so that I could re-complete my favorite game ever, which is Wonder Boy 3. So I completed it under emulation, again, having completed it on the original, and then I was later able to re-complete it for the third time in my web browser, using my own emulator that I'd ported to JavaScript. So yeah, I had a lot more spare time back then.
You know, to be fair, I realized as you were saying that,
that I know the master system from when I lived in Europe,
but I don't know if it was popular or known in the US at all,
but Rob's too young to know.
I've never heard of it.
It was a disappointment to me.
I remember I saw, like, the coin-op of Altered Beast, and my mistake was looking at the box art, as it was back then.
You look at the box and you're like,
wow, this looks just like the coin-op.
And I got it, and it was not just like the coin-op at all.
It was pretty terrible, as you can imagine.
But, you know, those were the times.
I know we definitely got a version of the Master System in the form of the Game Gear.
Right. Yes, of course. Yeah. I think it had a few extra bells and whistles, but essentially it was a Master System with, I think, an FM chip for the sound, which I think the Japanese variant of the Sega Master System had as well. So, yeah, this is a deep, dark hole we can go into for many hours if we're not careful.
Well, the main reason I brought it up is I'm actually curious if there's any version of your jsbeeb talk on YouTube.
There are, I think, two different versions of it.
Okay.
Because, yeah, I gave one as an extra content talk at CppCon this year, which was like an updated version of it.
Which is the one I saw, yeah.
That's right, yeah.
Yeah, for the two older versions, if you just search for jsbeeb, I think you'll find them on YouTube.
There's one I did at my company,
and there's one I did for the GoTo conference.
Again, sort of just explaining roughly how it works
and all of the weird and wonderful things
you discover along the way
and just how much power you can get out of 3,000 transistors
that power the whole thing.
It's amazing.
Absolutely amazing.
Well, I do recommend,
if anyone hasn't seen that talk yet,
to go watch it.
It was the one that you gave at CppCon
was a lot of fun.
I'm like, I'm thinking,
this should have been recorded.
Glad to hear there's other versions of it online.
So, Matt, we have a couple news articles to discuss. Feel free to comment on any of these, and then we'll start talking to you more about various things.
Okay, sounds great.
Okay, so this first one is: we talked about the C++ Tips of the Week that have been going around at Google for years
when we had Titus Winters on the show right after CppCon 2017.
Yes.
And I'm not sure, is this the first official release of them going online
or is this just an updated list of the new ones?
Did you see that, Jason?
I believe this was a "and we have released more" announcement that came up recently.
Okay. So, yeah, these are the Abseil C++ Tips of the Week, and I'm guessing you're pretty familiar with these, Matt, since you used to be a Googler, right?
Yes. Some of these are familiar; I think some of them are after my time there. I noticed that episode one was string_view, which I do remember. I think the Google equivalent back in those days was called StringPiece, and it was used pretty much throughout the code base as a convenience thing. It's one of the first things that I redeveloped when I left the company and needed to solve the same problem myself. So, yeah, they're great. And the thing about these tip-of-the-weeks is that they are short and to the point, and they are often the kind of thing that you were like, oh yeah, I remember that vaguely, but I'd never really looked into it. And they're great to print out and, like, stick around the place.
So I think it's modeled vaguely on Testing on the Toilet, which was an idea they had to try and promote all of the ideas of testing your code, where on the inside of the stalls they would print out, stick up, and replace sheets weekly, like: here's a tip for testing your code.
And I think Tip of the Week is a similar kind of thing.
You know, you could print it out.
If you have a C++ shop, you know, replace them once a week in the cubicles.
And then you've got something to look at when you're taking a moment to yourself.
Seems almost a little intrusive somehow.
I'm not sure what's the right word I want there.
Well, it's one of those more interesting things
because, of course, the bathrooms are one of those places
where even visitors would have to visit at some point.
So, you know, there was a limit to what they could put in these sheets
because of potential IP violations: other people going in, taking a leak, and then discovering the secret sauce of Google.
That's crazy. Anyway, we've brought the show down.
Okay, this next one we have is also coming from Google, and this is Retpoline, which is a software construct for preventing branch target injection. And I guess we're going to be talking more about this topic with you, Matt, because I think you've dived in kind of deep into this Spectre attack that came out over the past few weeks. But this is a Google project for mitigating branch target injection?
Right. So, yeah, as you say, we'll probably chat about this a little later. It's a pretty
complicated thing, but ultimately it's a way of making the compiler emit code that is less
susceptible to the kinds of attacks that we've been seeing in Meltdown and Spectre. So if you're
writing a piece of code that has privileged access to memory (maybe it's a kernel, maybe it's a JIT that does some kind of sandboxing of the rest of the process, you know, like your browser is doing: it's preventing the JavaScript that you're executing from being able to see the passwords that are living in the same address space as your JavaScript), then perhaps it's worth compiling with this flag, -mretpoline, which I believe has landed in Clang already,
to mitigate some of the attacks.
And we'll, as I say, try to explain a bit more later,
which will be hard without a whiteboard,
but we'll see what we can do.
I'm sure we'll come up with something.
I do want to point out that the first example there, they call it a common C++ indirect branch. No compiler would actually compile this as an indirect branch, I'm just saying, because every compiler can totally trace this code and would inline the indirect virtual function call. But, you know...
Yeah, yeah.
Okay. And then the last thing we have is: GCC 8 now officially supports std::filesystem.
And the commit just came out a couple days ago.
So that's pretty exciting that they're moving so quickly on C++ 17 support.
Yeah, this was mind-blowing to me because I would have sworn that GCC already had file system support in there.
It did, but only in the experimental branch.
And so you have to go to std::experimental, and there's a whole bunch of things. I mean, we've used it in a couple of new pieces of code that I've done, because I'm just, you know, dipping my toe in. And certainly, whatever is on cppreference does not match what was in std::experimental::filesystem. I don't know who was right and who was wrong in that, but hopefully std::filesystem is more stable now.
Yeah, the main difference that I've seen come up recently,
I think on a Reddit discussion,
was how the path joining works.
If you're joining an absolute path
to the tail end of another path, what happens?
Apparently, I guess the final decision was that
if the right-hand thing is an absolute path,
then it throws away the front thing,
because if it's an absolute path,
then you want it to be an absolute path.
But anyhow, apparently that's one of the things
that changed between experimental and the final release.
Okay. I wonder if the stuff that was in GCC differs, because the thing that I recall being different is actually the handling of "is this a file that is actually at the end of it", or "get the attributes of this file", those kinds of things. The path manipulation aspect of it is, you know, almost like text processing, and then there's the "is this a directory, is this a file, what attributes does it have, is it executable" kind of level of thing, which seemed to be not as cppreference described. So I'll be interested to know who was right and what we're going to get with the filesystem that's actually in the standard.
Right. Yeah. And it's funny you mention "is this a file", because any file system operation is inherently a race. Is this a file? Well, it was, at the exact moment that you asked me to check. Now it's a directory.
There's no way of knowing.
Okay, so let's start talking a little bit more about Spectre and Meltdown,
which we briefly mentioned there.
I know I didn't read a lot of tech news over the holiday break,
and then when I started paying attention to Twitter again, I just saw people were angry at Intel.
So I kind of read a little bit after the fact what was going on.
But could you give us the breakdown of what Spectre and Meltdown are, Matt?
Yeah, sure, sure.
So there's these two papers that have come out around about the same time
that were under embargo, I think, for at least six months.
People have tracked it back even further than that,
that the various parties who were involved have known about it.
And then Google chose to release information on both Spectre and Meltdown
and all of their conversation around it.
But they basically use some of the features that have been added to processors in the last 20 years to make them go faster: caching, out-of-order execution, speculation, and branch prediction. All those things together get around various security checks inside the processor. So let's first of all talk about Meltdown, because Meltdown is potentially the scarier of the two. It's potentially the easier, in inverted commas, to work around and fix with a software patch.
And indeed, if you are running anything on AWS or Google Compute, I believe all of those instances have already been patched.
And there are performance aspects to the patch, which we can talk about in a second.
So, yeah, let's talk about Meltdown. Meltdown is an attack where, if you try to read a piece of kernel memory from a user, unprivileged-mode process, obviously you're going to get a page fault; normally you'll get a segfault, right? If you happen to know the address at which some protected piece of memory in the kernel is, and you try to read from it in user space, then you get a segmentation fault, and then your process gets killed, right?
Makes sense.
Now, you might ask why you could even read it at all,
why you would even know that it has an address in your process space,
and that's because normally when you want to switch
in and out of the kernel mode,
like you're opening a file or accessing network or whatever like that,
you need to go into kernel mode, and the kernel wants to be able to read both your memory and its own memory, and so they're actually mapped into your process at all times. It's just that some of it is marked as: you can't read this from user mode, and you can't write to this from user mode, as opposed to it being not there at all. Okay? So it's a convenient speed-up for the kernel. It makes going in and out of kernel-mode calls much, much quicker, and everyone's happy, right? And of course, the processor guarantees that user mode can't read or write to these mapped-but-inaccessible pages of kernel memory, so no problem. Excellent.
Except that inside a processor, every instruction runs, effectively, in a sequence determined by the interdependence of instructions, right? So the processor can issue a whole bunch of things at once if it can prove that they are either not dependent on each other, or not dependent on other instructions which haven't yet completed. All right, let me try and think about this a better way. A load instruction is going to read from memory, and it's going to make the result available to the next instruction that depends upon it. If the check inside the processor for "is this a valid piece of memory that I'm allowed to look at" happens sort of asynchronously, then obviously there's a window of opportunity between me reading the data and the fault going off and saying, whoa, you shouldn't have been able to read that piece of data. Now, if something's in
the level-one cache, Intel wants to make it as fast as possible for you to access it, so it doesn't want you to wait for all of the access checks to clear before it reads the memory out of level-one cache. Okay, so maybe one cycle and you've got the memory out of L1 cache, and some background process on the CPU is going to take two or three cycles to determine: oh, wait, I've just checked the page table, you're in user mode, that's kernel-mode memory, you shouldn't have been able to access that. And so it will mark the instruction at that point as being: whoa, when that instruction completes at the end of the pipeline, cause a fault, cause an interrupt to happen there, and throw away everything that happens after that instruction. It's as if nothing happened.
Now, this happens all the time inside the processor. So when your branch predictor gets something wrong, it has to basically undo everything up to the branch and redo from there. So the same mechanism is being used: at this point, the instruction that read some data it shouldn't have read is going to be undone anyway, so there's no harm, no foul. Now, that's cool, right? We've got a process by which the processor can take a shortcut; a bit later on, it can work out that something should not have been allowed to happen, and it can make it all go away as if it never happened.
Except if the instruction that was issued after that read of kernel memory influences the cache in some way. The processor cache itself is not state that is rolled back, and that's the key to all of these attacks. All of these attacks rely on the fact that there is a side effect that is not undone when this speculative execution or this trapped instruction happens. And so if you can come up with a way of doing something to the cache, with that speculatively executed instruction, that you can measure afterwards, you can see the ghostly remains of what had happened with the data before.
So your attack looks something like this. Clear all the caches. Call an operating system routine to ensure that the memory you want to attack is now back in cache, so the only thing in cache now is the kernel memory that was protected, that you want to read. And you can do that by calling a system call or something like that, making sure the kernel has executed and it's now in L1 cache.
And then you do something like: attempt to read that memory yourself in user mode, and then touch a piece of memory that is at a cache-line multiple of the byte that you read from the protected memory. So you have an array of, say, 65,000 bytes that's not in cache (you know it's not in cache), and you have now done an increment on the array, where the array is indexed by the value that you read back from kernel mode, times, say, 64, which is the size of a cache line. Now, you know that instruction will never actually execute, because the processor will throw it all away. But if you've done it quick enough, if it's close enough after the previous instruction that read the protected byte, then it will have brought into the cache a piece of memory whose address is dependent on the value you read from protected kernel mode. So let's say the byte you read from kernel memory was 10. You're going to read address 64 times 10 in an array somewhere. Doesn't matter what you do with it.
That read gets faulted. You catch the segfault in your program that's probing, so now you've handled the fault that would have crashed the process, and now you go and you look at the cache, and you say: okay, which of these cache lines is now in the cache? And you measure how fast it is to read the zeroth one, the first one, the second one, the third one, the fourth one, the fifth one... and when you get to the tenth one, suddenly that's really fast. You're like, oh, that's interesting. That must have been in the cache. Whatever's in the 10th element of my array is in cache, and therefore the value that I read from kernel mode must have been 10.
Okay.
Does that make enough sense? I'm sort of, it's again, with a picture, it's a little bit easier.
But the general process relies on the fact that if you can influence the cache in a way that depends upon a value read speculatively from the kernel, then after the fact, after it's all been undone, you can go and look at the cache, and from the state afterwards (you're, like, reading the tea leaves in the cup after the event) go: ah, the only way it could look like this is if I had read the value 9 or 10 or 11 or whatever. And then you can just keep doing this over and over again, and keep probing, and read basically every byte of kernel memory. That's really, really bad, because of course I'm an unprivileged-mode process; I can now just stream through the entirety of the kernel address space, which means I can read everything the kernel can see, which is every other process. On Linux, it also means that the kernel maps in all of physical memory as well, so you can just basically read the entirety of physical RAM. Which means any keys that are in the kernel, you can read; any other process, you can read. It's really, really, really bad. And it's essentially undetectable, because you're just doing something which is going to get cancelled over and over and over again, and then, like, reading the side effect.
So, that's bad. How can it be worked around? Well, there's a patch called KAISER to the Linux kernel, which stops the kernel from mapping its own pages into user space. And that pretty much fixes the problem.
Pretty much.
Pretty much.
There are some pages that absolutely have to be in the user space,
that is like the interrupt tables and things like that,
things that the CPU requires to see
that are part of the kernel's view of the world.
But it's a fairly minimal set,
and there's nothing too scary in there.
The only thing you can do with those, I believe,
is use them to determine where the kernel is living in memory.
And even that can be worked around.
And obviously, there are mitigations for other attacks, called address space randomization.
And so the kernel tries to put itself in a random spot every time
to stop people from making attacks
based on knowing where things live in memory.
The problem with this, the drawback, is that it slows down system calls, and people are seeing, in, like, general workloads, a few percent slowdown; in heavy disk-access code, somewhere between 30 and 50 percent, I'm seeing reports of. Now, obviously, there's a lot of hyperbole out there, and I had a person actually IM me just before this chat saying, hey, I'm just seeing my search queries go from one second to being more like 10 seconds, what am I doing wrong? I'm like, well, I suspect we know what it could be. So that's unfortunate, but, again, there is a workaround for it.
It's worth also noting that this particular issue only happens on Intel processors and, apparently, some ARM processors. Other processors don't allow this kind of speculation that depends on values, or they substitute in, like, a zero for the value that was read from L1 cache if it turns out that you weren't supposed to be able to read it, or they have some other performance characteristic that means that the check happens before any further instructions can use the value they shouldn't have read. So this is sort of Intel-specific; again, I've seen some reports that some ARM processors do this too.
And you can understand why; like, performance is king. They're trying to make this thing go fast.
You've got a single cycle, which is like a third of a nanosecond.
It's an insanely small amount of time to do anything,
let alone all the checks that you'd have to do
to make sure that you're supposed to be able to read memory.
So it's sort of forgivable, I think.
And the workaround seems to be reasonable.
Obviously, I think as time ticks on, we'll see more and more people having ideas for improving the speed of maybe that kernel transition. There are a number of (I mean, I know mostly about Linux in this particular topic) system calls that are sort of pseudo system calls, like getting the time of day and other sensitive things. They are actually implemented in user space, with a magical mapping of kernel code into user space called the vDSO, which is like a virtual, uh, shared object; I can't remember what the D is. And so those are unaffected by this. So calling gettimeofday or other time-based stuff is still as fast as it was before, which is good news for people like me, who like to measure how fast their code is without slowing it down as much as possible.
So, you know, that's...
All these are mitigation things you can do.
I've also seen some reports that there are other side effects to do with the translation lookaside buffer, which is used in further aspects of the caching of the memory protection hierarchy. And there is something called a PCID, a process-context identifier, that the kernel can use if it's available. And I'm seeing some reports that, if your CPU is old enough that it doesn't support PCID, the cost of mitigating this is so high, as the entire cache hierarchy has to be flushed, and all the TLBs, that it's not enabled by default. So even if you patch your system, if it's old, you should probably look into whether or not you're affected by this, if you don't have this PCID. You can do "cat /proc/cpuinfo", and if it has the word "pcid" somewhere in there, you're probably okay. I won't go any further on that.
On a significantly old system,
they're not patching it at the operating system level
because it would cause too much of a slowdown.
That is my understanding, yes.
I mean, they're patching it, but the operating system goes,
I don't have PCID.
Do you really want your processor to return to the 1960s?
You know, no.
So I maybe won't turn this on.
Like emulating an 8-bit system in JavaScript.
Exactly like that.
So anyway, that's Meltdown.
It is the more severe of the two, I think,
because it lets you read kernel memory,
but it can be worked around with the caveats that we just talked about.
Spectre, on the other hand, is more complicated, so I probably won't go into the huge details, given how much I tie myself in knots without a whiteboard, as I did with the last explanation.
Spectre uses more of the speculation.
So that is the fact that the CPU
likes to get ahead of itself.
As we know, it tries to sort of predict where it's going.
And even though instructions haven't yet completed,
the branch predictor has tried to guess where the control flow is going, and it's fetching and speculatively executing these instructions. And as we've just learned from the Meltdown effect, speculatively executed instructions leave a ghostly trace in the cache, which we can, through various nefarious ways, use to our advantage if we want to leak information from what was speculated.
So, with Spectre, there are a variety of attacks, all of which basically target the branch predictor. So imagine your JIT code. You're in your JavaScript thing; you're running my little emulator, and I'm using all these arrays to mimic the 64K of a BBC Micro, right? And my JavaScript has been JIT-compiled, because JavaScript emulators are super fast these days. You know that there's going to be a bounds check for me accessing my 64K block. So I've allocated, in JavaScript, a 64K block, and then I'm reading and writing to my 64K block, and that's going to be compiled by the JIT engine into assembly instructions that access, effectively, just a raw array in memory. Great. Except that it's going to have to bounds-check it, because it can't guarantee that I am not going to read off the end of it, and unlike C++, you know, the browser wants to make sure that you can't do that. So the code is going to look something like: if index is less than 65535, or rather, if index is less than array.size, allow the thing to happen.
Now, what happens if I access the array in bounds a million, million times, right?
A billion times.
Well, the branch predictor predicts that I never, ever, ever skip the reading code, right? The branch predictor says, hey, to all intents and purposes, this branch is never taken.
Okay, you're right. And then
suddenly I give it an out-of-bounds read. So the out-of-bounds read means the branch predictor is going to carry on with the read anyway, right? The branch will be predicted to be not taken, so the code will fall into speculatively reading outside the bounds of the array. Immediately, though, the branch will be resolved, and then it will say: whoa, that was out of bounds, undo all those instructions that you did. Right, they were all thrown away, architecturally; we shouldn't have done them, we started doing them, but, you know, hey, we shouldn't have, undo that. And you can sort of start to see where this is going now, I think.
So, if you can get enough work done... if you can tie up the array.size read for a while, by flushing it out of the cache, that means it's going to take 100 cycles to resolve the result of "how big is my array". You've got like 100 cycles to do some speculative work before the array size comes in and the compare completes, and then the branch predictor goes: whoa, I was wrong.
So the three steps are: you train the branch predictor, you force a cache miss on the array size, and then you do your nefarious work inside your code.
And there is a proof of concept inside the Google paper that shows, from JavaScript, that you can basically read every byte inside the browser process. Which is amazing, to think that they can tweak the JavaScript enough to generate the assembly that they want to generate, and then they can measure the cache impact. Because the bit I haven't really gone into is how you look at the cache after the event and kind of say, well, which value did I get out? And that involves some sensitive timing and other aspects, you know, where, if you've got access to C++, you can do cache-flush instructions and you've got a lot more control, whereas in JavaScript you have no such luxury. But it can be done, and they've proven that it can be done, and that's scary. But obviously you can only read the browser process, so that's why it is scary, but not as scary as Meltdown,
where an arbitrary user mode executable could read anything.
So there are other things inside Spectre.
So now we're talking about the Retpoline patch that we discussed in the news articles.
That's an indirect branch.
And obviously every virtual call that hasn't been inlined, as you say, Jason. Right.
Every virtual call and any call to a DLL where it's going through the PLT
involves an indirect branch.
Processors want to try and hide that away from us.
They want to make our life as good as possible.
So the branch predictor, as well as predicting whether a branch is taken or or not which is what we think of when we think of branch predictors there's a sort of separate
aspect to the branch called the branch target buffer which is used even for unconditional
branches it's like i know that there is a branch coming up um where does it go to so even before
we've decoded the instruction this is a hilarious thing like the branch predictor is like three
pipeline stages ahead of the dec predictor is like three pipeline stages
ahead of the decoders on like an Intel.
So even before it's finished fetching the memory,
reading it and noticing that there's a branch instruction
inside the bytes that it read from memory,
the branch predictor has already worked out.
Yeah, go.
Just for everyone to be clear, when you say an unconditional branch,
you mean like a jump into a function call?
Exactly, a call or a jump
So, because the front end of the processor wants to be fetching the bytes of
the instructions that it thinks are going to be executed as early as possible, it even tries to
guess if there is a branch at some arbitrary location. You can think of it as like a std::map
of void* instruction address to where I think it's going to go. And that's
kept completely independently of everything else, and indirect branches fall into this category
too. So you call memcpy, and it goes to the out-of-line version of memcpy; it's going to go
through the libc thunk to memcpy, which does an indirect jump. Memcpy is probably a bad
example, but, you know, some system-call-type thing. And in there is going to be an indirect jump. And your CPU has said:
hey, you've called this function quite a lot, so I know that jumping to here effectively just means
I go over to the implementation of memcpy, and everyone's happy. The problem is, people have
worked out now how the branch target buffer works, and they've realized that they can poison it by doing indirect jumps
elsewhere that just happen to land in the same effective cache line of the branch target
buffer, to mistrain the branch predictor to say: hey, if you jump to this particular address,
then actually this is where you're going to go, you're going to go to this area of memory.
And if that area of memory has a useful sequence of instructions, one that has
a useful speculative side effect, then you can train the branch predictor to speculate
that sequence of instructions whenever you call one of those functions. And of course, if you can
then train the branch predictor, from a piece of memory in the
kernel, to jump to another piece of memory in the kernel that has some useful side
effects for you, you can start looking at what's going on inside the kernel from your own
code. So the trick here is: mis-teach it that, hey, when you call memcpy, you go to
this address, and then call
memcpy from, like, a kernel context, and then observe what happened. Again, everything resolves
correctly. The branch predictor eventually goes, whoa, I went the wrong way, reverses and
carries on. But by then the damage has been done.
And so this affects not just Intel.
That's correct. So that affects pretty much every out-of-order
or speculating processor out there,
which is almost everything that's been made in the last 20 years.
I'm seeing some reports that the Raspberry Pis
are not susceptible to this because they are strictly in order.
Yes.
But, you know, it's pretty scary.
And obviously it has these side effects of
compilers having to implement workarounds, like the retpoline thing.
So I guess we should talk about what that is actually doing.
So the retpoline is a replacement for any indirect jump.
And it uses a call instruction, followed by manipulating the return address of that call on the stack,
followed by a return. So effectively, you want to jump to the contents of
address 1234, which goes off to, say, 5678. Instead of just doing a call indirect through 1234, you do a call to
some other label, call foo. At foo, you say: hey, smash the stack, replace the return address
on the stack with the contents of 1234. So now, instead of the return address being
back to where I came from for my call, the return address is now pointing to the indirect function I wanted to get to,
and then you do a return.
And the return does the indirect jump.
And the reason this is cool is because
the processor is not smart enough to predict that.
And so if it's doing a speculation,
it predicts that it goes back to where it came from,
which of course is not going now.
And so what you'd make sure is
that after your original call instruction, you just have an
infinite loop that jumps to itself, so that any indirect call, if speculated incorrectly,
will speculate into an infinite loop, which has no side effects that anyone cares about and can't
be controlled by the outside world.
So that's... yeah, go on. You sound like you don't quite get this.
I'm not surprised, it's a
complicated topic. And I do get what the mitigation is that you're talking about. But you
and I, Matt, have had these conversations about how the way we write code affects processor design,
and processor design affects the way that we write code, right? So you say the CPU is not smart
enough to predict that we're going to
do this, therefore it works to mitigate the effects of Spectre. I accept that. However, if we make this
a common idiom in our code, it seems like the kind of thing that CPU vendors are going to start to
optimize for.
Absolutely, but they're well aware now of the problems here. So
the other mitigations that are coming out now are actually from the CPU vendors themselves.
Obviously, this retpoline thing can be used in browsers, and it can
be used in the kernel, to try to reduce the ability to use indirect jumps to your
advantage using Spectre. The other problem, mispredicting indirect jumps by forcing branch target buffer
collisions, can only really be mitigated by the CPU vendors themselves. And so they have
issued microcode patches which allow the kernel to flush those tables, or to not trust them. So as you go in and out of kernel mode, various things happen:
either the kernel decides to completely flush the branch target buffer, or
the branch target buffer is somehow tagged with "this came from user mode" versus "this came from
kernel mode", and therefore the speculation system is not allowed to speculate. I'm a bit vague on this,
and Intel are also a bit vague; we're only getting
this from reading the kernel patch notes. But there's a whole bunch of interesting new model-specific
registers that have been put in that allow these kinds of features to be turned on and off.
Which is remarkable for two reasons: one, that they've had to do this, and two, that they are able
to do this. It's amazing, the amount of changes they can make to your processor just with a microcode update.
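The call, stack-smash, return sequence Matt describes can be sketched in x86-64 assembly, following the retpoline construct Google published (this is an illustrative sketch, not any particular compiler's exact output):

```asm
# Retpoline replacing an indirect jump `jmp *%rax` (x86-64, AT&T syntax).
retpoline_rax:
        call    .Lset_up_target      # push a return address, jump below
.Lspeculation_trap:                  # a speculated `ret` lands HERE...
        pause                        # ...and spins harmlessly: no side
        lfence                       # effects, not attacker-controllable
        jmp     .Lspeculation_trap
.Lset_up_target:
        mov     %rax, (%rsp)         # smash the stack: overwrite the return
                                     # address with the real branch target
        ret                          # this return IS the indirect jump
```

The `pause`/`lfence` loop is the "infinite loop that jumps to itself": the return stack predictor assumes `ret` goes back after the `call`, so any misspeculation is captured there instead of at an attacker-trained target.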
That's slightly disturbing.
Yeah.
Who knows what else it could be doing there.
Yeah.
I wanted to interrupt this discussion for just a moment
to bring you a word from our sponsors.
Backtrace is a debugging platform that improves software quality,
reliability, and support by bringing deep introspection and automation throughout the software error lifecycle.
Spend less time debugging and reduce your mean time to resolution by using the first and only platform to combine symbolic debugging, error aggregation, and state analysis.
At the time of error, Backtrace jumps into action, capturing detailed dumps of application and environmental state. Backtrace then performs automated analysis on process memory and executable code
to classify errors and highlight important signals such as heap corruption, malware, and much more.
This data is aggregated and archived in a centralized object store,
providing your team a single system to investigate errors across your environments.
Join industry leaders like Fastly, Message Systems, and AppNexus
that use Backtrace to modernize their debugging infrastructure.
It's free to try, minutes to set up, fully featured with no commitment necessary.
Check them out at backtrace.io.
So, as C++ programmers, who really needs to think about this in detail?
Like, do you only need to care if you're a browser developer or if you're a,
you know,
operating system developer?
I mean,
to the extent that the performance affects us all,
I think it's useful for us to have some,
at least hand-waving understanding of what these things are about.
Um,
but that affects anyone who's writing in, you know, Node.js or whatever; everyone's noticing some slowdowns. I think
the kind of people who need to know about this at the nuts-and-bolts level are probably,
yes, browser vendors, people writing kernel code, be it modules or operating systems themselves, or
anything that has a sandbox. So one of the attack vectors here, for example, was something
called eBPF, the extended Berkeley Packet Filter system inside Linux, which was originally used to filter packets and has
its own mini language, and that mini language gets JIT-compiled into the kernel. And, you know,
fun and games begin once you can JIT code into the kernel. So it does
affect us all to some extent, but I think the people who
really have to worry about it are the people who are already worrying about it: those folks
at Google, those folks at Amazon and Intel and the other spots that are looking into this
actively. I mean, it's amazing, though, the fact that these things are coming out of
the woodwork now. I think actually you tweeted about this, Jason, the Xbox 360 bug, where again a misprediction caused some
strange side effects in the cache. It actually ended up causing an issue where
a non-temporal prefetch was effectively poisoning cache state in a way that was bad, and
they kind of put it behind a flag, like, if this thing is not enabled, then don't do it. But of course the speculation would sometimes go
wrong, and it would do it anyway, and then roll it back. Yeah, it's just,
it's a scary world out there, but it's also super exciting. I mean, if anything, this hopefully
will cause people to go, wow, I had no idea that my chip was doing all this stuff behind the scenes.
And, you know, the more people that understand what's going on, the more people whose interest is piqued by these things going
on, the better, as far as I'm concerned. I think this is the most exciting thing about what we do,
and certainly for C++ programmers, we're that much closer to this kind of stuff, so I think it's important
to know.
Yeah, that article that you just mentioned, I think, went a long way to helping me understand what was going on here.
Because this was effectively a broken instruction, like you could not safely use this instruction because it would corrupt your cache. And they had to go to the lengths of making sure that
that instruction was not in their binary at all. Right. I mean, for the longest time, Intel have,
you know, you start looking back at old pieces of advice that Intel have been giving.
And one of the things they said is like, you know, even after an indirect branch, or after a switch statement branch that, you know, is never taken:
don't just let the program, if you're emitting code, just fall off the end.
Put a bunch of undefined instructions like UD2, which is the well-defined undefined instruction;
it's the trap instruction, effectively. Put enough of those to fill at least the
rest of the cache line, if not more, just because who knows what would happen if the processor
decided, for whatever reason, to fall off the end on a mispredict or whatever. It might interpret
those bytes as being any number of things, which might have some strange side effects. And again,
you know, we should have really raised the red flag at that point and gone: well, what
side effects can it possibly have if it never actually completed execution? You know, if these
things were speculatively executed, we're protected, right? Apparently not.
Right, right. So I keep ending up in these conversations, Matt, about how, you know, TBAA, strict aliasing rules, they break our optimizations.
We can't enable strict aliasing optimizations, people talk about -fno-strict-aliasing, stuff like that, because it's unintuitive to C++ programmers, and they're
optimizations that break valid code, right, effectively? Right. And I'm
just curious if I could get some feedback from you, because you're kind of known for, you know, caring about
performance; like you've been talking about, you care about how the CPU works.
Absolutely. Yeah, so
this is something that frustrates me, when I see the kind of myth that's grown up around this. It's like: oh,
if I put -O3 on, it breaks my code; or, I just have to turn on -fno-strict-aliasing,
that kind of stuff. If your code breaks because of that, your code
was broken before, in my opinion, which is obviously a strong opinion. I compile all
my code with -O3 and with strict aliasing on. So let's talk a little bit about what that actually
means. The C++ standard talks about the kinds of pointers that can alias. And what is aliasing?
If two pointers alias, or may alias, the compiler has to assume that they might be pointing at the same underlying object.
And that prohibits a whole class of optimizations.
So the canonical case is something where you're taking two arrays of numbers, multiplying and adding them together element by element, and writing the results into yet a third array. All the vector instructions and those kinds of things want to be able to assume that you're not going to be modifying
one of the source arrays by writing results into it as you go. And this is a bit like
memmove versus memcpy: if the ranges don't overlap, then of course
optimizations can happen, but if they do overlap, you have to be a lot more careful. So,
in general, the compiler wants
to be able to assume, if you have two pointers to things, that if it can possibly prove that they
can't point at the same actual memory, then that's all the better. Now, TBAA is type-based
alias analysis, as I think Clang calls it; presumably other compilers do too. It's where the C++ standard has said,
this is the set of things that may be assumed to not alias.
If these two types are not in any way related,
if you've got a foo and a bar
and they're not inherited from each other
and they're completely separable,
you cannot take a foo, cast it to be a bar,
and expect things to work well for you.
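As a sketch of the kind of optimization this permits (an illustrative function, not one from the episode): because int and float are unrelated types, a write through the float pointer below cannot legally modify what the int pointer refers to, so the compiler is free to vectorize without emitting runtime overlap checks.

```cpp
#include <cstddef>

// Under strict aliasing, a store through `out` (float) cannot change
// `in[i]` (int), so the loop can be vectorized without checking whether
// the two ranges overlap. With -fno-strict-aliasing, the compiler has to
// be more conservative.
void scale_all(float* out, const int* in, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = static_cast<float>(in[i]) * 2.0f;
    }
}

// Tiny demonstration with illustrative values.
float demo() {
    const int in[3] = {1, 2, 3};
    float out[3] = {};
    scale_all(out, in, 3);
    return out[2];  // 3 * 2.0f
}
```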
Okay.
But unfortunately, that's the kind of thing that we learned from our C programming days. The canonical example that
I remember from my games programming days, back in the day when floating point units were slow,
is that to test whether or not a floating point number was negative, you would look at the
bit pattern by casting it to be an int pointer, and then reading it back out and saying, is the top bit set,
knowing that that's where the sign bit was. That kind of thing was pretty much prolific in
code. You just cast things backwards and forwards, and compilers back in those days, when I
was in the industry, weren't smart enough to do anything about it anyway, so everything just worked.
And I think we kind of grew up and assumed that that's the way you have to write code in order for it to be performant, or that you're just allowed to do it.
And so that's why it's a lot of a surprise, I think, to people who have come from that mindset that you're not really allowed to do that kind of thing.
And then there have been various workarounds using unions, which doesn't work, so don't do that. And yes, I think, as
C++ teachers, as we all are, we could probably do a better job of explaining what the rules really
are. And in fact, I think the standards committee themselves are still a bit vague on some of the
more subtle things; certainly I've been chatting with people about some of the wording being confusing
to me. The canonical way around this seems to be to use memcpy. So to go back to my example,
between like an int and a float, if you want to get the bit representation of a float into
an int, you memcpy from the float into a new int on the stack and do the check there. And the compilers are
smart enough to optimize away the memcpy, and you haven't violated any
of the TBAA constraints. And then there are other get-outs for char arrays or std::byte arrays, where
you may take an array of bytes and then one-off interpret them as another type of structure.
So, this is my understanding,
and again, this is perhaps something to do with the lack of clarity about how this stuff all fits together:
the common idiom of getting a char buffer, reading from a file or from the network,
and then casting that buffer to be the Foo* that you know is in there, that's okay.
But you can't then cast it to be a Bar*
and expect it to work immediately afterwards.
You have to make sure that the Foo* falls out of scope
and you get a new object to point at it.
Otherwise, you're into the whole std::launder world,
which we don't want to talk about right now, I'm sure.
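Matt's sign-bit example, spelled out as a sketch: the cast version violates strict aliasing, while the memcpy version is well-defined, and optimizing compilers turn it into a single register move.

```cpp
#include <cstdint>
#include <cstring>

// The broken old-school way (violates strict aliasing; shown only as what
// NOT to do):
//   bool is_negative(float f) { return *(uint32_t*)&f >> 31; }

// The sanctioned idiom: memcpy the bits into a fresh int on the stack and
// test there. The copy itself is optimized away.
bool is_negative(float f) {
    static_assert(sizeof(uint32_t) == sizeof(float), "assumes 32-bit float");
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return (bits >> 31) != 0;  // IEEE-754: the sign bit is the top bit
}
```

Note that, because it inspects the bit pattern rather than comparing with `< 0`, this reports true for `-0.0f` as well.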
So in my experience,
I have not found any performance problems with either doing it right,
by having a char array or a std::byte array and doing the one-off cast to the right type and then
doing your pointer gymnastics afterwards, or else, in the very few cases where I have had to directly
type-pun, using memcpy to copy from the bit pattern of the old thing to the
bit pattern of the new thing and then use the new thing; the compiler will throw all of that away.
The only argument against the memcpy thing that I've seen has been from the embedded world, where
oftentimes they have to run debug images, for a variety of complicated and not worth going into right now reasons, on
their hardware. But their hardware doesn't get faster in debug, it doesn't have more memory
in debug, and in debug mode, with no optimization on, the memcpy is not taken away, and that can
have some deleterious effects for them. But, I mean, they already have a whole
bunch of problems there, so unless you're in that world, I wouldn't necessarily worry about that.
And we don't really want to get into std::launder,
but would std::launder help there?
I am not qualified to answer that question.
I'm not even sure that anybody is right now.
That was officially approved for C++17, right?
I think so, yeah.
Yeah, I think that's C++17; I'm vague on it. And it's not something I plan on using. I guess we have an internal library
for variant, and I think that'd probably be the only place where we'd have to use it. But it's really complicated and involves constness and other aspects that I don't like.
So could you give, like, a general... I guess,
when does this come up for the average C++ user? Or does it come up for the average C++ user?
So I think probably it comes up most when you are reading and writing files, trying to use
structures to sort of map over those files,
rather than parsing them byte by byte, which is common in the games world, and it's common in
networking code. And I think even then, most of the things you're already doing are probably fine: if
you've just got your big char array and you read into it and then cast it, I think you're okay.
Yeah, go on.
All
of the code that I've basically ever written in C++ has been cross-platform, and I've had to worry
about big-endian, little-endian, multiple CPU architectures. So I've never actually
written code like that, because I couldn't assume that it would work on the next target that I was
going to be writing on.
Right. And I think that probably is what affects me, for example. We still have to
deal with endianness and we just have a template class that knows which endianness it needs to be.
And that's the thing we map over, but that gets complicated. But I remember actually vividly,
Jason, when you came in to present at our company, us having that conversation about
reinterpret_cast. You looked at me like, I don't think I've ever used reinterpret_cast. I'm like,
oh my god, if I grep reinterpret_cast in my code base, you would have a heart attack.
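The endianness-aware template class Rob alludes to might look something like this minimal sketch. The name and design are illustrative, not their actual internal library:

```cpp
#include <cstdint>
#include <cstring>

// A field type you can lay over wire-format data: it stores its bytes
// exactly as they appear on the wire (big-endian here) and converts on
// access, so the code using it never cares about host endianness.
class BigEndianU32 {
    unsigned char bytes_[4];  // stored exactly as on the wire
public:
    uint32_t value() const {  // assemble from bytes, host-endian-agnostic
        return (uint32_t(bytes_[0]) << 24) | (uint32_t(bytes_[1]) << 16) |
               (uint32_t(bytes_[2]) << 8)  |  uint32_t(bytes_[3]);
    }
    void set(uint32_t v) {
        bytes_[0] = (v >> 24) & 0xff;
        bytes_[1] = (v >> 16) & 0xff;
        bytes_[2] = (v >> 8) & 0xff;
        bytes_[3] = v & 0xff;
    }
};

// Helpers for demonstration.
uint32_t roundtrip(uint32_t v) {
    BigEndianU32 b;
    b.set(v);
    return b.value();
}

unsigned first_wire_byte(uint32_t v) {
    BigEndianU32 b;
    b.set(v);
    unsigned char raw[sizeof b];
    std::memcpy(raw, &b, sizeof raw);  // inspect the on-the-wire bytes
    return raw[0];
}
```

In a real codebase this would be a template over the integer width, and the one-off-cast rules from the aliasing discussion still apply when mapping such structs over a received buffer.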
And that remains true. I mean, it is one of the things where when performance counts,
one of the best things about C and C++ is that having a well-defined binary layout for structures allows
you to very easily map over things that you've read in, be that shared memory, or be that
something you read from a file or something you read from a network. But, you know, we're already
starting to go off the reservation when you're talking about shared memory and things like that,
because again, the C++ model has no idea that a piece of
memory that you're talking to might change, not because of the code that you're running in
this process, but because some network card DMA'd into it. So, you know, we're in a weird world there. And I'm not even sure, now I say that out loud, how strictly standards-compliant
our code is around exactly that kind of area.
So, yeah, there I'm relying on the compiler not really being smart enough to realize that that's happening.
Okay.
Well, Matt, it's been great having you on the show today.
Thank you so much for coming on
and giving us your understanding of these new issues
we have to worry about as programmers.
Well, thanks for having me.
I've had a great time as before,
and I hope that somewhere in the middle of all that talking
there is a little bit of a glimmer of understanding
of what Spectre and Meltdown are,
and hopefully some interest in your listeners
to go and investigate more about what the crazy things
your processors are doing.
I'm sure.
Okay, thank you, Matt.
Thanks.
Thanks. Thanks so much for listening in as we chat about C++. I'd love to hear what you think of the podcast. Please let me know if we're discussing
the stuff you're interested in, or if you have a suggestion for a topic, I'd love to hear about
that too. You can email all your thoughts to feedback at cppcast.com. I'd also appreciate
if you like CppCast on Facebook and follow CppCast on
Twitter. You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter. And of course,
you can find all that info and the show notes on the podcast website at cppcast.com.
Theme music for this episode is provided by podcastthemes.com.