CppCast - Meltdown and Spectre

Episode Date: January 11, 2018

Rob and Jason are joined by Matt Godbolt to talk about the Meltdown and Spectre vulnerabilities and how they affect C++ programmers. Matt is a developer at trading firm DRW. Before that he worked at Google, ran a C++ tools company, and spent over a decade in the games industry making PC and console games. He is fascinated by performance and created Compiler Explorer to help understand how C++ code ends up looking to the processor. When not performance tuning C++ code he enjoys writing emulators for 8-bit computers in JavaScript.

News
- More C++ Idioms
- C++ Tips of the Week (Abseil)
- Retpoline: a software construct for preventing branch-target-injection
- GCC 8.0 supports std::filesystem now

Matt Godbolt
- @mattgodbolt
- Matt Godbolt's blog

Links
- Compiler Explorer
- CppCon 2017: Matt Godbolt "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid"
- GOTO 2016: Matt Godbolt "Emulating a 6502 system in JavaScript"
- GOTO 2014: Matt Godbolt "x86 Internals for Fun & Profit"
- Patreon: Matt Godbolt is creating Compiler Explorer
- Finding a CPU Design Bug in the Xbox 360
- Meltdown and Spectre
- Vulnerability Note VU#584653

Sponsors
- Backtrace
- Embo++

Hosts
- @robwirving
- @lefticus

Transcript
Starting point is 00:00:00 Episode 133 of CppCast with guest Matt Godbolt, recorded January 9th, 2018. This episode of CppCast is sponsored by Backtrace, the turnkey debugging platform that helps you spend less time debugging and more time building. Get to the root cause quickly with detailed information at your fingertips. Start your free trial at backtrace.io slash cppcast. CppCast is also sponsored by Embo++. The upcoming conference will be held in Bochum, Germany from March 9th to 11th. Meet other embedded systems developers
Starting point is 00:00:30 working on microcontrollers, alternative kernels, and highly customizable zero-cost library designs. Get your ticket today at embo.io. In this episode, we talk about C++ tips and filesystem support in GCC. Then we talk to Matt Godbolt, creator of Compiler Explorer. Matt talks to us about the Meltdown and Spectre attacks. Welcome to episode 133 of CppCast, the only podcast for C++ developers by C++ developers. I am your host, Rob Irving, joined by my co-host, Jason Turner. Jason, how are you doing today? Doing okay, Rob. How are you doing?
Starting point is 00:01:43 Doing okay. Starting to settle more into the new year. We had that crazy blizzard scenario on the East Coast last week. It didn't hit North Carolina too much; we got like an inch of snow, but of course in North Carolina even just that was pretty paralyzing for days. And North Florida, Tallahassee, had snow for the first time in 27 years or something ridiculous like that. Yeah, pretty nuts. Yeah. And meanwhile, I have a balmy 60-degree day today coming up in Denver. That's unexpected.
Starting point is 00:02:17 Yeah. I guess we took all the cold weather for you. Okay, well, at the top of our episode, I'd like to read a piece of feedback. This week, we got an email from Dimitri. And Dimitri writes, thank you for the podcast. Last time, you mentioned patterns as a potential topic. And Nicole mentioned some idioms when you asked her about patterns. As a coincidence, I was looking at More C++ Idioms, which is an online book on wikibooks.org,
Starting point is 00:02:49 and found it a very nice initiative to describe idioms as a fully available book, especially taking into account that nowadays knowledge about C++ is often transferred by naming some of them. So it might be worth mentioning. And I wasn't really familiar with Wikibooks, but it does look like this is a pretty good free resource available, and we'll put that in the show notes. Interesting. Yeah, I'm familiar with Wikibooks. I was not familiar with this particular one. The list of idioms is pretty long.
Starting point is 00:03:15 They have a total of 91 C++ idioms, and a lot of these don't immediately look familiar to me. So, yeah, there's a lot there. Yeah. Some of them do have TODOs listed on them, though. So I guess this book is still somewhat a work in progress. Yeah, and I think this kind of thing accepts user contributions. Right, right.
Starting point is 00:03:37 Well, we'd love to hear your thoughts about the show as well. You can always reach out to us on Facebook, Twitter, or email us at feedback at cppcast.com. And don't forget to leave us a review on iTunes. Joining us again today is Matt Godbolt. Matt is a developer at trading firm DRW. Before that, he worked at Google, ran a C++ tools company, and spent over a decade in the games industry making PC and console games. He's fascinated by performance and created Compiler Explorer to help understand how C++ code ends up looking to the
Starting point is 00:03:55 processor. When he's not performance tuning C++ code, he enjoys writing emulators for 8-bit computers in JavaScript. Matt, welcome back to the show. Hi, thanks for having me back again. So have you done any other 8-bit emulators in JavaScript other than jsbeeb? Yes, my first emulator actually was a Master System, the Sega Master System. I don't
Starting point is 00:04:26 know if you guys had the 8-bit version, you know, sort of around the same time as the NES. Yes, that was my first console. And I wrote the emulator... I actually originally wrote an emulator in ARM assembly for the Archimedes, which was a popular home computer back in the 90s. And I wrote it so that I could re-complete my favorite game ever, which is Wonder Boy 3. So I completed it under emulation, having completed it on the original, and then I was later able to re-complete it for the third time in my web browser, using my own emulator that I'd ported to JavaScript. So yeah, I had a lot more spare time back then. You know, to be fair, I realized as you were saying that, that I know the Master System from when I lived in Europe,
Starting point is 00:05:13 but I don't know if it was popular or known in the US at all. But Rob's too young to know. I've never heard of it. It was a disappointment to me. I remember I saw the coin-op of Altered Beast, and I was taken in by looking at the box art, as it was back then. You look at the box and you're like, wow, this looks just like the coin-op.
Starting point is 00:05:34 And I got it, and it was not just like the coin-op at all. It was pretty terrible, as you can imagine. But, you know, those were the times. I know we definitely got a version of the Master System in the form of the Game Gear. Right. Yes, of course. Yeah. I think it had a few extra bells and whistles, but essentially it was a Master System with, I think, an FM chip for the sound, which I think the Japanese variant of the Sega Master System had too. So yeah, this is a deep, dark hole we can go into for many hours if we're not careful. Well, the main reason I brought it up is, I'm actually curious
Starting point is 00:06:11 if there's any version of your jsbeeb talk on YouTube. There are, I think, two different versions of it. Okay. Because, yeah, I gave one as an extra content talk at CppCon this year, which was like an updated version of it. Which is the one I saw, yeah. That's right, yeah. Yeah, for the two older versions, if you just search for jsbeeb, I think, you'll find them on YouTube. There's one I did at my company,
Starting point is 00:06:39 and there's one I did for the GoTo conference. Again, sort of just explaining roughly how it works and all of the weird and wonderful things you discover along the way and just how much power you can get out of 3,000 transistors that power the whole thing. It's amazing. Absolutely amazing.
Starting point is 00:06:55 Well, I do recommend, if anyone hasn't seen that talk yet, to go watch it. The one that you gave at CppCon was a lot of fun. I'm thinking, this should have been recorded. Glad to hear there's other versions of it online. So Matt, we have a couple news articles to discuss. Feel free to comment on any of these, and then we'll start talking to you more about
Starting point is 00:07:17 various things. Okay, sounds great. Okay, so this first one: we talked about the C++ Tips of the Week that have been going around at Google for years when we had Titus Winters on the show, right after CppCon 2017. Yes. And I'm not sure, is this the first official release of them going online, or is this just an updated list of the new ones? Did you see that, Jason?
Starting point is 00:07:40 I believe this was an 'and we have released more' announcement that came up recently. Okay. So yeah, these are the Abseil Tips of the Week, and I'm guessing you're pretty familiar with these, Matt, since you used to be a Googler, right? Yes. Some of these are familiar; I think some of them are after my time there. I noticed that episode one was string_view, which I do remember. I think the Google equivalent back in those days was called StringPiece, and it was used pretty much throughout the code base as a convenience thing, and it's one of the first things that I redeveloped when I left the company and needed to solve the same problem
Starting point is 00:08:13 myself. So, yeah, they're great. And the thing about these Tip of the Weeks is that they are short and to the point, and they are often the kind of thing where you're like, oh yeah, I remember that vaguely, but I'd never really looked into it. And they're great to print out and, like, stick around the place. I think it's modeled vaguely on Testing on the Toilet, which was an idea they had to try and promote all of the ideas of testing your code, where on the inside of the stalls they would print out and stick up, weekly, and replace: here's a tip for testing your code. And I think Tip of the Week is the similar kind of thing. You know, you could print it out.
Starting point is 00:08:51 If you have a C++ shop, you know, replace them once a week in the cubicles. And then you've got something to look at when you're taking a moment to yourself. Seems almost a little intrusive somehow. I'm not sure what's the right word I want there. Well, it's one of those more interesting things because, of course, the bathrooms are one of those places where even visitors would have to visit at some point. So, you know, there was a limit to what they could put in these sheets
Starting point is 00:09:20 because of potential IP violations, with other people going and taking a leak and then discovering the secret sauce of Google. That's crazy. Anyway, we've already brought the show down. Okay, this next one we have is also coming from Google, and this is retpoline, which is a software construct for preventing branch target injection. And I guess we're going to be talking more about this topic with you, Matt, because I think you've dove in kind of deep into this Spectre attack that came out over the past few weeks. But this is a Google project for mitigating branch target injection, right? So yeah, as you say, we'll probably chat about this a little later. It's a pretty
Starting point is 00:10:08 complicated thing, but ultimately it's a way of making the compiler emit code that is less susceptible to the kinds of attacks that we've been seeing in Meltdown and Spectre. So if you're writing a piece of code that has privileged access to memory, maybe it's a kernel, maybe it's a JIT that does some kind of sandboxing of the rest of the process, you know, like your browser is doing, preventing the JavaScript that you're executing from being able to see, like, the passwords that are living in the same address space as your JavaScript, then perhaps it's worth compiling with this flag, -mretpoline, which I believe has landed in Clang already, to mitigate some of the attacks.
Starting point is 00:10:49 And we'll, as I say, try to explain a bit more later, which will be hard without a whiteboard, but we'll see what we can do. I'm sure we'll come up with something. I do want to point out that the first example they have there, a common C++ indirect branch in this example: no compiler would actually compile this as an indirect branch. I'm just saying, because every compiler can totally trace this code
Starting point is 00:11:12 and would inline the indirect virtual function call. But, you know... Yeah, yeah. Okay, and then the last thing we have is GCC 8 now officially supports std::filesystem. And the commit just came out a couple days ago. So that's pretty exciting that they're moving so quickly on C++17 support. Yeah, this was mind-blowing to me, because I would have sworn that GCC already had filesystem support in there. It did, but only in the experimental branch. And so you have to go to std::experimental, and there's a whole bunch of things. I mean, we've used it in a couple
Starting point is 00:11:50 of new pieces of code that I've done, because I'm just, you know, dipping my toe in. And certainly whatever is on cppreference does not match what was in std::experimental::filesystem. I don't know who was right or wrong in that, but hopefully std::filesystem is more stable now. Yeah, the main difference that I've seen come up recently, I think on a Reddit discussion, was how the path joining works. If you're joining an absolute path
Starting point is 00:12:17 to the tail end of another path, what happens? Apparently, I guess the final decision was that if the right-hand thing is an absolute path, then it throws away the front thing, because if it's an absolute path, then you want it to be an absolute path. But anyhow, apparently that's one of the things that changed between experimental and the final release.
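For what it's worth, a minimal sketch of that finalized operator/ behavior (C++17 std::filesystem, on a POSIX-style system; the paths are illustrative):

```cpp
#include <filesystem>
#include <iostream>

int main() {
    namespace fs = std::filesystem;
    fs::path base{"/home/rob"};

    // Joining a relative path appends it, as you'd expect.
    std::cout << (base / "src/main.cpp") << '\n';  // "/home/rob/src/main.cpp"

    // Joining an absolute path throws away the left-hand side entirely.
    std::cout << (base / "/etc/hosts") << '\n';    // "/etc/hosts"
}
```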
Starting point is 00:12:38 Okay. I wonder if the stuff that was in GCC differs, because the thing that I recall being different is actually the handling of of is this a file that is actually at the end of it or get the attributes of this file and those kinds of things is this a file? the path manipulation aspect of it is you know like almost like text processing
Starting point is 00:12:54 and then there's the is this a directory is this a file what attributes does it have is it executable kind of level of thing that seem to be not as Cpp reference described so i'll be interested to know who was right and what we're going to get with the the file system that's actually in the standard right yeah and it's funny you mentioned is this a file because any file system operation is inherently is this a file well it was at the exact moment that you asked me to check. Now it's a directory.
Starting point is 00:13:29 There's no way of knowing. Okay, so let's start talking a little bit more about Spectre and Meltdown, which we briefly mentioned there. I know I didn't read a lot of tech news over the holiday break, and then when I started paying attention to Twitter again, I just saw people were angry at Intel. So I kind of read a little bit after the fact what was going on. But could you give us the breakdown of what Spectre and Meltdown are, Matt? Yeah, sure, sure.
Starting point is 00:13:55 So there's these two papers that have come out around about the same time that were under embargo, I think, for at least six months. People have tracked it back even further than that, that the various parties who were involved have known about it. And then Google chose to release information on both Spectre and Meltdown and all of their conversation around it. But they basically use some of the features that have been added to processors in the last 20 years to make them go faster:
Starting point is 00:14:25 caching, out-of-order execution, speculation, and branch prediction, all those things together, to get around various security checks inside the processor. So let's first of all talk about Meltdown, because Meltdown is potentially the scarier of the two. It's potentially the easier, in inverted commas, to work around and fix with a software patch. And indeed, if you are running anything on AWS or Google Compute, I believe all of those instances have already been patched. And there are performance aspects to the patch, which we can talk about in a second. So yeah, let's talk about Meltdown. Sure. Meltdown is an attack where, if you try to read a piece of kernel memory from an unprivileged user-mode process, obviously you're going to get a page fault. Normally you'll get a segfault, right? If you happen to know the
Starting point is 00:15:19 address at which some protected piece of memory in the kernel lives, and you try and read from it in user space, then you get a segmentation fault, and then your process gets killed, right? Makes sense. Now, you might ask why you could even read it at all, why you would even know that it has an address in your process space. And that's because, normally, when you want to switch in and out of kernel mode, like you're opening a file or accessing the network or whatever, you need to go into kernel mode, and the kernel wants to be able to read both your memory
Starting point is 00:15:41 and its own memory. And so they're actually mapped into your process at all times; it's just that some of it is marked as 'you can't read this from user mode' and 'you can't write to this from user mode', as opposed to it being not there at all. Okay. So it's a convenient speed-up for the kernel: it makes going in and out of kernel-mode calls much, much quicker, and everyone's happy, right? And of course, the processor guarantees that user mode can't read or write to these mapped but inaccessible pages of kernel memory, so no problem. Excellent.
Starting point is 00:16:18 Except that inside a processor, every instruction runs effectively in a sequence determined by the interdependence of instructions, right? So the processor can issue a whole bunch of things at once if it can prove that they are either not dependent on each other, or not dependent on other things which haven't yet completed. All right, let me try and think about this a better way. A load instruction is going to read from memory, and it's going to make the result available to the next instruction that depends upon it. If the check inside the processor for 'is this a valid piece of memory that I'm allowed to look at' happens sort of asynchronously, then obviously there's a window of opportunity between me reading the data and the fault going
Starting point is 00:17:14 off and saying, whoa, you shouldn't have been able to read that piece of data. Now, if something's in the level one cache, Intel wants to make it as fast as possible for you to access it, so it doesn't want you to wait for all of the access checks to clear before it reads the memory out of the level one cache. Okay, so far: maybe one cycle, and you've got the memory out of L1 cache. And some background process on the CPU is going to take two or three cycles to determine: oh wait, I've just checked the page table; you're in user mode, that's kernel-mode memory, you shouldn't have been able to access that. And so it will mark the instruction at that point as: whoa, when that instruction completes at the end of the pipeline, please fault, cause an interrupt to happen there,
Starting point is 00:17:56 throw away everything that happens after that instruction. It's as if nothing happened. Now, this happens all the time inside the processor. So when your branch predictor gets something wrong, it has to basically undo everything up to the branch and redo from there. So the same mechanism is being used at this point: the instruction that has read some data it shouldn't have read is going to be undone anyway, so there's no harm, no foul. Now that's cool, right? We've got a process by which the processor can take a shortcut, and a bit later on it can work out that something should not have been allowed to happen, and it can make it all go away as if it never happened.
Starting point is 00:18:30 Except if the instruction that was issued after that read of kernel memory influences the cache in some way. The processor cache itself is not state that is rolled back, and that's the key to all of these attacks: they all rely on the fact that there is a side effect that is not undone when this speculative execution or this trapping instruction happens. And so, if you can come up with a way of doing something to the cache with that speculatively executed instruction that you can measure afterwards, you can see the ghostly remains of what had happened with the data before. So your attack looks something like this: clear all the caches, then call an operating system routine to ensure that the memory you would want to attack is now back in cache. So the
Starting point is 00:19:23 only thing that's in cache now is the kernel memory that was protected, that you want to read. And you can do that by calling a system call or something like that, making sure the kernel has executed and it's now in L1 cache. Excuse me. And then you do something like: attempt to read that memory yourself in user mode, and then touch a piece of memory whose address is a cache-line multiple of the byte that you read from the protected memory. So you have an array of, say, 65,000 bytes that you know is not in cache, and you have now gone 'array plus-plus', where the array is indexed by the value that you read back from kernel mode, times, say, 64, which is the size of a cache line. So, you know that instruction will never
Starting point is 00:20:11 actually complete, because the processor will throw it all away. But if you've done it quickly enough, if it's close enough after the previous instruction that read the protected byte, then it will have brought into the cache a piece of memory whose address is dependent on the value you read from protected kernel mode. So let's say the byte you read from kernel memory was 10. You're going to read address 64 times 10 in an array somewhere. Doesn't matter what you do with it. That read gets faulted;
Starting point is 00:20:37 you catch the segfault in your program that's doing the probing, so now you've handled the fault that would otherwise have crashed the process. And now you go and you look at the cache, and you say, okay, which of these cache lines is now in the cache? And you measure how fast it is to read the zeroth one, the first one, the second one, the third one, the fourth one, the fifth one... and when you get to the tenth one,
Starting point is 00:21:04 suddenly that one's really fast, and you're like, oh, that's interesting. That must have been in the cache. Whatever's in the 10th element of my array is in cache, and therefore the value that I read from kernel mode must have been 10. Okay. Does that make enough sense? It's, again, a little bit easier with a picture. But the general process relies on the fact that if you can influence the cache in a way that depends upon a value read speculatively from the kernel, then after the fact, after it's all been undone, you can go and look at the cache
Starting point is 00:21:36 and, from the state afterwards, read the tea leaves in the cup after the fact and go: ah, the only way it could look like this is if I had read the value 9, or 10, or 11, or whatever. And then you can just keep doing this over and over again, and keep probing, and read basically every byte of kernel memory.
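To make the shape of that concrete, here is a minimal sketch in C++ of just the probing half Matt describes. It is not a working exploit: the faulting speculative read itself is omitted, the names are illustrative, and real attacks need careful serialization (lfence/rdtscp) that is skipped here for brevity.

```cpp
#include <cstdint>
#include <cstddef>
#include <x86intrin.h>  // _mm_clflush, __rdtsc (GCC/Clang on x86)

constexpr std::size_t kLine = 64;              // cache line size
alignas(4096) std::uint8_t probe[256 * kLine]; // one line per possible byte

// Step 1: evict every probe slot from the cache.
void flush_probe() {
    for (std::size_t v = 0; v < 256; ++v)
        _mm_clflush(&probe[v * kLine]);
}

// Step 2 (omitted): speculatively read the kernel byte and touch
// probe[secret * kLine] before the fault is resolved and rolled back.

// Step 3: time each slot. The one that loads fast was pulled into the
// cache by the rolled-back speculation; its index is the leaked byte.
int recover_byte() {
    int best = -1;
    std::uint64_t best_cycles = ~0ull;
    for (int v = 0; v < 256; ++v) {
        std::uint64_t t0 = __rdtsc();
        volatile std::uint8_t sink = probe[v * kLine];  // timed load
        (void)sink;
        std::uint64_t dt = __rdtsc() - t0;
        if (dt < best_cycles) { best_cycles = dt; best = v; }
    }
    return best;
}
```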
Starting point is 00:22:15 That's really, really bad, because of course I'm an unprivileged-mode process, and I can now just stream through the entirety of the kernel address space, which means I can read everything the kernel can see, which is every other process. On Linux, it also means that the kernel maps in all of physical memory as well, so you can just basically read the entirety of physical RAM, which means any keys that are in the kernel you can read, any other process you can read. It's really, really, really bad, and it's essentially undetectable, because you're just doing something which is going to get cancelled over and over and over again, and then reading the side effect. So that's bad. How can it be worked around? Well, there's a patch called KAISER for the Linux kernel, which stops the kernel from mapping its own pages into user space. And that pretty much fixes the problem. Pretty much.
Starting point is 00:22:47 Pretty much. There are some pages that absolutely have to be in the user space, that is like the interrupt tables and things like that, things that the CPU requires to see that are part of the kernel's view of the world. But it's a fairly minimal set, and there's nothing too scary in there. The only thing you can do with those, I believe,
Starting point is 00:23:05 is use them to determine where the kernel is living in memory. And even that can be worked around, and obviously there are mitigations for other attacks, like address space randomization: the kernel tries to put itself in a random spot every time, to stop people from making attacks based on knowing where things live in memory. The problem with this, the drawback,
Starting point is 00:23:26 is that it slows down system calls. People are seeing a few percent slowdown in, like, general workloads, and in heavy cache-and-disk-access code, somewhere between 30 and 50 percent, I'm seeing reports of now. Obviously, there's a lot of hyperbole out there, and I had a person actually IM me just before this chat saying, hey, I'm just seeing my search queries have gone from one second to being more like 10 seconds; what am I doing wrong? I'm like, well, I suspect we know what it could be. So that's unfortunate, but again, there is a workaround for it. It's worth also noting that this particular issue only happens on Intel processors, and apparently some ARM processors. Other processors
Starting point is 00:24:13 don't allow this kind of speculation that depends on values, or they substitute in, like, a zero value for what was read from L1 cache if it turns out that you weren't supposed to be able to read it, or they have some other performance characteristic that means that the check happens before any further instructions can use the value they shouldn't have read. So this is sort of Intel-specific; again, I've seen some reports that some ARM processors do this too. And you can understand why: performance is king.
Starting point is 00:24:53 They're trying to make this thing go fast. You've got a single cycle, which is like a third of a nanosecond. It's an insanely small amount of time to do anything, let alone all the checks that you'd have to do to make sure that you're supposed to be able to read memory. So it's sort of forgivable, I think. And the workaround seems to be reasonable. Obviously, I think as time ticks on, we'll see more and more people have ideas for improving the speed of maybe that kernel transition. There are a number of, I mean, I know mostly about Linux in this particular topic, there are a number of system calls that are
Starting point is 00:25:18 sort of pseudo system calls, like getting the time of day and other time-sensitive things. They are actually implemented in user space, with a magical mapping of the kernel into user space called the vDSO, which is like a virtual shared object; I can't remember what the D is. And so those are unaffected by this. So calling gettimeofday or other time-based stuff is still as fast as it was before, which is good news for people like me, who like to measure how fast their code is while slowing it down as little as possible. So, you know, that's... These are all mitigation things you can do.
Starting point is 00:25:56 I've also seen some reports that there are other side effects to do with the translation lookaside buffer, which is used in further aspects of the caching of the memory protection hierarchy. And there is something called a PCID, a process-context identifier, that the kernel can use if it's available. And I'm seeing some reports that if your CPU is old enough that it doesn't support PCID, the cost of mitigating this is so high,
Starting point is 00:26:23 as the entire cache hierarchy has to be flushed, and all the TLBs, that it's not enabled by default. So even if you patch your system, if it's old, you should probably look into whether or not you're affected by this, if you don't have this PCID. And you can do cat /proc/cpuinfo, and if it has the word pcid somewhere in there, you're probably okay. I don't know any further than that. So on a significantly old system, they're not patching it at the operating system level, because it would cause too much of a slowdown? That is my understanding, yes.
Starting point is 00:26:52 I mean, they're patching it, but the operating system goes, I don't have PCID. Do you really want your processor to return to the 1960s? You know, no. So I maybe won't turn this on. Like emulating an 8-bit system in JavaScript. Exactly like that. So anyway, that's Meltdown.
Starting point is 00:27:10 It is the more severe of the two, I think, because it lets you read kernel memory, but it can be worked around, with the caveats that we just talked about. Spectre, on the other hand, is more complicated, so I probably won't go into the huge details, given how much I tie myself in knots with these explanations without a whiteboard. Spectre uses more of the speculation.
Starting point is 00:27:34 So that is the fact that the CPU likes to get ahead of itself. As we know, it tries to sort of predict where it's going. And even though instructions haven't yet completed, the branch predictor has tried to guess where the control flow is going, and it's fetching and speculatively executing these instructions. And as we've just learned from the Meltdown effect, speculatively executed instructions leave a ghostly trace in the cache, which we can, through various nefarious ways, use to our advantage if we want to leak information from what was speculated.
Starting point is 00:28:05 So with Spectre, there are a variety of attacks, all of which basically target the branch predictor. So imagine your JIT code. You're in your JavaScript thing; you're running my little emulator, and I'm using all these arrays to mimic the 64K of a BBC Micro, right? And it's been JIT-compiled, because JavaScript emulators are super fast these days. You know that there's going to be a bounds check on me accessing my 64K block. So in JavaScript I've allocated a 64K block, and then I'm reading and writing to my 64K block, and that's going to be compiled by the JIT engine into assembly instructions that access, effectively, just a raw array in memory. Great. Except that it's going to have to bounds-check it, because it can't guarantee that I am not going to read off the end of it, and unlike C++, you know,
Starting point is 00:28:54 the browser wants to make sure that you can't do that. So, you know, the code is going to look something like: if index is less than 65535, or rather, if index is less than array.size, allow the thing to happen. Now, what happens if I access the array in bounds a million, million times, right? A billion times. Well, the branch predictor predicts that I never, ever, ever don't go into the reading code, right? It never skips the reading code. The branch predictor says, hey, to all intents and purposes, this branch is never taken; I never skip the read. Okay, you're right. And then suddenly I give it an out-of-bounds read. Okay, so the out-of-bounds read means the branch predictor
Starting point is 00:29:36 is going to carry on with the read anyway, right? The branch will be predicted to be not taken, so the code will fall into speculatively reading outside of the bounds of the array. Pretty immediately, though, the branch will be resolved, and it will say: whoa, that was out of bounds; undo all those instructions that you did. They're all thrown away, architecturally. We started doing them, but, you know, hey, we shouldn't have; undo that. And you can sort of start to see where this is going now, I think, you'd say. So if you can get enough work done, if you can tie up the array.size read for a while by flushing it out of the cache, that
Starting point is 00:30:17 means it's going to take 100 cycles to resolve the result of 'how big is my array'. You've got like 100 cycles to do some speculative work before the array size comes in and the compare completes, and then the branch predictor goes, whoa, I was wrong. So the three steps are: you train the branch predictor, you force a cache miss on the array size, and then you do your nefarious work inside your code.
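In C++ terms, the gadget being attacked has roughly this shape (a sketch with illustrative names, in the style of the published proofs of concept, not code from the show):

```cpp
#include <cstdint>
#include <cstddef>

std::size_t array_size;        // the attacker flushes this from the cache
std::uint8_t array[65536];     // the legitimately accessible buffer
std::uint8_t probe[256 * 64];  // timing side channel: one line per value

// Train the branch predictor with many in-bounds calls, then pass an
// out-of-bounds index. While the slow, uncached array_size load is
// pending, the CPU speculates into the body, and the dependent probe
// load leaves a value-dependent cache line behind even after rollback.
void victim(std::size_t index) {
    if (index < array_size) {                            // predicted taken
        std::uint8_t value = array[index];               // speculative OOB read
        volatile std::uint8_t sink = probe[value * 64];  // leaks 'value'
        (void)sink;
    }
}
```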
Starting point is 00:30:53 can tweak the javascript enough to generate the assembly that they want to generate and then they can measure the cache impact because the the bit i haven't really gone into is how you look at the cache after the effect and kind of say well which value did i get out how did and that involves some sensitive timing and another um aspects you know where if you've got access to c++ you can do cache flush instructions and you can you know you've got a lot more control whereas in javascript you have no such luxury but it can be done and they've proven that it can be done and that's that's scary but obviously you can only read the browser process so that's why it it is scary, but not as scary as Meltdown, where an arbitrary user mode executable could read anything.
Starting point is 00:31:30 So there are other things inside Spectre. So now we're talking about the Retpoline patch that we discussed in the news articles. That's an indirect branch. And obviously every virtual call that hasn't been inlined, as you say, Jason. Right. Every virtual call and any call to a DLL where it's going through the PLT involves an indirect branch. Processors want to try and hide that away from us. They want to make our life as good as possible.
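To make the indirect-branch point concrete, here is a minimal sketch (illustrative names): a virtual call through a base reference is exactly this kind of branch, because the target comes out of a vtable at run time.

```cpp
struct Renderer {
    virtual void draw() = 0;   // target unknown until run time
    virtual ~Renderer() = default;
};

// Unless the compiler can devirtualize it, this compiles to an indirect
// call through the vtable (something like `call *(%rax)` on x86-64),
// whose target the CPU has to guess via the branch target buffer.
void render(Renderer& r) {
    r.draw();
}
```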
Starting point is 00:32:05 So the branch predictor, as well as predicting whether a branch is taken or not, which is what we think of when we think of branch predictors, has a sort of separate aspect called the branch target buffer, which is used even for unconditional branches. It's like: I know that there is a branch coming up; where does it go to? So even before we've decoded the instruction (this is a hilarious thing) the branch predictor is like three pipeline stages ahead of the decoders on, like, an Intel. So even before it's finished fetching the memory, reading it, and noticing that there's a branch instruction
Starting point is 00:32:32 inside the bytes that it read from memory, the branch predictor has already worked out: yeah, go there. Just for everyone to be clear, when you say an unconditional branch, you mean like a jump or a function call? Exactly, a call or a jump. So, because the front end of the processor wants to be fetching the bytes of the instructions that it thinks are going to be executed as early as possible, it even tries to
Starting point is 00:32:56 guess if there is a branch at some arbitrary location. You can think of it as like a std::map of void* instruction address to 'where I think it's going to go', and that's kept completely independently of everything else. And indirect branches fall into this category too. So you call memcpy, and it goes to the out-of-line version of memcpy, and it's going to go through the libc thunk to memcpy, which does an indirect jump. Memcpy is probably a bad example; it's probably, well, up to us. But, you know, some system-call-type thing. And in there is going to be an indirect jump. And your CPU has said, hey, you've called this function quite a lot, so I know that jumping to here effectively just means I go over to the implementation of memcpy, and everyone's happy. The problem is, people have
Starting point is 00:33:38 worked out now how the branch target buffer works, and they've realized that they can poison it by doing indirect jumps elsewhere that just happen to land in the same effective, like, cache line of the branch target buffer, to mistrain the branch predictor to say: hey, if you jump to this particular address, then actually this is where you're going to go; you're going to go after this area of memory. And if that area of memory has a useful sequence of instructions, one that has, like, a useful speculative side effect, then you can train the branch predictor to speculate that sequence of instructions whenever you call one of those functions. And of course, if you can then train the branch predictor, for a piece of memory in the
Starting point is 00:34:27 kernel, to jump to another piece of memory in the kernel that has some useful side effects for you, you can start looking at what's going on inside the kernel from your own code. So the trick here is: mis-teach it that, hey, when you call memcpy, you go to this address; then get memcpy called from, like, a kernel context, and then observe what happened. Again, everything resolves correctly. The branch predictor kind of eventually goes, whoa, I went the wrong way; reverse and carry on. But by then the damage has been done. And so this affects not just Intel. That's correct. So that affects pretty much every out-of-order
Starting point is 00:35:05 or speculating processor out there, which is almost everything that's been made in the last 20 years. I'm seeing some reports that the Raspberry Pis are not susceptible to this because they are strictly in order. Yes. But, you know, it's pretty scary. And obviously it has these side effects of compilers having to implement workarounds for the retpoline thing.
Starting point is 00:35:30 So I guess we should talk about what that is actually doing. So the retpoline is a replacement for any indirect jump. And it uses a call instruction, followed by manipulating, on the stack, the return address of that call, followed by a return. So, effectively: you want to jump to (sorry) the contents of address 1234, which goes off to, say, 456. Instead of just doing a call indirect through 1234, you do a call to some other label. Okay, call foo. At foo, you say: hey, smash the stack; replace the return address on the stack with the contents of 1234. So now, instead of the return address being back to where I came from for my call, the return address is now pointing to the indirect function I wanted to get to,
Starting point is 00:36:26 and then you do a return. And the return does the indirect jump. And the reason this is cool is because the processor is not smart enough to predict that. And so if it's doing a speculation, it predicts that it goes back to where it came from, which of course is not where it's going now. And so what you'd make sure is
Starting point is 00:36:44 that after your call instruction, your original call instruction, you just have an infinite loop that jumps to itself, so that any indirect call, if speculated incorrectly, will speculate into an infinite loop, which has no side effects that anyone cares about and can't be controlled by the outside world.
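For the curious, the construct Matt just walked through looks roughly like this. A sketch of a retpoline thunk replacing `jmp *%rax`, written as a file-scope asm block (GCC/Clang, x86-64, GNU assembler syntax); it illustrates the pattern, not the exact code the compilers emit:

```cpp
asm(R"(
    .text
    .globl retpoline_jmp_rax
retpoline_jmp_rax:
    call  1f            # push a return address pointing at the trap below
2:  pause               # mispredicted 'ret' speculates here: a harmless
    lfence              #   infinite loop with no observable side effects
    jmp   2b
1:  mov   %rax, (%rsp)  # smash the stack: return address := real target
    ret                 # 'return' performs the indirect jump we wanted
)");
```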
Starting point is 00:37:25 do this therefore it works to mitigate the effects of specter i accept that however if we make this a common idiom in our code it seems like the kind of thing that cpu vendors are going to start to optimize for absolutely but they're they're well aware now of the the problems here so the other mitigations that are coming out now are actually from the cpu vendors themselves so obviously there are there's there's this replene thing can be used in browsers it can be used in the kernel to try to like reduce the ability to um use indirect jumps to your advantage using spectre um the other problem of mispredicting the in um uh mispredicting indirect jumps by forcing branch table buffer like um uh collisions can only be really mitigated by the cpu vendors themselves and so they have
Starting point is 00:38:15 issued microcode patches which allow the kernel to flush those tables or to not trust them so as you go in and out of kernel mode various things happen like um either the kernel decides to completely flush the branch table buffer branch target buffer or the branch target buffer is somehow tagged with this came from user mode versus this came from kernel mode therefore the speculation system is not allowed to speculate i'm a bit vague on this and intel are also a bit vague and we're only doing this from reading the the kernel patch notes but there's a whole bunch of interesting new um model specific registers that have been put in that allow these kind of features to go in and out which is remarkable for two reasons one um that they've had to do this and two that they are able
Starting point is 00:39:01 to do this. It's amazing the amount of changes they can make to your processor just with a microcode update. That's slightly disturbing. Yeah. Who knows what else it could be doing there. Yeah. I wanted to interrupt this discussion for just a moment to bring you a word from our sponsors. Backtrace is a debugging platform that improves software quality,
Starting point is 00:39:22 reliability, and support by bringing deep introspection and automation throughout the software error lifecycle. Spend less time debugging and reduce your mean time to resolution by using the first and only platform to combine symbolic debugging, error aggregation, and state analysis. At the time of error, Backtrace jumps into action, capturing detailed dumps of application and environmental state. Backtrace then performs automated analysis on process memory and executable code to classify errors and highlight important signals such as heap corruption, malware, and much more. This data is aggregated and archived in a centralized object store, providing your team a single system to investigate errors across your environments. Join industry leaders like Fastly, Message Systems, and AppNexus that use Backtrace to modernize their debugging infrastructure.
Starting point is 00:40:06 It's free to try, minutes to set up, fully featured with no commitment necessary. Check them out at backtrace.io. So, as C++ programmers, who really needs to think about this in detail? Like, do you only need to care if you're a browser developer or if you're a, you know, operating system developer? I mean, to the extent that the performance affects us all,
Starting point is 00:40:33 I think it's useful for us to have some, at least hand-waving, understanding of what these things are about. But the performance affects anyone who's writing in, you know, Node.js or whatever; like, everyone's noticing some slowdowns. I think
Starting point is 00:40:45 the kind of people who need to know about this from the nuts and bolts layer level uh are probably yes browser vendors people writing kernel code be it modules or operating systems themselves or anything that has a sandbox so one of the attack vectors here for example was there's something called the ebpf the extended barclay packet filter system inside linux which is originally used to filter packets and um has its own like mini language and that mini language gets jit compiled into the kernel and you know fun and games begin once you can jit and code into the into the um uh into the kernel um so it does affect us all to some extent um but i think yeah the people who really have to worry about it are the people who are already worrying about it um those those folks
Starting point is 00:41:31 that google those folks at amazon and and intel and the other spots that are looking into this actively i mean it's it's amazing though i mean this the fact that this things are coming out of the woodwork now about this i i think i think actually you tweeted about this jason the xbox 360 bug where like again a misprediction caused some strange side effects in the cache which actually ended up causing issue where um like some some uh prefection a non-temporal prefect was effectively poisoning cache state in a way that was bad and they kind of put it behind a flag like if if this thing is not enabled then don't do it but of course the speculation would sometimes go wrong and it would do it anyway and then roll it back and of course it would yeah it's just it's a scary world out there but it's also super exciting i mean if anything this hopefully
Starting point is 00:42:15 will cause people to go wow i had no idea that my chip was doing all this stuff behind the scenes um and you know the more people that understand what's going on to me more people that have uh uh whose interest is piqued by the these things going on the better as far as i'm concerned i think this is the most exciting thing about what we do and certainly for c++ programs we're that much closer to this kind of stuff i think it's important to know yeah that article that you just mentioned uh i think went a long way to helping me understand what was going on here. Because this was effectively a broken instruction, like you could not safely use this instruction because it would corrupt your cache. And they had to go to the lengths of making sure that that instruction was not in their binary at all. Right. I mean, for the longest time, Intel have, you know, you start looking back at old pieces of advice that Intel have been giving.
Starting point is 00:43:06 And one of the things they said is like, you know, even after an indirect branch or after a switch statement, you know, is never taken. Don't just let the program, if you're emitting code, just fall off the end. Put a bunch of undefined instructions like UD2, which is a well, the well-defined undefined instruction. It's like trying to be the undefined instruction it's the trap instruction effectively um put enough of those to like fill at least the rest of the cache line if not more just because who knows what would happen if the processor decided for whatever reason to fall off the end of their mispredict or whatever it might interpret those bytes as being any number of things which might have some strange side effects um and again you know we should have we should have really raised the red flag at that point go well what
Starting point is 00:43:48 side effects can it possibly have if it never actually completed execution you know if these things were speculatively executed we're protected right apparently not right right so uh i keep So I keep ending up in these conversations, Matt, about how we, you know, TBAA, strict aliasing rules, they break our optimizations. We can't enable strict aliasing optimizations, like talking about F no strict aliasing, stuff like that, because it's um unintuitive to c++ programmers and we are uh and and their optimizations that break valid code right effectively right and i uh just curious if i could get some feedback from you because you're kind of known um for you know caring about performance like you've been talking about you care about how the cpu works absolutely yeah so this is something that frustrates me when i see the kind of myth that's grown around this i think that's like oh if i put o3 on it breaks my code or if i yeah i or no i just have to turn off this f no strict
Starting point is 00:44:56 aliasing and kind of stuff it's like it's it's if your code breaks because of that your code was broken before in my opinion which is obviously strong um i compile all my code with o3 and with strict aliasing on so let's talk a little bit about what that actually means so the c++ standard talks about the kinds of pointers that can alias and what is aliasing alias means the compiler has to assume if two pointers alias or may alias, it has to assume that they might be pointing at the same underlying object. And that prohibits a whole class of optimizations. So the canonical case is something where you're taking two arrays of numbers and multiplying them in and adding and multiplying them together one by one and writing the results into yet a third array um all the vector instructions and all those kinds of things want to be able to assume that you're not going to be modifying one array like one of the source arrays by writing results into it as you go and this is a bit like
Starting point is 00:45:54 mem move versus mem copy you know like if the if the ranges don't overlap those of course things can optimizations can happen but if they do overlap you have to be a lot more careful so in general the compiler wants to be able to assume that if you have two pointers to things that if it can possibly prove that they can't point at the same actual memory then that's that's all the better now what tbaa is is type base alias analysis as i think clang clang calls it. Presumably other compilers do. It's where the C++ standard has said, this is the set of things that may be assumed to not alias. If these two types are not in any way related,
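As a quick sketch of that canonical aliasing case from a moment ago (illustrative names; this shows the shape of the problem, not code from the show):

```cpp
// If 'out' might alias 'a' or 'b', every store to out[i] could change a
// later a[j] or b[j], so the compiler must be conservative about
// reordering and vectorizing this loop. Aliasing rules (and type-based
// alias analysis, when the element types differ) are what let it assume
// the ranges are independent.
void multiply_add(float* out, const float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * b[i] + out[i];
}
```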
Starting point is 00:46:33 if you've got a foo and a bar and they're not inherited from each other and they're completely separable, you cannot take a foo, cast it to be a bar, and expect things to work well for you. Okay. But unfortunately unfortunately that's the kind of thing that we've learned from rc programming days um where uh like the canonical example that i remember from my games programming days is back in the day when um floating point units were slow
Starting point is 00:46:58 is that to test whether or not a floating point number was negative or not you would look at the bit pattern by casting it to be an imp pointer and then then reading it back out and say is the top bit set knowing that that's where the sign bit was um that kind of thing was was uh pretty much prolific in code you know you just cast things backwards and forwards and compilers back in those days when i was in the industry weren't smart enough to do anything about it anyway, so everything just worked. And I think we've kind of grew up and assumed that that's the way you have to write code in order for it to be performant, or it's just allowed to do it. And so that's why it's a lot of a surprise, I think, to people who have come from that mindset that you're not really allowed to do that kind of thing. And then there have been various workarounds using unions unions which doesn't work so don't do that um you know that and yes it's i think we're probably as
Starting point is 00:47:50 as c++ teachers as we all are we could probably do a better job of explaining what the rules really are um and in fact i think the standards committee themselves are still a bit vague on some of the more subtle things certainly i've been chatting with with people about some of the wording being confusing to me the canonical way around this seems to be to use mem copy so to go back to my example between like an internet float and you want to like get the bit representation of an int into a float is that you mem copy from the float into a new int on the stack do the check there and the compilers are smart enough to optimize away the mem copy and you haven't broken any violet you haven't violated any of the tbaa constraints um and then there are other get outs for char arrays or stood byte arrays where
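As a sketch, the sign-bit test done that sanctioned way (GCC and Clang do indeed reduce this to a couple of instructions at -O1 and above; C++20 later standardized the same idea as std::bit_cast):

```cpp
#include <cstring>
#include <cstdint>

// Test a float's sign by inspecting its bit pattern without violating
// strict aliasing: copy the bytes into an integer and check the top bit.
bool is_negative(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t));
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return (bits >> 31) != 0;  // IEEE-754 sign bit is the top bit
}
```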
Starting point is 00:48:39 you may take a stood an array of bytes and then one-off interpret them as another type of structure. So that gets around, as this is my understanding, and again, this is perhaps something to do with the lack of clarity about how this stuff all fits together, is that the common idioms of getting a char buffer, reading from a file or reading from the network, and then casting that buffer to be the foo star that you know is in that, that's okay. But you can't then cast it to be a bar star and expect it to work immediately afterwards. You have to make sure that foo star falls out of scope
Starting point is 00:49:13 and you get a new object to point at it. Otherwise, you're into the whole Stoodlaunder world, which we don't want to talk about right now, I'm sure. So in my experience, I have not found any performance problems with either doing it right by having a char array or a stood byte array and doing the one-off cast to the right type and then doing your pointer gymnastics afterwards or else in the very few cases where i have had to directly type pun using mem copy to copy from the bit pattern of the old thing to the bit to the
Starting point is 00:49:48 bit pattern of the new thing and then use the new thing the compiler will throw all of that away the only argument against the mem copy thing that i've seen has been from the embedded world where oftentimes they have to run debug images for a variety of complicated and not worth going into right now reasons on their hardware but their hardware doesn't get faster in debug um it doesn't have more memory in debug and in debug mode um with no optimization on the mem copy is not taken away and that can have some like deleterious effects for them but i mean they have already they already have a whole bunch of problems there so i wouldn't unless you're in that world, I wouldn't necessarily worry about that. And we don't really want to get into Stoodlaunder,
Starting point is 00:50:29 but would Stoodlaunder help there? I am not qualified to answer that question. I'm not even sure that anybody is right now. That was officially approved for C++17, right? I think so, yeah. Yeah, I think that's C++17, not c++ i'm vague on it and i mean i did it's it's not something i plan on using uh i guess i i have a uh we have an internal library for variant and i think that'd probably be the only place where we'd have to use it but i'm not
Starting point is 00:50:57 yeah it's really complicated and involve constance and other aspects that yeah i don't like so could you give like a general um i guess when does this come up for the average c++ user or does it come up for the average c++ user so i think probably it comes up most when you are reading and writing files trying to use like structures um to sort of map over those files rather than like parsing it byte by byte um which is common in um the games world it's common in um networking code and i think even then most of the things you're already doing are probably fine if you've just got your big char array and you read into it and then cast it i think you're okay i think i've yeah go on uh all of the code that i've basically ever written in c++ has been cross-platform and i've had to worry
Starting point is 00:51:54 about big indian little indian multiple cpu architectures and i so i've never actually written code like that because i couldn't assume that would work on the next target that i was going to be right on right and i think that probably is what affects me, for example. We still have to deal with endianness and we just have a template class that knows which endianness it needs to be. And that's the thing we map over, but that gets complicated. But I remember actually vividly, Jason, when you came in to present at our company, us having that conversation about reinterpret cast, you looked at me like, I don't think I've ever used reinterpret cast. I'm like, oh my god, if I get grep reinterpret cast in my code base, you would have a heart attack.
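For a flavor of the kind of wrapper Matt mentions (purely hypothetical, not his firm's actual class): a value stored big-endian in memory regardless of the host, converted on access, which is what makes it safe to map directly over wire or file data.

```cpp
#include <cstddef>

// Sketch: an unsigned integer held in big-endian byte order. Reading and
// writing go byte by byte, so it works on hosts of either endianness.
template <typename T>  // T: an unsigned integer type
class BigEndian {
    unsigned char bytes_[sizeof(T)];
public:
    BigEndian& operator=(T value) {
        for (std::size_t i = 0; i < sizeof(T); ++i)
            bytes_[sizeof(T) - 1 - i] = static_cast<unsigned char>(value >> (8 * i));
        return *this;
    }
    operator T() const {
        T value = 0;
        for (std::size_t i = 0; i < sizeof(T); ++i)
            value = static_cast<T>((value << 8) | bytes_[i]);
        return value;
    }
};
```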
Starting point is 00:52:34 And that remains true. I mean, it is one of the things where when performance counts, one of the best things about C and C++ is that having a well-defined binary layout to structures allows you to very easily map over things that you've read in and be that shared memory or be that something you read from a file or something you read from a network but you know we're already starting to go off the reservation when you're talking about shared memory and things like that because again the C++ model has no idea that that a piece of memory that you're talking to might change but not even because of the code that you're running on this process but because some network cards dma'd into it so you know that's already into into a
Starting point is 00:53:14 we're in a weird world there um and i'm not even sure now i say that out loud how strictly standards compliant our code is around exactly that kind of area. So, yeah, there I'm relying on the compiler not really being smart enough to realize that that's happening. Okay. Well, Matt, it's been great having you on the show today. Thank you so much for coming on and giving us your understanding of these new issues we have to worry about as programmers.
Starting point is 00:53:43 Well, thanks for having me. I've had a great time as before, and I hope that somewhere in the middle of all that talking there is a little bit of a glimmer of understanding of what Spectre and Meltdown are, and hopefully some interest in your listeners to go and investigate more about what the crazy things your processors are doing.
Starting point is 00:54:01 I'm sure. Okay, thank you, Matt. Thanks. Thanks. Thanks so much for listening in as we chat about C++. I'd love to hear what you think of the podcast. Please let me know if we're discussing the stuff you're interested in, or if you have a suggestion for a topic, I'd love to hear about that too. You can email all your thoughts to feedback at cppcast.com. I'd also appreciate if you like CppCast on Facebook and follow CppCast on Twitter. You can also follow me at Rob W. Irving and Jason at Leftkiss on Twitter. And of course,
Starting point is 00:54:32 you can find all that info and the show notes on the podcast website at cppcast.com. Theme music for this episode is provided by podcastthemes.com.
