Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Profiling
Episode Date: May 24, 2023...
Transcript
Okay, hi everyone, let's get started.
So today we're going to, well I'm going to speak a bit about profiling.
Microphone's on, yes, okay good.
So this is going to be split in two parts basically.
I'll give a short introduction to some of the background behind profiling,
some of the terminology, just a bit of light background.
And then the second half is going to be more of a hands-on part.
I just sent around a notification in Moodle.
If you want to follow along, that's how you can do it.
The slides are in Moodle, and they contain all the commands
that I'm running and everything and some screenshots.
So this should hopefully be OK for you to follow along.
If you'd rather pay attention to what I'm doing
and not follow along, then this is going to be recorded.
So if you're not here and watching this as a video,
feel free just to run the commands
and pause as you please.
Yeah, I'll just get started.
And let's see how this goes.
So just a brief overview again of where we are.
So yesterday, we had the data structure lecture.
Today, we're going to look a bit at profiling.
So this is still mainly in the CPU and DRAM domain.
So also the profiling we're going to do is going to be in that domain.
There's some other things that you can do to profile GPUs and network and disk and that stuff.
But we're not going to cover that today.
Just basic timeline.
I think this is still correct.
So we're here right now.
Next Tuesday we'll have the introduction to the second task and the discussion of the solution of the first task.
So reminder, Monday evening is the deadline for task number one.
And then yeah, this is still the tentative stuff for the second part of the course.
So, yes, in this lecture, again, we're going to have two parts.
In the first part, I'm going to basically just share some concepts behind profiling and also some tools that I use.
So this is by no means an extensive list of everything that exists out there.
This is some of the setup that I've been using or we've been using at the chair that I'm a bit familiar with.
So again, this is mainly based in the Unix environment because this is what I work with.
So this is going to be a Linux server and I'm briefly going to just show something on Mac OS.
But the focus is going to be on Linux server today.
And this has a strong focus on C++,
just because I don't have that much experience
in profiling other languages.
But I guess most of this also applies to other things.
So the tools, for example, perf that we'll be showing,
you can use this with anything.
So this also works with Rust.
I guess if you work with JVM or Java,
there's going to be some different profiling tools there.
If you have any questions regarding that, you can also ask me.
I have a bit of experience in that, but I'm not going to cover this here in the lecture
because for the lecture, we're mainly programming C++.
So yeah, if you're watching this at home, feel free just to watch the video,
pause, run the commands, see what happens, play around with this a bit.
Just to start off basically by saying,
everything I'm showing you today
is just gonna be a very light introduction to the topic,
just to show you some of the tools
or the main tools I've been using for this.
If you're a perf expert already,
you're probably not gonna learn that much today.
But in general, this is,
using these kind of tools
is also just a skill that you need to practice.
So for most cases, you've probably not run into this
because it's not been super relevant.
But maybe down the line, it'll be relevant to see
where performance is being spent or where you're
losing performance, or whether things are doing what you expect them to do, and so on and so forth.
So maybe just kind of run the commands, see what they do,
get a feeling for what you can do with them,
because especially perf is an incredibly powerful tool
and we're only going to scratch the surface of what it does
because you could probably do an entire semester of perf lectures to show what this stuff can do.
Okay, so profiling.
Let's start with the mandatory kind of random definition here.
I picked this off Wikipedia just because I think it's quite nice.
So basically, if we look at profiling,
we want to analyze something about the program.
Key part here is that it's a dynamic analysis,
so it's not something we do ahead of time,
we run the program or an application, and we kind of do something to inspect what's happening
and usually we're interested in something like the space or the time complexity: so how long is it running, how much memory does it use, which instructions are used, how often are they used, or maybe how often are certain functions called, how long the functions take, and these kind of things. So it's a very, very wide spectrum of what we can inspect in the program.
And then depending on what your goal is,
you might want to use different tools or different approaches
to cover certain things.
So yeah, basically, why do we profile?
Why do we want to do this?
And there's actually two reasons to do this.
So first of all is to understand performance.
So for example, here when we do some research,
we're writing a paper, it's really important for us
to be able to explain why our approach, for example,
is better than another one.
And for this, it's really important to profile
and to see, OK, we're spending maybe less time here,
more time there.
And this is not necessary to optimize it,
but just to understand why is something behaving
like it's behaving.
Or if you're just running some benchmarks,
it's important to know why am I only achieving this performance?
Because I'm spending a lot of time in this function.
It's very slow, so I can't be faster than that.
And also, we might want to profile to find bugs,
or something like a memory leak, or something like that.
I'm profiling.
I'm seeing my memory is increasing, increasing,
increasing, so maybe I found a memory leak. We can also do this through profiling.
And then the second step, after I've understood performance,
then it can be optimizing or tuning your application.
So let's see, I found a method and it's very slow.
Then I'll spend some time on optimizing this.
And this is something we'll kind of go through in the example later.
We're trying to see, okay, where are we spending time?
Or what are we seeing that's unexpected?
And then how can we apply a few basic fixes to do this?
And the optimization can be automatic or manual.
So often it will be manual.
You'll run some commands,
you'll see where time is being spent,
and then you'll optimize your code accordingly.
But this can also be done automatically.
So basically, if you think of the JVM, for example,
this will profile your code while it's running and then
decide based on some characteristics
if it wants to JIT a function or not.
So this can also be hidden from you.
But it's basically the same thing.
And there are mainly two types of profiling tools: instrumenting profilers and sampling profilers.
And I'll get into both of those in a second.
OK, so the first one is tracing or instrumentation.
And they're kind of intertwined a bit.
So the terminology is not always 100% strict or clear.
So I'm just trying to give an overview of what they generally
mean and how they stand to each other.
In general with tracing, we mean that we want to observe the execution of an application
via traces.
Traces are basically just logs.
The main difference is if I do logging or normal logging, so most of you might have
written an application where I want to log something. Then this is often some semantic event.
So let's say user logged in successfully or there was an error that returned 500 codes, so I'm going to log this.
And this is often like an important semantic event.
While traces are usually used more to capture the flow of something.
So if I want to really dig into something in more detail, I want to see, OK, so first this method executed,
then this method executed, and this one.
And maybe I have a request ID in all of these logs
so I can kind of see how this went through the program.
This is usually a lot more detailed than general logging.
And if you've used logging frameworks,
you'll often see the log levels you have.
Like you have the fatal and error at the top,
and you have warning info debug,
and usually under or below debug you have tracing,
which usually means this is something
that is gonna be very, it's gonna be expensive to do
because I'm logging this all the time,
but it gives you the most details,
the most fine-grained information that you can get.
And then instrumentation basically is just the process of adding traces.
Again, kind of un-technical, but we modify the code in some way
so that we can observe or monitor this behavior that we're seeing.
So yeah, we're basically just adding traces.
And then there's two main ways to do this.
First one is source instrumentation.
So this basically just means we're going to modify the application code.
Most of you have probably done this in some shape, form, or whatever.
It's just like I can add a timer and say, okay, start, end.
This is basically tracing already.
And this is a lot more common than binary instrumentation for users because it's a lot easier to do.
You're most likely not going to modify your code on a binary level.
But there is also binary instrumentation, and this basically actually really modifies the binary code.
And again, this is used, for example, for jitting in VMs,
or there are certain profiling tools like Valgrind that actually do this for you.
But yeah, most tools do this.
If you do it manually, it's going to be usually source instrumentation, and then existing
profiling tools like perf, for example, also do binary instrumentation.
OK.
So let's say we have this very, very basic method,
and we wanted to analyze some performance characteristics
of it.
Let's say we wanted to see how long it
takes to run the bar function in this context,
and we wanted to understand how many times we call Baz.
Then we can do something very simple,
and we can just say, OK, in the foo method,
let's add a start timer.
Let's run the calls.
We do an end timer.
And then we track somewhere along the way
this took this much time.
And then for Baz, we can say, oh, OK, I
have a counter somewhere.
Just add plus 1, because then I know that this was executed.
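For reference, a minimal sketch of what this kind of source instrumentation could look like; the function bodies and the bookkeeping variables here are made up for illustration, not the exact code on the slide:

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

// Hypothetical bookkeeping for the measurements.
static int64_t bar_total_ns = 0;
static int64_t baz_calls = 0;

int baz(int x) {
  ++baz_calls;     // instrumentation: count every invocation
  return x / 10;   // original work: a single division
}

int bar(int x) { return baz(x) + 1; }  // assumed body, just for illustration

int foo(int x) {
  auto start = std::chrono::steady_clock::now();  // start timer
  int result = bar(x);                            // the call we want to time
  auto end = std::chrono::steady_clock::now();    // end timer
  bar_total_ns +=
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
  return result;
}

int main() {
  int sum = 0;
  for (int i = 0; i < 1000; ++i) sum += foo(i);
  std::cout << "sum=" << sum << " bar_total_ns=" << bar_total_ns
            << " baz_calls=" << baz_calls << "\n";
}
```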
I just prepared something.
Actually, I'll do this first and then.
Okay, so the big, big advantage of this
is that it gives you really full control of what you're doing.
If you're interested in something, I can just say run this.
If I'm not interested, I can say don't run this.
If I'm not interested in the baz counter anymore, I can just remove it.
But if I'm interested in even more detail,
then I can add even more details here.
So this really gives you full control.
But generally, this has a very high,
or can have a very high overhead,
depending on the granularity that you use.
So if we look at the baz function, for example,
it just does a division by 10.
And now suddenly, we have some map access that we need to do.
We need to increment something.
So this will suddenly make the code a lot bigger, which I'll show you in a second how much this changes stuff.
There's an overhead of timer calls.
I'll get to that a bit later.
And this may even impact compiler optimizations.
So let's say we have a very small function and the compiler decides to inline this.
Maybe we've modified it now in a way that the compiler decides,
oh no, it's too big now, I'm not going to inline this.
So then suddenly you're changing the characteristics of the application.
And just to show you a very small example of,
okay, this is, how can I get this to go across?
Okay, we can see now, so we have this, where's my mouse?
Okay, so we have this small code example here, which is the same, very, very basic.
And now we have the code here at the bottom that adds the timers, which is just a clock, and we add this here.
So this is the left of the code on the top.
We can see we have the division, which
is converted to some fancy multiplication here.
And we can see this is all inlined here.
And on the right, which is very hard for me to read now,
we can see that there's suddenly within Baz,
instead of just doing the division,
there's suddenly all these move and call instructions.
We need to get something from the map.
And then we call another something to delete something.
And suddenly, this has made our code a lot more complex.
So this obviously heavily depends
on how you implement it, just to kind of showcase that.
I think everything is still inlined here, actually.
But you can see there's just a lot more going on
because we have different components flying around.
So you just need to be aware that this might be expensive.
How do I get back to my PowerPoint?
OK, perfect.
Yes.
OK, so there are some pitfalls that you can run into
when you're doing this.
I'm just going to mention a few very briefly.
So this is generally a problem if I want to do very, very short measurements.
If something's running for five minutes and I add a timer before and afterwards,
it doesn't really matter because, well, this is going to be very quick.
But especially if you're interested in very, very tight loops.
So, for example, in the first task, the loops we're working on, the instructions, there's maybe only a few dozen instructions.
And modifying this can and will have an impact on performance.
So some things that people kind of tend to ignore or may ignore is that the compiler and the CPU can reorder instructions. So if I'm interested, for example, just in the duration of one single instruction,
and then maybe the compiler decides to reorder something,
then I'm not actually measuring anything meaningful.
So this means that the timers can start and end
in the wrong place, and you kind of need to pay attention
to how this is done.
If you go down this rabbit hole and you're really
interested in very, very, very accurate performance
measurements of cycles, then at some point
you'll probably read some blog post or Stack Overflow answer
that says, use this read timestamp counter instruction
on Intel, for example.
But that's incredibly hard to get right.
So this is basically a hardware register
that tracks the number of cycles that are processed.
So if I'm really interested in how many cycles have passed,
I can just read this.
But it's not consistent with regard to the clock frequency.
So if on different cores I have different frequencies,
then all of this doesn't really work.
And yeah, so it's not synced between cores.
So for example, if my application switches from one core to another,
then kind of this all breaks.
And again, this is a non-serializing instruction,
so it can be reordered and very likely will be reordered.
So if you're really looking at a very, very, very low level of performance measurements,
then this can actually go wrong and you're just measuring something incorrect. And then there's one
thing that again applies mainly to very short measurements. For example in C++
there's something called high resolution clock which people like to use because
it says it's high resolution, but this is actually by the standard not
guaranteed to be monotonic.
So while this might work, it technically doesn't have to work.
And by not monotonic it means technically if the system clock adjusts,
it can go back in time and then weird things happen.
So also like if you check CPP reference or something,
it says it's recommended to use steady clock if you want to check durations
and use system clock if you're actually interested in wall time that has passed. So yeah, there's a few things here, and obviously, again, if you're looking at very, very short measurements, then getting the time actually has a cost. So it really depends on the system and what you're running on, but getting the timestamp now will cost you maybe 10 to 30, 40 nanoseconds.
It really depends on what you're running on.
Sometimes it's very fast, sometimes it's a bit slower.
And if you're looking at measurements at this frequency or this granularity, then adding a timer on every instruction will probably really, really mess up your measurements.
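As a small, made-up illustration of that clock recommendation: steady_clock for measuring a duration, system_clock when you actually want the wall-clock time.

```cpp
#include <chrono>
#include <ctime>
#include <iostream>

int main() {
  // steady_clock is guaranteed monotonic: safe for measuring durations.
  auto t0 = std::chrono::steady_clock::now();
  volatile long sink = 0;
  for (long i = 0; i < 1'000'000; ++i) sink += i;  // some work to time
  auto t1 = std::chrono::steady_clock::now();
  std::cout << "duration: "
            << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
            << " us\n";

  // system_clock follows the wall clock (and may jump when it is adjusted):
  // use it when you want an actual point in time, not a duration.
  auto now = std::chrono::system_clock::now();
  std::time_t wall = std::chrono::system_clock::to_time_t(now);
  std::cout << "wall time: " << std::ctime(&wall);
}
```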
Yes?
One quick question regarding all the TSC.
So it does count cycles, right?
It would still tell you, like, let's say two cores are at different speeds, and your code consumes this many cycles. It would consume an equal amount of cycles on each core, just the quicker core would be quicker to process it.
So it's still useful information.
Yeah, so this is definitely useful.
The question is just what do you want to do with this?
So often, so if you're purely interested in cycles,
then this is probably okay and probably good to do.
But if you want to convert this into time, for example,
then that doesn't work.
But if you're purely interested in cycles,
yes, this probably works.
If you also consider that it's non-serializing and that you need to do certain things to guarantee that it will be executed at the right point in time with regards to stores and loads and whatnot.
There's a lot of Stack Overflow answers on this and a lot of discussion on how to do
this and how to do this correctly.
And most people say try to avoid it if you don't really know what you're doing because
this is probably one of these things that you run into, and you think, ah, it looks like I'm doing the right thing, but maybe I'm doing the wrong thing. So yeah, but in general, just cycles, I think this works. Okay, but I've kind of stayed away from it, also because a lot of people say it's dangerous.
Yes, like I just mentioned, tracing everything can be very expensive.
So if we just go quickly back to this example, I'm doing this every single time.
We've just seen this is very short.
These measurements are very short.
The instructions that I execute are very few.
So doing this on every iteration is actually very expensive.
So instead, let's just sample this.
So let's say we'll do this every 10th instruction,
every 1,000th instruction, or maybe do this every 10 milliseconds,
every 100 milliseconds, or something like that.
So what we can do is we can just add, again, manually,
we can just say, well, if there's some counter somewhere,
and that is now a multiple of 1,000, then we'll track this.
So we sample every 1,000 events.
And so obviously, this is less accurate
than the full instrumentation, because we still
have the modification of the source code,
so we're kind of still messing with what the compiler will do.
And we're not tracking every event, but only a sample of it.
But generally, this has a lot less overhead during runtime,
because we're not doing the expensive calls every time, but only occasionally, gated by a check that's fairly cheap to compute.
So here, probably also the modulo: if you really wanted to do it at this level, doing something modulo 1,000 is probably a bad idea because it still involves multiplication and everything. If you do something that's a power of two, then this will just be a shift, and it's a lot quicker to check, and these kind of things. So again, if you're really looking at low-level or high-performance measurements, then there's a lot of things to pay attention to. And yeah, so one main downside, let's say, is that this still requires manual code modification, so writing this in all of your code is going to be very, very annoying. So if you're generally just saying, oh, where am I spending time?
What's going on?
Then I don't want to write this for every single function.
Sure, if I only have five functions I'm interested in, then this is OK.
If I have a code base with a million lines, then this is probably going to be very messy,
especially if we think of something like I need to keep track of some counters somewhere
and these kind of things.
That gets very messy very quickly.
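Just to sketch the sampling idea with a power-of-two interval (names and numbers made up):

```cpp
#include <cstdint>
#include <iostream>

static uint64_t event_count = 0;
static uint64_t sample_count = 0;

// Sample every 1024th event: 1024 is a power of two, so the check is a
// cheap bitwise AND instead of a modulo/division.
constexpr uint64_t kSampleMask = 1024 - 1;

inline void on_event() {
  if ((++event_count & kSampleMask) == 0) {
    // Only every 1024th call pays for the bookkeeping; in real
    // instrumentation you would grab a timestamp or counters here.
    ++sample_count;
  }
}

int main() {
  for (int i = 0; i < (1 << 20); ++i) on_event();
  std::cout << "events: " << event_count
            << ", samples taken: " << sample_count << "\n";
}
```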
OK, so let's kind of switch now from the source modification that we did.
And let's look at something that hardware provides for us
that we can do to inspect some of the code we're running.
And the first thing we kind of look at there
is hardware performance counters.
So these are special registers to track hardware events.
So this also means they run in hardware,
and this also means they're very fast.
And you don't really need to care about them.
Because for example, if you have an Intel chip or power chip,
I think basically all of them do this.
And every clock cycle, some counters
are incremented somewhere in hardware registers
that at some point I can read.
So there is a small cost in reading these registers,
but they're constantly tracked, and there's
no cost for the system to really do this.
So again, on Intel, for example, there's this read performance
monitor counter instruction, which was introduced,
I don't know, probably 10, 15 years ago.
It's been there for a while, to extract information
from these counters.
And Skylake, so this is now, I don't know, what, seven-ish
years old?
I don't know.
Not super important.
Skylake has four programmable registers, or eight if you disable hyperthreading, because then you'd have two logical cores that share this. And now Ice Lake, for example, has eight programmable registers per core, and they have a wider range, but now they have a bit more restraints or restrictions on which instructions or which events I can track in there. And again, this is very hard to generalize, because it really depends on the system you're running and the CPU generation you're running, which kind of events are supported and which aren't.
But generally, this will be something like L1 misses, number of cycles, branch mispredictions. So very low-level metrics that you want to extract that are basically impossible to track otherwise with this level of detail.
And I said there's four or eight registers that I can just program and say, please, in
this register, track this counter.
In this register, please track this counter.
Let's say I wanted to look at 10 counters.
Then I don't have enough registers.
So what most tools will do, or systems will do,
is to multiplex this.
So for one cycle, for one iteration in whatever time period I have,
I'll put this value in the register,
and once that's done, I'll take the other value and put that in the register,
and then I can kind of time multiplex and scale the values.
This is obviously a little less accurate, but this is how most tools will do this.
And for the majority of things you're looking at, this is probably also fine.
Okay, so like I said, we're looking at hardware events, and we want them to be very precise, because we're actually interested in very low-level performance metrics.
So basically, all major CPU vendors
now have something that lets us do
some form of precise event-based sampling.
So on Intel, this is called PEBS. On AMD, it's called instruction-based sampling. On ARM, it's kind of called CoreSight. It's not 100% the same, but it kind of covers the same topics.
And the main idea behind this is that I can set a register,
and I can say when this register overflows.
So I'm saying basically this is a limit.
Let's just say 100.
If my register now has counted 100 branch mispredictions,
then it will trigger an interrupt.
So then the CPU will say, OK, this register's overflowed.
Interrupt, and then the application monitoring this can say, OK, let me read the values that are there now. And then I can just store this information. And this is, for example, what many profiling tools do. So I'm interested, OK, when this register now has overflowed, I get an interrupt. I'll check, okay, what is the current instruction pointer, where is that pointing to, what are the registers, what are the values here. And I can collect all of this: what's the stack, what's my call graph right now. And I can just dump this somewhere or store this somewhere, and then I can just continue executing normally.
So this is what basically all tools do; mainly, let's say, the focus here is on perf.
So this is definitely what perf does and Intel VTune for example also does this.
This is all hidden from you.
You don't really need to care about this.
I'm not really sure.
I think on the next slide I have some more information on perf for this in a second.
So this is basically some of the basic,
the fundamental background on how these tools operate.
So if something like perf,
I'll get to in a second, we'll use these hardware counters,
we'll use this precise event-based sampling to monitor what's going on,
and then give you this information.
Another question.
So nowadays, lots of code is executed in virtualized cloud environments. Would you say that the virtualization just prevents that, or are there different levels, where you could say, like, Linux-based containers, you get everything, get the permissions, but general hardware-based virtualization breaks this and you can only go to software?
I'm not 100% sure.
There is a lot of discussion.
So, obviously, people doing virtualization are still interested in these things.
But I say if you think of it on, like, an abstract level,
then probably virtualization will say, I don't really care about this.
So, it's not necessarily going to pass this on.
And also, for example, if you run something in Docker,
then you need to run everything in privileged mode
so you can actually even get this information.
Because by default, you're saying this is important information
and can be used for security reasons or whatnot.
So there are ways to expose this, and it does work.
I'll show an example.
This does work in Docker,
but it really depends on if you want to expose this or not.
But I wouldn't say that virtualization in general
prevents this, because also in virtualization,
you physically need to run somewhere.
And then it just depends on the tooling around it
if it can stitch everything together again.
I guess I don't mean whether it prevents it, but let's say we two run something on the same physical box. Yeah.
I look at the hardware counter, I do nothing.
I look at it again, and the counters increased because you did something.
Okay, so that's an interesting question.
So the question was basically,
what if we have two applications running,
one is mine, one is his,
and now we're reading counters and they change.
So first of all, this is only per core.
So you can monitor system wide.
But you need privilege access to do this for security reasons.
So in a truly virtualized environment,
this won't happen.
And you'll generally just read the counters that
are assigned to your program.
So that generally shouldn't happen.
But again, this is kind of tooling around it. You, A, shouldn't really care about this. If you're building virtualization yourself, you probably need to deal with this. But in general, I'd say tooling or the kernel or the CPU will take care of this for you, so that you don't read values that someone else has produced.
Yeah.
As long as the cores are not shared, you're fine.
If the cores are not shared, then this is a zero problem, because the counters are per core.
If they are shared, then it's basically: if I'm scheduled off the core, then you would need to manage this somehow.
And I assume there's probably, I'm not 100% sure, but there's probably an interrupt for that, too.
Then you can probably track something and say, okay, I've been scheduled off this core, store everything.
And then when I go back on, wherever I'm scheduled to next, I can start.
OK. Let's dig into some tools and some friends.
So this is just some basic profiling tools
that I've been using a bit, and also some tools that are not
directly profiling tools, but that help you with profiling.
So profiling in itself is often a bit of an art form,
let's say, because it's usually not super obvious what's
going on.
So the example we'll show later, it's a bit more obvious,
but often it's not that obvious.
So you need to dig into a few things
to understand where time is going.
So the first one, the first big one I'm going to share is perf.
So this is just the standard Linux kernel profiling tool.
So basically, if you have Linux, you probably want to
use perf.
And it abstracts all these hardware and software events
for you.
It doesn't matter if you're running on Intel or AMD or
ARM, there's a set of things that it just fully abstracts
for you.
Obviously, if you're interested in CPU-specific
counters or events, then you need to know which hardware you're running on.
But something like cache misses and branch mispredictions
and these kind of things, this will just abstract for you.
And it doesn't matter where I'm running.
I can just say, give me cache misses, and it will do that.
So these are like hardware events.
But also, it has a lot of software events
that you can kind of look into.
And that's why I'm going to start with this scary picture here.
And you can see in the kernel there's a lot of different things going on and I
can
get information about syscalls or networking
or file systems or page faults.
For example, I think there's
core migration which would be I've been switched to another core
and these kind of information I can all get that and perf does all of this for you, which is quite nice.
If you run perf list, there's a list of all these events.
And I didn't fully check because there's
multiple lines per output.
But they're in the order of thousands of events
that you can use in perf to track.
Most of them will probably never be relevant for you.
But some might be.
By default, currently, perf samples at about 4,000 hertz, just FYI.
You can obviously change this too,
but that's just what you usually get.
And now just to cycle back to these hardware counters.
So on Intel, it tries to use this unhalted core cycles counter. So basically, when that reaches an overflow with precise sampling, then it'll take a snapshot of everything and continue.
On AMD it's a different instruction, on ARM again different instructions. You really need to dig into some of the documentation to figure out what's going on underneath, but in most cases it's also not very relevant.
And yes, so there's some really good resources on how to use perf. One is just the official wiki from the Linux kernel, and the second one is where I also got this picture from, from a guy called Brendan Gregg, who does a lot of perf things. There's a very nice talk about how they use perf at Netflix on this website.
There's a lot of one-liners that you can use just
to check different things.
So it's a very, very good resource
if you want to get into Perf.
So yeah, Perf, like I said, is a very, very mighty tool.
And if you just run Perf without any arguments,
then this is the list of commands you can run.
And this says this is the list of the most commonly used
commands.
So there are even more.
It's just not showing all of them.
Again, most of them are never going to be relevant for you.
This was originally also designed for kernel profiling.
So a lot of these things are probably only relevant
if you're actually in the kernel or if you're
doing very specific things.
And the most commonly used ones, don't take my word for it.
Obviously, it depends on what you're doing.
It's just going to be perf record and report,
which says record something and later on, once I've recorded it,
show me the output.
Or perf stat, which just counts the events and outputs them
directly.
And some nice things.
So you can attach this to a running program in Linux,
as long as it's your own.
So you can't attach this to someone else's program.
You can do system-wide monitoring.
So I can just say, please track all page faults system-wide.
You need to have administration privileges for this, but yeah, it's possible. Most often, you'll probably just run a single application, so you'll say perf record my application, then my application finishes, and then I can see what's going on afterwards, which is also what we're going to be doing later.
OK, so just a very brief output here.
So if we have perf stat, for example, for the command ls: so you use perf, then the command you want to do, and then this is usually the program.
For this very basic example, I'm just showing ls.
And this is just the default set of events that perf will emit.
So we can see how much time it took,
how many context switches there were.
This is the CPU migrations I was talking about.
So how often was I switched from one core to another?
It will track the number of page faults we had,
how many cycles have passed, how many instructions
were executed.
From this, we can also then see how many instructions
per cycle there were.
We can see how many branches were taken, I guess, or how many branches
there were, and how many branch mispredictions there are in here.
So these will just give you a general idea, and depending on the output, sometimes this will also give information like: oh, you're spending a really, really high amount of time somewhere here, you're bound by something. It really depends on what you're running. And then, just a bit more specific: let's say I wanted to run this 10 times, just to make sure I'm actually getting good results, and I'm only interested in L1 data cache loads and stores. Then I can just specify the events I'm interested in, and then I get only these events as output. So I can kind of mix and match here what I want to do.
OK, next one is Intel VTune.
So as the name says, this only works on Intel.
It's an Intel tool for x86.
They have a desktop app that you can run,
which you need to run on it.
So you need to have an Intel CPU to actually even
run this locally.
But you can then connect it to a remote server.
There's a web UI, which means it has an application on the server and you can connect through the web, which I'll show you in a second. And they have a command line tool. I usually use the web UI because I have a MacBook; I don't have Intel anymore. But I can use it if I want to profile Intel servers, and I'll show you that in a second.
The command line tool, I don't use that often,
because actually the one main advantage for me personally here,
this is a visual tool, so I can kind of click through it,
and if this helps you, then that's good.
Like I said, it only works on Intel x86,
but then it has a lot of more specific information that might be only available on those systems. So just one example, all the arrows are a bit messed up, so one example here I've picked is 4K aliasing, which is like a cache effect: when I'm accessing memory in four-kilobyte strides, then usually this is really bad for caching, because depending on how your cache is designed,
I'm always evicting the same cache lines just by how the addresses are calculated.
So this can give me information on that, which again is very specific to the actual CPU and how the caches are designed.
If you ever run anything on the data lab servers, this is installed on them.
You can opt in to oneAPI somewhere.
There's an application called VTune Backend,
and you can run that to actually get this in the web UI.
And then this looks like this.
I have no idea what I actually profiled here,
so this is something old.
Let's see.
And then you can see there's some, this is probably not going to be super, oh this is
some, okay.
And then you can see some things here, and you have like a call graph and you can see
where time is being spent and I can't show the assembly probably right now.
OK, I can't show that right now because the path is all gone. But then you can see the assembly and the code
next to it in which instructions are being executed.
And all these things, it's very visual.
And you can see there's a lot of information here
on where this came from and I don't know which function.
And there's a flame graph here of how much time you're spending.
So these are all very nice visual features all in one.
If you're profiling Intel, then it probably
makes sense at some point to look into this.
And this also gives you some nice insights into memory consumption and memory bandwidth and on which memory channels this stuff is running, if you're interested in these kind of things.
Question? So this, yeah, this is generally, I'll get to that in a second: if you're profiling, you always basically want to build with debug symbols. It doesn't mean you need to have a debug build.
You can do release with debug symbols, which can be -O2 or -O3, just with a -g to add this.
And then there's the debug information
that then the tools can pick up.
Otherwise, you'll just get cryptic names here, which is like 0x2499 underscore something. Whatever it can pick up: if it has information, you get it, and if it doesn't, then you won't get anything.
OK.
Yeah, so this is a nice tool.
If you want, play around with it and see what can be done.
Another one is Apple Instruments, just
for the sake of completeness here.
This is Apple's visual profiling tool.
This is mainly designed for macOS and iOS applications, so it kind of lives in this Xcode environment. You can use it without, and I have used it without, which I'll show you in a second. This only works for macOS, but it supports ARM if you have an M1 or M2 MacBook, which is quite nice, because if you have, like, Intel on the one side, x86, it's a very, very massive tool. Tooling specifically for ARM, compared to that, is very low. There are some commercial ARM profilers, but this one is free. So that's usually a problem if you just want to play around with it.
And I have something here.
I can switch that across.
So you can see here there's some code I ran for some benchmarks a while back.
Without digging too much into this,
you can see here how much time we're
spending in different instructions.
And we can see it as a call graph here
and seeing how much time we spend.
So you'll kind of see the structure again later in Perf.
These tools kind of all have a similar structure.
But without spending too much time on it, we can see how much time we're spending, and basically all of it we're spending here in this
hash bucket get function. Again, Apple Instruments, feel free to play around with it if you have a
MacBook and if you want to profile some of your code there locally. Okay, some friend tools. So
you've probably seen this in the lecture already. It's Godbolt. It's officially
called Compiler Explorer, but the guy who invented it is called Matt Godbolt, so everyone just calls
it like that. There's the link. I'll also show this later in the hands-on part.
Yeah, so basically what it does, it just takes the code you have and it shows the generated
assembly for it. So this is incredibly powerful if you want to discuss what the compiler is actually doing. The main motivation for this actually was that two people were arguing about whether the compiler will optimize something or not, and with this tool you can actually say the compiler will or will not do this.
It works for many, many languages. It works for many, many compilers: if you want to look at GCC and all its versions, if you want to look at Clang and all its versions, it runs on ARM, it runs on x86, you can compile stuff for POWER on there, you can use super esoteric compilers that no one really uses, but they have it. So definitely check it out. And just one fun fact that I didn't know for a long time actually: you can also include public headers from GitHub. So if you go onto the raw view on GitHub, I can just take that, include it in C++, for example, and then I can just use public code there, let's say if I have a dependency, and that's the easiest way to do that.
Okay, the next one, and this is a bit more specific, but I've been using this a bit recently for some things.
I've been digging into SIMD for example, and there's this website called uops.info.
This is run by some people from Saarland University. And on this website, you can get, for x86 servers, so this is AMD and Intel, information on certain instructions, for example this VDIVPD instruction, which is probably some division of some floating point, not 100% sure, double, yes, okay. And this is really helpful if you really want to dig deep into something.
So most of this is probably not relevant.
There's just a link here.
Again, if you're looking at assembly generated code,
you're looking at some profiling output,
and you kind of want to see why instructions are behaving like they are, this might be helpful.
The third one, which we're also going to use in the example,
is Google Benchmark.
So this is a widely used, if not the most widely used, C++ microbenchmark framework.
Something that it's also kind of well known for is this do not optimize method.
There's a nice video explanation of what it does.
It's basically a compiler developer telling people how to trick the compiler into not doing certain things.
Because if you're benchmarking something and the compiler
sees, oh, you're actually never doing anything with this
value you're reading, then I can just remove the entire
function call.
And then you're profiling nothing.
This usually happens when you profile something or
benchmark something, and you see I'm getting 10 trillion
events per second, then this is probably happening.
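For reference, a minimal sketch of the Google Benchmark pattern with DoNotOptimize; the benchmarked function here is made up:

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>

// Made-up workload: without DoNotOptimize, the compiler could see that the
// result is never used and remove the whole computation, so we would
// "measure" nothing.
static uint64_t sum_to(uint64_t n) {
  uint64_t s = 0;
  for (uint64_t i = 0; i < n; ++i) s += i;
  return s;
}

static void BM_SumTo(benchmark::State& state) {
  for (auto _ : state) {
    uint64_t s = sum_to(1024);
    benchmark::DoNotOptimize(s);  // keep the result observable for the compiler
  }
}
BENCHMARK(BM_SumTo);

BENCHMARK_MAIN();
```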
And there's an online version of this, QuickBench. So this is kind of similar to Godbolt.
I'm not going to open it now for the sake of time.
But you can add some code here in this Google Benchmark format.
And then you can just run it, and it'll give you a brief overview.
This is not going to be super precise,
because you have no idea which server it's running on.
But you can see conceptually which of the two might be faster or not.
So if you want to look at really, really low-level optimizations for specific hardware, this is probably a bad fit.
But if you're looking at general things, this is actually quite a nice tool.
And you can then just share a link and show someone, hey, I ran this somewhere without needing your own server.
Okay, just a list of other things.
Again, I said profiling is usually not done in isolation.
Actually, the first one is still in isolation.
So, Valgrind, there's an entire tool set around this.
This does dynamic binary instrumentation.
So, it kind of says, I'm simulating the CPU instructions,
adding something, and then actually executing it on the real CPU.
So, this is an incredibly powerful tool. I personally don't have that much experience with
it, so I can't share that much information here right now. But again, a lot of people use it.
It's very mighty. It's been around for 20 years. There's other tools you should definitely use,
like static analysis, static code analysis. So I think for the example I'm going to show you in a second,
most of this will actually be found by a static code
analysis tool.
Something like address sanitizers,
thread sanitizers will help you find bugs.
So then it kind of also guides you in where you need to profile and where not. If it's telling you that you have a problem there, then you can kind of look into it some deeper. Use flame graphs. Like I showed you, it might be very
helpful to get a visual representation of where time is being spent. If you are a visual person,
there's a lot of tools out there for it. VTune can do it. There's a plugin or like a script for
perf to do this. And there are many, many others. So if you ever need to profile something,
just Google and maybe there's a tool out there that actually does it for you.
OK.
So I'm going to switch over to the hands-on profiling part now.
This is the first time we're doing this.
So let's see how this works.
Oh, actually, just a very brief setup before I get into it.
So the scenario we're running is a very, very, very basic,
simplified semi-join
on strings with a count afterwards. So we're just saying, please count how many elements in my right
relation have found a join partner in the left relation. So I mean, basically, it's just two
for loops here. We add stuff to a hash set. If we found a partner, we increment a counter, and at the end, we return that counter.
Very small, very basic thing, because I can show this
on my computer here so you can actually see the full code.
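For reference, a rough reconstruction of what such a method might look like; the actual code shown in the lecture may differ in names and details. Note the by-value parameters, which become relevant later in the walkthrough:

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Rough reconstruction of the profiled method. Note: both relations are taken
// by value here, so every call copies both vectors of strings.
uint64_t count_tuples_with_join_partners(std::vector<std::string> left,
                                         std::vector<std::string> right) {
  std::unordered_set<std::string> build_side;
  for (const auto& l : left) {
    build_side.insert(l);  // build a hash set on the left relation
  }

  uint64_t count = 0;
  for (const auto& r : right) {
    if (build_side.find(r) != build_side.end()) {  // probe: found a join partner?
      ++count;
    }
  }
  return count;
}
```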
Again, for doing this like these toy examples,
there's always a trade-off between showing purely toy
examples or actually showing something
where there's a bit of progression.
But usually, you'll do this in code bases that are a lot larger or you're looking at very, very,
very specific things that are not very easy to do, both of which are hard to showcase in a short
example here. Yeah, this is just general information, like the question we had up here
before. If you want to profile something, always use the -g flag for C++.
This will add debug symbols to your build, because otherwise the compiler will have no idea,
the profiler will have no idea which function you're actually in or what the name is or where you are in the code,
because you're explicitly saying, I don't care about this.
And one small random detail: you need to have, well, you should generally use this -fno-omit-frame-pointer flag here to actually keep the frame pointer in the frame pointer register. The thing is that, back in the day, Intel didn't have that many general purpose registers, so someone decided to say, ah, let's use the frame pointer register also as a general purpose register. Then you lose this information. So if you don't do this, that might happen. And then if you want to do call graphs and see where I'm jumping through the code, this will probably break if I don't do that.
And for perf specifically, you need to have this small command here at the bottom, which actually allows you to profile kernel information.
Otherwise, Linux will just say you have no rights to do this.
OK, so let's see if I can kind of switch to this here now.
And I want to move.
And now I actually want to stop.
I want to mirror something so I can actually see what I'm doing.
OK.
I've prepared something.
My printout, so I actually know what I'm doing here.
Because otherwise, it's easy to kind of lose track of where we're at.
OK.
So the very first thing we want to do, I just like I
said, this is going to be full run-through. So first of all we
probably want to actually this is probably... is this okay? Okay good. So very basic setup.
Build type.
So generally, what we want to do is, first of all, we want to build something to run.
That shouldn't be a surprise for most of you.
So here, again, we're doing, in CMake, there's this predefined RelWithDebInfo, which says: this is release, but build it with debug information.
If you want to follow along, feel free to follow along.
For the sake of time, it might be tricky to keep up,
because some commands will take a while.
I'll switch back will take a while.
I'll switch back and forth a bit.
This is all recorded.
The slides contain all the commands, all the information.
If you want to do this later, feel free to do this.
So, yeah, first of all, I just want to show you the code we're running.
So at the bottom here, we just have some random setup code.
We're doing some initialization.
We're doing the Google Benchmark code here,
which is not super interesting.
We're generating some data.
Again, feel free to dig into this.
But what we're actually looking at
is we're looking at this count tuples with join partners
method here.
So if you have some experience writing C++,
you'll probably also immediately detect some issues with this code, which we'll kind of go through step by step.
But let's say this is the code we want to profile now.
Now we are a database engine that compiles its queries, and we have this super fancy generated code here, and we see it's kind of slow.
So this is kind of what we want to do for the first thing.
So first of all, we kind of want to build everything.
While this is building, let me just quickly go into here.
I think your tabs have different font sizes, right?
So like the first tab has bigger font size than the others?
Oh, okay.
Okay.
This kind of...
Okay.
Okay.
As long as people can read stuff.
Okay, so what we do, we have this small benchmark here that we can run.
Let me just run this.
This will give us some output.
Hopefully, this works.
Okay, it's a bit slow actually right now.
Maybe someone else is working on the server.
So first of all, let's just rerun this.
But it's kind of hard to...
Yeah.
It's weird if you usually type with a different keyboard and then you use this one.
Oh, it's repetition, isn't it?
Okay.
So, generally, we want to do things a few more times to actually see if the numbers
we're seeing are good or not because if they're not, then it's kind of hard to see if we're
making progress if they're very random.
And for larger code snippets, they usually get a lot more noisy.
So this will hopefully show us that the performance is somewhat the same.
We can see it's somewhere between 590 and 600ish nanoseconds per tuple.
So if we go back to the code now, this kind of shows, we learned in the lecture a while back,
that there's something like branch mispredictions on ifs,
and so we can use predication to solve this.
So, this would be like, we learned this in the lecture.
So, let's actually fix this and say that we want to count.
So let's say we want to do this. Sorry for the weird indentation here.
I'm not very good with Vim.
And it's a bit slow.
OK.
OK, and let's say we have this.
Actually, can I set number?
Can I unset it like that?
No, OK.
Huh?
It's like no number.
Nice.
So let me just copy.
Oh, it doesn't have anything.
Okay, never mind.
I wanted to show you in Godbolt.
We have vector, string.
These can all go.
We don't care about that. Yes, then we want to, yeah, I'm on that. So we've been running stuff with GCC 11 and with O3, so let's see if this hopefully does what, well, it won't, but what we want it to do. So yeah, let's quickly go back to the other code.
And we can see here, if we use Reveal linked code,
we can see there's a jump here.
So if this value is not equal to whatever,
I don't know whether to check, the comparison's right here.
If it's 0, then we jump, which is somewhere up there.
Oh, well, it doesn't really matter right now,
especially if we have a branch, and then we add something,
and then another branch somewhere along the way.
We want to get rid of that branch because it's slow.
So we do this.
We compile it.
And then we see that exactly the same code appears.
So again, we've just randomly guessed at something, and it doesn't make any difference. I can compile this and I can guarantee you the performance will be exactly the same, because it's the same assembly code. So this is one of these things. Well, we learned something. This is generally what people usually talk about when they say premature optimization is the root of all evil.
So I'd argue that this code is still OK to read.
But people making random guesses about performance,
optimizing something and saying it's going to be faster, well,
it's usually not.
Interestingly enough, if I switch this to Clang 16,
just for the sake of comparison.
This has a predicated version in it, one second.
And actually, if I switch to the other version, it also uses predication.
So Clang apparently says I always do predication, while GCC says I don't.
Very hard to generalize.
Let's see.
At least it did yesterday.
Let's see if I can.
Yeah, same code.
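For reference, a sketch of the two variants being compared here (names made up); as just shown in Compiler Explorer, GCC 11 at -O3 produced the same branching assembly for both, while Clang predicated both, so the source-level rewrite changes nothing:

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Variant 1: explicit branch in the probe loop.
uint64_t count_branching(const std::unordered_set<std::string>& build_side,
                         const std::vector<std::string>& probe_side) {
  uint64_t count = 0;
  for (const auto& r : probe_side) {
    if (build_side.find(r) != build_side.end()) {
      ++count;
    }
  }
  return count;
}

// Variant 2: "predicated" formulation that unconditionally adds 0 or 1.
uint64_t count_predicated(const std::unordered_set<std::string>& build_side,
                          const std::vector<std::string>& probe_side) {
  uint64_t count = 0;
  for (const auto& r : probe_side) {
    count += (build_side.find(r) != build_side.end()) ? 1 : 0;
  }
  return count;
}
```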
The question up there, yeah?
Is there any difference in the compilation time when you try to compile what we're doing?
No, no.
This is tiny compared to whatever's going on.
This is completely irrelevant.
This is a very, very, very low level performance decision that the compiler makes.
So just generally to show you, this is actually
completely irrelevant.
So actually, for the sake of time, this took a bit longer than I expected to.
Let's just get rid of all of these.
Just make, again, just to make sure we have everything.
Okay, so what we actually want to do is now, if we want to, oops.
We actually want to record this, which, like I said, is the perf record command. So this will run now for a bit and show the output. So you're running the normal command, but in the background you can see it's woken up five times to
write data. I'll show you in a second how much data has been written. The first
time for some reason it always takes a while.
Okay and then we can use the perf report command.
And then we can look at this and we can see, well, we're spending some time here in some
random instruction.
We're spending some other time in another random instruction.
Another time here in node pointer static cast something.
What was this?
This is some range hash.
This is some hash table internal move instruction.
So it's very hard for us to actually understand
what's going on here.
Personally, for me, it would be very, very tricky to do this.
Yes?
You were using tab there just to go through all the instructions.
Oh, yes.
Oh, yes, okay.
Yes, so I'm using tab, the tab,
so I can switch between the hottest instructions.
So you can see I'm here at 21.
If I go backwards, then I'm at the lowest one.
And this is just cycle down through the, yes, thanks for the pointer.
So this is not really helpful.
So what we can do is we can do perf record minus g, which says we're interested in the call graph.
So we do the same thing.
Now I want to see where the function call is coming from.
What am I spending time on?
So this will run.
I can perf report again.
And now we can see there's these small pluses on the left hand
side there somewhere.
And we're interested in our join something here.
And we can see there's our join and some libc start call main,
which probably is not in our control.
And then we have our count tuples with join partners,
and there's std hashbyte.
So I'm using the plus character symbol to go into these.
We can see there's some glibc call, hash bytes something, a memmove. So for me, it's not really clear what's going on.
So there's another way that we can go into this,
which is, execute this.
Okay. So, what we can do is we can say,
currently what it just did,
it used the frame pointer to detect.
So, if I just do minus g, it detects via the frame pointer: where am I, where have I come from? And it didn't have enough information for us.
So another thing we can do is we can use the dwarf format.
So what it does, at every sample, it takes a snapshot of the stack.
And then now when I run this to report it,
now it will go through all of these snapshots
in the background and try to figure out
what was the information that I've stored previously.
And it uses the DWARF debug information because, like we said, we have -O3, -g.
So we have the debug information and we're using this to actually get some insight.
So this will take a while at the beginning.
And this will now hopefully be helpful.
So this is all done at report time now; it takes a bit longer. The other one actually did it at recording time.
So when this is done in a second, we can see here, this is now a bit cleaner.
We can see we're interested in our benchmark join function here.
Okay, so this is the count tuples with join partners; this is the method we're interested in, because it's a hot loop. But interestingly enough, we see that about, what was this, nine percent, eight, nine percent of the time we're spending in some vector logic. So this is kind of unexpected. Let me just jump to the code. Can someone maybe tell me why you think this is the case?
Why are we spending so much time, or like 8% of our benchmarking
loop in some vector code, even though we're
working with an unordered set?
Any guesses? Why would we be spending time in vector-specific methods in this?
Cool, yeah.
Yes, very good.
Because we are copying the vector.
Like I said, any static analysis tool will probably detect this for you.
We're only ever reading left side and right side here.
But we're copying them on each invocation of the method.
So this profile here kind of told us: hey, we're spending some time here.
And unfortunately, I can try to zoom out.
But you can see there's a vector construction and destruction
right at the right.
This is always a bit shitty to view in perf, unfortunately.
But yeah, so basically the very, very simple fix here: right now we're copying these, so we stop copying them and take them by reference instead.
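A minimal sketch of that change, with the function name and types assumed (the actual task code may differ):

#include <cstdint>
#include <string>
#include <vector>

// Before: both sides are passed by value, so every call copies two vectors of strings.
// uint64_t count_tuples_with_join_partners(std::vector<std::string> left,
//                                          std::vector<std::string> right);

// After: we only ever read the inputs, so take them by const reference.
uint64_t count_tuples_with_join_partners(const std::vector<std::string>& left,
                                         const std::vector<std::string>& right);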
And this actually helps: we were at somewhere around 580 nanoseconds before, and now we're at 530.
So this gave us about a, let me just
see if this aligns with the numbers I had.
Yeah, this gives us about 40 to 50 nanoseconds improvement. So now again, next step, we're
still not convinced that this is good. So we just run the same record command again.
We're going to stick with this dwarf,1024 setting. By default, dwarf captures 8 kilobytes, so 8,192 bytes, of the stack per sample. We're telling it to only capture 1,024 because, A, in this example we don't need more, and B, if I record more data, the report step here takes... oops.
Yeah, clear didn't work. Oh, I had pressed Q, that's why.
So if I capture the full 8 kilobytes, then this will have, I don't know, a few million more entries to chew through, and the report will take like two minutes to run, so we'll try to avoid that if possible.
So yeah, the dwarf one takes a bit longer, but it gives us more information.
And so generally, you need to be a bit patient.
But now we are there.
We're still interested in our benchmark join information.
So, just also FYI, there's some stuff like this 0xFFFFF here.
This, I think, is the way perf represents anonymous functions it has no information about, so for example lambdas.
We can also see there's some libc stuff in here somewhere, libc.so.6.
So there's a lot of stuff in here.
But currently for now, again, I really
encourage you to play around with this if you haven't yet,
just to get a feeling for what information is there.
What's the difference between opening up
the children, so pressing the plus button,
and all the different lines right now?
So if you go into the join, you get all the functions
that are in the join function, right?
Yeah, yeah, yeah.
So what do the different lines show, then?
So basically, I'm not 100% sure if this aligns fully, but this line that's yellow right now should also be somewhere down here, somewhere below.
If I'm not mistaken, it's probably this one here.
By the looks of it, no.
So this one here... where's the other set?
So yeah, this yellow one is probably this one here.
We can see it's 44% and 44%, so this probably aligns somewhat.
You need to pay attention here that it depends on whether you have relative or absolute percentages.
I think here they're all absolute percentages.
Yeah.
Yeah, they're ordered by the time spent in them.
Exactly. So now I'm just stepping in to say: okay, within this count tuples with join partners, the one we're interested in, these are the methods that are being called.
So you can also invert this.
I'm not going to show this for the sake of time, but there's a --no-children option, I think, that you can pass to perf report, and it will invert this.
So then, for example, if I'm interested in exactly this yellow method here, it'll show
me all the places it was actually called from to see, for example, if that's correct.
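In command form, that would be something like:

perf report --no-children

Then expanding an entry shows the call chains leading into it rather than what it calls.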
And the next thing we're kind of seeing in here again, so we have count tuples, we have
our unordered set, which is expected.
So we know from our database two lecture that hash joins will take some time.
They're fast, but they're not incredibly fast compared to other things.
So we're seeing an unordered set here.
We're doing some stuff.
But we can also see that we're spending some time in a basic string constructor and a basic string destructor, which, again, is about eight, nine, ten percent. And that's again something we're a bit confused about, because we shouldn't be creating strings, should we? So can someone maybe tell me where we are creating strings that is not within the hash set itself?
Distinction here, important distinction.
We'll get to that later.
Maybe focusing on the line that is currently
highlighted in white-ish.
Yes?
Perfect, exactly.
So we have just moved to C++ from the Java land.
We've never taken care of these things.
And suddenly we see, ah, this again makes a difference.
Before, for every element on the left side, which is the vector, I was creating a string copy of the value. Now I'm saying: please just use the one that's already in there.
Same here.
So this is, again, a very, very, very common C++ beginner
mistake.
Also, not so beginner mistake if you're not thinking about it.
This happens all the time.
Again, this is why you probably want some tool to tell you,
are you sure you want to take a copy here?
Because you're only using it as a reference or a const ref,
actually.
And that's not super important right now.
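A minimal sketch of that loop, with assumed names; the whole fix is the const reference in the range-for:

#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

uint64_t count_matches(const std::vector<std::string>& left,
                       const std::unordered_set<std::string>& build_side) {
  uint64_t count = 0;
  // Before: for (std::string value : left) constructs and destroys a copy of
  // every element, which is the basic_string time we saw in perf.
  // After: bind a const reference to the string that already lives in the vector.
  for (const std::string& value : left) {
    count += build_side.count(value);
  }
  return count;
}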
OK, so we've written this.
Let's make our code again.
And then I can join benchmark.
And hopefully we can see another speed up.
We are now down from 540 nanoseconds
to 460 nanoseconds per tuple.
Yay.
So again, we want to go to the same thing.
I'm still not convinced we're at the end of what's going on here.
So we can again do the call graph and get some information on this.
Perf report.
Again, every time this will be a bit quicker, luckily,
because we're collecting less data because everything's quicker.
But also in VTune, for example, this sometimes, if you do a big benchmark on VTune, I've waited for an hour for this to happen.
It's also interesting that the local copy of the string was not removed by the compiler, because you would guess that for a very small for loop the compiler could actually remove the copy.
Yeah, so the compiler didn't optimize anything out. Maybe the compiler says: I can't do that, I'm not sure it's legal C++ to do that. Maybe it could; in any case, it didn't, and we saw in profiling that we're spending time in string construction that we don't want.
Yeah, again, here we can see a lot of time is spent in some anonymous function within Google Benchmark.
So that's not super interesting, because we know we're in the count tuples with join partners.
And now we can see we're spending all of our time in the unordered set.
So that's kind of good.
We're kind of at a spot where we're saying, well, OK, this is how fast we can be.
We're not wasting a lot of time on other things.
So in our code, the performance bottleneck now seems to be that we're spending all our time in unordered_set methods, which is good in general, because that's exactly what we want: we know that the hash join is probably going to be the most expensive part here.
So now our colleague comes along and says: hey, I found this super, super cool new hash set that we can use, which is way quicker.
Luckily, we've already included it in our example.
For some reason, I can never spell unordered correctly.
So there's this unordered dense set.
So we can actually replace this.
And so what we want to do here is, what was it?
Dense set.
So we're just changing that.
We're just swapping the one implementation here.
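Assuming the dense set in the example is ankerl's unordered_dense library (the idea is the same for any drop-in replacement), the swap is essentially a one-liner:

#include <string>
#include <unordered_set>
// Assuming the "dense set" here is ankerl's unordered_dense library; if it is a
// different one, the idea is the same.
#include <ankerl/unordered_dense.h>

// Before:
// using JoinHashSet = std::unordered_set<std::string>;
// After: a (mostly) drop-in replacement with the same basic interface.
using JoinHashSet = ankerl::unordered_dense::set<std::string>;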
Out of curiosity, how much do you think this will speed things up?
Random guesses, please.
Twice.
Twice? Okay. Do we have higher, do we have lower?
Sorry?
How much time did we spend on that operation?
Before we had, what was it, basically all the time in our count tuples with join partners, around 50% of the entire run. And if you annotate the instructions, I would guess that you would see that even the stuff in there is also hashing and hash-table-based work.
Basically, I would expect that you'd spend 95% plus in hash table logic.
Okay, so we have about 2x.
That's a good guess.
Let's see what happens if I make this and run it.
We actually get a 3x improvement.
So this is nice; 2x would not have been a bad guess either.
And also, Thomas commented here about how much time we're seeing spent where.
This is one of the fun parts about profiling: these numbers are always kind of hard to interpret.
So because we're saying we're spending 94% of the time
in the join benchmark function, and then we're
having 50% of this in the method we're at,
but we're also spending 50% roughly in whatever this is.
So there's, again, some other things.
And we're not really sure.
Again, it's a bit tricky.
It's a bit convoluted.
So actually, yeah, this is one of the takeaways
I think that's important from this lecture
is that it's actually quite tricky to fully understand
what's going on.
It might take a few iterations.
You might need to combine this with different tools.
You might need to use a flame graph to visualize something,
just to kind of go step by step, understand what's going on.
And again, this is a very, very basic example.
If you're profiling something bigger, for example the VTune trace I just showed you was from Velox, Facebook's new query engine that we looked at for a research project, then profiling that is very, very hard. It's a big mess.
First of all, you don't understand the code.
You don't know where time is being spent.
It's doing some random things you don't understand.
And then kind of digging through this, stepping through this.
So this is often paired with debugging, trying to see, okay, I set a breakpoint in some method.
Where did I stop? Where did I come from?
One specific problem I had is actually I was taking two completely different paths, but
I wasn't sure, I couldn't see that in profiling.
I only figured that out while debugging.
So this is always kind of paired together in these kind of things.
So for example, with the string constructor we were spending a lot of time in, I kind of told you where that was coming from.
Another approach would be to actually set a breakpoint on the string constructor, run your code, and see where you're coming from.
Then you can see: ah, okay, I'm spending time constructing strings while calling my method, which is kind of unexpected.
Okay, so we saw we're actually a lot quicker, but I'm convinced we can still be quicker.
So let's just profile this again.
Let's report.
And we can see we've gone down from 25 megabytes to 9 megabytes.
Or was it? I think it's... no, it's megabytes. Before, I think it was 25.
I don't think that's true. I think it's events.
It says here 9 megabytes, and it said 25 megabytes before, so that matches what we had.
Okay, so we're in here again. Oh, there are page faults now.
Also, now that everything is getting a bit quicker, the output here starts to get a bit more noisy, because some other things suddenly become important.
And we can see we're spending all the time in the hash set again, which is good. But now I'm convinced that we can actually dig into the hash map and look into this a bit deeper. And surprisingly, we can again see some time being spent in some basic string constructor.
And I think there's a destructor somewhere here.
Probably somewhere down here, maybe, or not.
It says, again, we're spending time in constructing strings,
but we don't really want to construct strings, do we?
So usually I would maybe debug and see where we're getting this from.
Can someone tell me why we are seeing a lot of string constructions in this example?
We've constantly been talking about how we've been copying strings.
Where are we still creating strings in this code, within the hash set? Yes?
Yes, very good.
So we're emplacing a value here.
We've taken something by reference, but we're actually storing a std::string in this hash set here.
So what we want to do here is use a nice C++ feature called string_view, which is basically just a pointer and a length.
So I have my string somewhere in memory, and now instead of copying the entire string with everything that belongs to it, with the short string optimization and its 24 bytes of overhead and whatnot, and a memory allocation on the heap, I can just say: here's a pointer, and please read this many bytes.
And that's my string view.
And that's what we're doing here.
So again, we'd only need to change this one very nice line.
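A sketch of what building the set over views could look like, with assumed names; the important caveat is that the strings being viewed must outlive the set:

#include <string>
#include <string_view>
#include <vector>
#include <ankerl/unordered_dense.h>  // same assumption as before

// Build the set over views instead of owned strings. A std::string_view is just
// a pointer plus a length, so inserting one copies 16 bytes, not the characters,
// and never allocates. Caveat: the std::strings in `right` must outlive the set.
ankerl::unordered_dense::set<std::string_view>
build_side(const std::vector<std::string>& right) {
  ankerl::unordered_dense::set<std::string_view> set;
  set.reserve(right.size());
  for (const std::string& value : right) {
    set.insert(value);  // implicit conversion std::string -> std::string_view
  }
  return set;
}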
Then we can make everything again.
How much speedup?
This is less obvious.
So we're seeing we're somewhere in here in some vector, because that's probably what it uses underneath to store something.
There's a basic string here.
This is all string construction.
How much speedup?
This is more of a question.
Do you know why it's not showing any number anymore?
Like, on the line?
This here?
Probably because it doesn't have any.
I honestly don't know.
So here we can see there's, for the create,
there's somehow numbers again here.
You can see there's a lot of time being spent here in allocations.
Just random guess.
There's no wrong answer.
Ricardo.
3x.
OK.
Why do you think 3x?
OK.
That's a good answer.
OK.
Let me just run this join benchmark.
And we have 4x.
So it's somewhere between 3 and 4, depending on how much load is on the server.
But yeah, so actually, again, creating these copies was very expensive.
And if I kind of dig into this, again, the example is a bit oversimplified, but as we dig through it, I can see there are a lot of methods here around basic string construction.
This one here is going to be a killer.
We kind of know that memory allocations are expensive.
We don't want to do memory allocations.
So getting rid of these will have a huge impact on performance, which is what we did.
Just about your general workflow.
So I guess every optimization step, you should really go back and run your unit test suites
or whatever again, because everything you did,
like, let's exchange this data structure.
Now you have different guarantees.
Yes, yes.
Let's now not copy.
So what about ownership?
Yeah, so this is a very, very, very good point.
Obviously, you need to test things.
While we were actually developing this, I think Richard had some weird cases where he started making things references or not copying things, and stuff broke.
So obviously, you need to have a very good test suite for this.
Probably, you're not going to exchange your hash set
in the running system.
That's a big decision.
So a lot of these things try to be drop-in replacements for std::unordered_map or std::unordered_set.
But I think the thing that kills performance most in std::unordered_map and set is that they guarantee pointer stability if you do some weird things.
And basically, that limits your design so fundamentally
that you can't really apply that to other things.
Because then you need to have pointers to everything
and have indirections.
Because otherwise, you couldn't keep the pointer stability. That's why these sets are as slow as they are: the C++ standard requires this. So if you care about performance, you probably don't want to use them.
So yes, we are actually at the end of the example here, which is good. Good timing.
I'll show you one or two more things.
For those who want to try this, we're still not at the end here.
There are some things that can be done.
You can still optimize this, which is a very nice example.
So these were kind of, let's say, physical changes; we changed some physical properties.
The other fix is actually a logical fix,
which kind of also hints that in database systems,
if you do logical optimizations, this
will make huge difference compared
to physical optimizations.
So for this example, consider, if you're looking at this, how we are building the hash set, and which events go into the hash set.
So yes, with that, I will only show you one more thing in the slides, because I kind of
showed you the, actually, did I?
Now I can see it's only 3 megabytes, which is coincidentally also exactly what it says here, Richard.
Yeah, so you're collecting exactly 1,000 megabytes.
OK, just some fun.
OK, let's look at this.
Actually, which one did I want to annotate?
OK, let's just annotate the entire thing here.
So here, if I'm in this overview,
so this is usually like perf report is usually
the overview we're going to be in.
Well, let's say we're interested in this line.
If I press Enter here on the count tuples,
it says annotate count tuples with join partners and some other things.
But let's just press Enter on this again.
And then this will give us the instructions that we have, the x86 instructions with the
code lines above it.
So in this case, we're seeing this mov instruction here, and the C++ line above it is the one that generated it.
And now we can actually see that 40% of the entire execution time is spent on this single mov instruction, which is quite interesting.
Can someone maybe tell me why this is the case,
looking at this line of code?
How can one move instruction be so incredibly expensive?
So generally, in a hash table, why is a hash table slow?
If it's slow.
Yes?
Okay, so that's actually a good guess.
But looking at this instruction here, or this line of code,
does this look like we're handling a collision right now?
Actually, it could be.
So again, let me rephrase.
Would collision handling be done in one move instruction?
Okay.
Still, it's a very good guess, and it's definitely not wrong.
But why are we spending so much time in one move instruction here?
Okay, I'm just going to tell you.
So basically what we're doing is we're loading data. This is a random memory lookup. We're issuing a mov instruction.
We're reading data from memory into our register.
And if I have to wait for it to be fetched from memory,
this is going to take forever.
So what we're seeing here is that this instruction will just wait.
Maybe the CPU could run one or two of the instructions below it, but actually it can't, because it needs to compare what we're reading from memory.
So we're reading from whatever is in RAX into EDX,
and then we're comparing that.
So we can't really move on.
I need to wait for this to happen.
So we're spending a lot of time here on memory loads,
which is actually also kind of a good thing
if we're thinking about the design we're using.
50% of the time we're just spending on waiting.
We're not going to be able to optimize this
unless we have a different structure in the hash table.
But if it's sufficiently large, everything
is just going to be memory bound at some point.
OK.
Actually, I can also show you something else.
I showed you how to get there that way; there's also perf annotate.
How do I spell this? Annotate.
Why was that wrong? Oh, okay.
Yes, and it just goes directly to where we were, and we can see the mov.
There's the same mov instruction again.
So it's just a different way to get there.
Okay, with this... yes, there's a question, yes?
So if I want to optimize and reduce my CPU cycles, is there a way in Google Benchmark to say it should pre-warm and fill everything into buffers first, so that I don't get any cold loads and stores?
So in Google Benchmark, first of all, you should generally do multiple iterations.
You can do multiple repetitions.
You can play around with the setup time.
And I think there is actually an explicit warm-up time, too, that you can set.
So basically, if I go here, let me see.
So in Google Benchmark, I didn't really show you this.
I glossed over this.
So this is basically the main benchmark loop.
I'm writing for auto underscore in state, because we don't care about the loop variable itself.
This is how I tell Google Benchmark: please run my benchmark.
And this will then run for x iterations, for a given amount of time or a certain number of iterations.
And this can be done a few times without it measuring anything while it warms up.
So then you're kind of executing it in a realistic setup.
Yeah.
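A self-contained sketch of such a benchmark; the helper functions are made-up stand-ins, and MinWarmUpTime needs a reasonably recent Google Benchmark version:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Tiny stand-ins for the lecture's code so this sketch compiles on its own.
static std::vector<std::string> make_side(int n) {
  std::vector<std::string> v;
  for (int i = 0; i < n; ++i) v.push_back("value_" + std::to_string(i));
  return v;
}

static uint64_t count_tuples_with_join_partners(const std::vector<std::string>& left,
                                                const std::vector<std::string>& right) {
  std::unordered_set<std::string> build(right.begin(), right.end());
  uint64_t count = 0;
  for (const std::string& value : left) count += build.count(value);
  return count;
}

static void BM_JoinBenchmark(benchmark::State& state) {
  // Build the inputs once, outside the measured loop.
  auto left = make_side(10000);
  auto right = make_side(5000);
  for (auto _ : state) {  // the main benchmark loop: for (auto _ : state)
    benchmark::DoNotOptimize(count_tuples_with_join_partners(left, right));
  }
}
// Repetitions re-runs the whole benchmark several times; MinWarmUpTime (available
// in newer Google Benchmark versions) runs the loop unmeasured first to warm up.
BENCHMARK(BM_JoinBenchmark)->Repetitions(5)->MinWarmUpTime(0.5);
BENCHMARK_MAIN();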
So again, I definitely encourage you to check this.
If you're curious on a bit more complex example of this,
feel free to also check out on our group's GitHub page
the C++ repositories that were mainly done by me.
You can find a lot of Google Benchmark setups in there, because, yeah, I use Google Benchmark a lot and it's very helpful for figuring things out.
Okay, just one tiny bit of theory at the end.
In the example I just showed you, at the end it was quite nice, because you could see we're spending 50% of the time on the mov instruction.
Sometimes, what you will actually see is that we're spending 50% of the time on the compare instruction instead.
That's not the case in this example, but it's something you may see while profiling.
And this is actually quite common because of how this information is gathered. We kind of know that it's the mov instruction: we know the compare instruction is not slow, it's a very fast instruction, and we know the mov is the one that's slow. But that's hard to guess if you don't know about it. And the reason for this is something called skid.
So this is the problem that there's a delay, like a physical delay, between the time there's an interrupt and the time the information is actually stored.
So there's an interrupt, and by that time, my instruction pointer might already be on the next instruction.
So in this case, my mov instruction is executing, but my instruction pointer has already moved on to the compare.
And then it says: oh, I'm spending time on this compare, while the interrupt was actually triggered while still in the mov instruction.
So it does happen that the wrong instruction gets recorded.
If you're looking at it at this level, then first of all, it's very helpful just to have a general understanding of how these things work.
But second of all, maybe something like the uops.info instruction table page can help.
Actually, I might just be able to open this, just for the sake of showing it.
So this is the website here. I don't know where my mouse is.
So we have this nice table here.
Let's just look at compare.
I mean, there are a lot of entries here, and we can kind of see the throughput and the latency of these instructions.
And then there are the mov instructions. It's not showing the latency for memory moves, is it?
Damn it.
I was hoping to show you something else.
Never mind.
So this might be helpful if you're looking at instructions
and how long they take and to see if this actually aligns.
Again, it's always kind of a figuring out and playing
together with this.
And if you're interested in very specific instructions,
I said at the beginning that perf uses the clock cycles
to trigger overflows and then get the information.
If you're interested in very specific events triggering
things, so if I'm very interested in which
instructions generate cache misses or branch mispredictions or something like that.
I can use this colon P in perf behind the event.
So I showed you earlier that I can do -e, give me the L1 dcache stores.
And I can put a :p behind that, or :pp or :ppp, which gives us higher precision.
That basically says: okay, explicitly track not the clock cycle on which something happened, but the exact instruction that gave me the cache miss.
And this will give you more precise output for these things.
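For example, something along these lines; the exact event names and whether precise sampling is supported depend on your CPU and kernel, and perf list shows what's available:

perf list                                            # list the events this machine exposes
perf record -e L1-dcache-stores:p ./join_benchmark   # sample that event with precise attribution
perf report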
By default, this happens for clock cycles.
And there's always a bit of timing issues in there.
But for this, again, at some point,
you need to start googling or checking the documentation.
So there's this 5,000 page Intel manual that has all the information you will ever need
about developing stuff for Intel.
And they have a long list of things there.
And that's why I also didn't have anything on the slides here, because it very, very
much depends on the system you're running.
Okay.
So just to wrap up, perfectly in time, I guess.
Today we looked at profiling.
We kind of looked at instrumentation and how to do it manually and which tools do that, at sampling, and at what hardware counters are.
These are the three things that are important to remember.
Hardware counters, if you're looking at this, are your friend.
And in the hands-on part, we looked at using perf: how to play around with it, what we can see, what we maybe cannot see, and how to navigate it a bit.
Plus one or two small hints on the setup for profiling in C++ that are important to remember.
Just in case you haven't heard of it: there's something in CMake, which a lot of C++ programs are built with today, called RelWithDebInfo, which says exactly that: please build in release mode, but give me debug symbols so that I can do this. That's very helpful.
Because otherwise, there's no way for you
to know what's going on.
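Concretely, that's just the build type when configuring, for example:

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -S . -B build   # optimized build that keeps debug symbols
cmake --build build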
If I had more time, I would show you
what the output is without debug symbols.
It's very messy.
OK, so next session will be on Tuesday,
and we will discuss the solution of task number one.
And we will introduce task number two.
This session will not be recorded,
because we're discussing the solution.
And if I have a recording of the solution online,
then next year we cannot use this task again.
Next Wednesday, there will be a lecture
on multi-core parallelism, not given by me.
And with that, are there any questions, comments
on profiling, anything we covered today,
or any other lecture, I guess, also, in regard to profiling?
OK, no questions.
Perfect.
I wrote in Moodle, there's a setup.
You can all use this.
You can all use perf there.
All of the commands I ran here, this is all also in the Docker.
So this is not a different setup.
This is exactly what you can do.
You can SSH into it. You can run these things.
For this to work, I need to start Docker in privileged mode, so I'm not a hundred percent sure how much you can break.
Please, if you play around with this a bit, be a bit careful in general.
It's not too bad, but this is also the server everyone's developing on for the programming task, so try not to break anything or make something super broken.
You don't have admin rights in the Docker container, so you should be fine.
But I'm just saying, if you see something like this is a very destructive instruction or command,
maybe don't run it or check with me beforehand.
I'll leave them online for a bit so you can play around.
I got a request yesterday about perf in the containers for development.
I'll see if I can set this up for the next exercise, that you can also just have perf
while developing in them, which might be helpful.
We didn't have that in the setup yet.
I figured out now how to do this.
It's a bit hacky, because Docker usually doesn't want you to run perf inside of it; there are kernel mismatches and those kinds of things.
But if you either need it for the current task or want it for the next one: for the next one, I'll do it.
If you need it for the current one, let me know.
Write me an email, and we can do a quick restart of your container to actually get perf into it, if you're curious.
With that, thank you very much, and bye.