Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Profiling

Episode Date: May 21, 2024

...

Transcript
Starting point is 00:00:00 Okay, good morning everyone. In today's session I'm going to talk about profiling. The session is split into two parts. In the first part I'm giving you some background knowledge: I will talk a little bit about concepts that are relevant to understand profiling and also introduce you to some terminology. The second part is a hands-on part in which I want to demonstrate how to use profiling tools. We focus here on perf, the default Linux profiling tool, and we have a small program which executes a very simple query. I want to walk you through how to use perf to understand the application, where performance is spent, to identify hotspots where maybe some code takes more time than expected, and
Starting point is 00:00:55 how to optimize in the end. So this session is also recorded. Feel free to watch it and to execute the commands and pause it as you please. If you want to try to also run the commands during the session it might be a little bit tricky because I will sometimes jump back and forth so I recommend trying it at home. Take your time and as I mentioned stop the video as you please. Okay a brief overview where we are. So we are still in the domain of the CPU and main memory. Also, so in the last week, we had lectures. Yeah, we had the lectures about execution models and data structures. And today, we have the profiling session,
Starting point is 00:01:42 in which we are still in the domain of CPU and main memory. There are also profiling tools for other aspects, for example, to profile GPUs and networking and this and so on. But we will not cover this in today's session. And here's just the basic agenda that should be familiar to you. After today's profiling session tomorrow then we will start with multi-core one lectures next week on tuesday there is the task two presentation and in the same session we will also discuss the solution of the first task and then
Starting point is 00:02:20 continue with a multi-core lecture before then the second half of this course starts. Okay, again in this lecture we will have two parts. In the first I share some common concepts behind profiling and some tools that we are using at the chair and that I'm a little bit familiar with. And please know that by no means this is an extensive list of everything that exists out there. And we also, besides some tools that we're using, we will also share some setup that we've been using at the chair. And the second part is a hands-on profiling session in which we profile a simple query implementation in C++. This session is mainly based on the Unix environment and the focus is here on profiling
Starting point is 00:03:15 on a Linux server and we also have a strong focus on C++ for two reasons. First, because in this course we will mainly develop with C++ and second I'm also not so familiar with other profiling tools in other languages but I guess most of this on most of these things are also applicable for other languages for example you can also use perf for rust and if you want to work with a JVM or Java, you probably might find some other tools. And yeah, again, if you're watching this video right now, feel free to watch the video, run the commands and watch what's happened, try to figure out how perf works and get a feeling for the tool.
Starting point is 00:04:03 Learning these profiling tools is also just a skill that you can practice. Maybe there was not a big need for it for you so far, but at the end of the day it could really be helpful if you want to drill down on performance issues and identify where you can improve performance. This is rather a very light introduction to the topic, just to introduce you to some tools and how to use them. And if you're already a perf expert, then you're probably not going to learn much in
Starting point is 00:04:41 this session. And we just want to show you using these kinds of tools can help you in your application development. And with that, we can already start with the concepts. So here, I just took one of the definitions. So we took this from Wikipedia, since I think it's quite nice. So in profiling in general, we want to analyze something about the program. And the key part here is that it's a dynamic analysis.
Starting point is 00:05:15 So it's nothing that we need to do ahead of time. So we just run the program and then do something to inspect what's happening. And there are different metrics, different aspects that can be of interest. We're usually interested in the time and space complexity, meaning how long does our application run? How much memory does it use?
Starting point is 00:05:36 Which instruction is used? How often are these instruction used? Which functions are called? How often are they called? Stuff like that. And there's a wide spectrum of what we can do. And this is also depending on your goals. So depending on what's your objective in profiling,
Starting point is 00:05:55 you might choose different tools that are out there. Why do we profile? So there are mainly two reasons. First, we want to understand the performance. So in this aspect, we do not want to figure out where is the most time spent and then straight go for optimization. We just want to investigate the application and just identify where performance is spent. This is also especially relevant for us at our chair
Starting point is 00:06:28 where we do research and it's really important for us to be able to explain why our approach that we suggest are better than another one and it's really important to understand where performance is spent and it's important to know where more time is spent and where less time is spent.
Starting point is 00:06:52 And this does not necessarily mean that we want to optimize in this step. We just want to understand the behavior. With profiling, we can also identify bugs. For example, memory leaks. Assume that you profile the memory consumption of an application and you see that the memory consumption is increasing and increasing and increasing. Then there might be or there's probably a memory leak. And then we can identify or further drill down and try to find the cause of this issue of this bug.
Starting point is 00:07:26 And the second step after I've understood the performance, I can optimize or tune the application. So for example, I found a method that is very slow. And then I spent some time on optimizing it. And this is also something that we will go through later in the example. And we're basically trying to see here, where are we spending time?
Starting point is 00:07:49 Or what are we seeing that is unexpected? And based on our findings, in the first step when we're optimizing our application, this optimization can be automatic or manual. Usually, it is manual offered, meaning we have this repetitive cycle in which we run a command, we profile the execution time or other metrics of our application.
Starting point is 00:08:19 Then we see where the time is spent or where we have too much memory consumption. Then we optimize the code accordingly and run it again. So it's rather a manual process. But this can also be automated if you, for example, think about the Java virtual machine, which profiles our application while it is running. And then based on some characteristics,
Starting point is 00:08:41 it decides if it wants to JIT a function or not and there are basically two types of profiling tools one type of tools is based on instrumentation and the other on sampling and I will also briefly explain to you what this means okay first there is tracing and instrumentation. These terms are kind of intertwined a little bit so that depending on the literature, so the terminology is not always 100% strict or clear. Tracing in general means that we want to observe
Starting point is 00:09:22 the execution of the application via traces. And traces are basically logs. There is a slight difference. So I assume that most of you are familiar with logs or that you already have implemented some logging or used logging frameworks. And when you log something, then you usually output a semantically important event. For example, the user successfully logged in, or you have an error 500 code and you want to log it because it's
Starting point is 00:09:54 again an important semantic event that you want the user or the developer to know about. Traces, in contrast, usually capture the flow of something. So they rather output information that allows you to capture the flow of the application. If I want to really dig into something in more detail, if I want to see what function was called first, what other function was called, maybe what other sub-functions were called by a previous function — that would rather be tracing, so it's on a more fine-granular level. And I can also, for example, in the different traces, add some request ID so that I can see for certain
Starting point is 00:10:44 external requests that came in, that this request triggered a certain sequence of functions in my program, which I can then identify with my traces. You also might know from some logging frameworks these different priority levels, with fatal and error at the top, followed by warning, info, and debug, and the lowest priority, the lowest level, is tracing or trace. This usually generates a lot of messages, so it also has a higher overhead, but it gives you the most details about your application.
Starting point is 00:11:33 And instrumentation, so that was about the traces and tracing. And instrumentation is basically modifying your code or modifying your application in a way that these traces are generated. And there are two different types of instrumentation. One is source instrumentation in which you modify the source code. And binary instrumentation means that you modify the binary so you are not dealing with your actual source code usually you will be so
Starting point is 00:12:09 binary instrumentation is also a little bit more difficult so what's usually done is the source instrumentation that you modify your application code is more common because it's easier but a binary instrumentation also definitely exists. For example, there are also some profiling tools out there that use a binary instrumentation, meaning that they are simulating the instruction that the application would execute. They modify it, and they add some stuff, and then execute the modified version of these instructions.
Starting point is 00:12:41 And examples are, as I mentioned before, JITing in, for example, Java virtual machines, or also there's a profile tool which is quite powerful. It's called Valgrind. This also uses binary instrumentation. Okay, let's look at a small example. We have our function foo, which also calls some function internally.
Starting point is 00:13:06 And now we want to know how much time it takes to execute the internal or the inner functions bar and baz. And we also would like to know how often the function baz is called. What we could do now with source instrumentation is simply modify our source code. So in here we simply add a timer before we execute the first inner function and then capture the timestamp after calling the second function.
Starting point is 00:13:39 And we also, in our baz function, add a counter that we increase. This already allows us to measure the runtime of the inner functions and also to count how often the baz function is called. The advantage here is that we have full control over it, so we can modify the code the way we want to. That's great, but it also has a disadvantage: it can have a high overhead. It doesn't necessarily have a high overhead, but it can, depending on how much we modify.
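To make this concrete, here is a minimal, hedged sketch of such source instrumentation; the function names foo, bar, and baz and the global metric variables are purely illustrative, not the lecture's actual code:

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

// Illustrative metric storage (assumes a single-threaded program).
static uint64_t baz_call_count = 0;
static std::chrono::nanoseconds time_in_inner_functions{0};

void bar() { /* some work */ }

void baz() {
    ++baz_call_count;  // count how often baz() is called
    /* some work */
}

void foo() {
    // Timestamp before the first inner call, timestamp after the second one.
    const auto start = std::chrono::steady_clock::now();
    bar();
    baz();
    const auto end = std::chrono::steady_clock::now();
    time_in_inner_functions +=
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
}

int main() {
    for (int i = 0; i < 1000; ++i) foo();
    std::cout << "baz calls: " << baz_call_count << ", time in bar+baz: "
              << time_in_inner_functions.count() << " ns\n";
}
```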
Starting point is 00:14:20 Also, if we execute the timer to get the current timestamp, this costs some execution time, and the same holds when we want to access our counters. It can also impact performance optimizations, because we simply add code, which results in additional assembly instructions that then also have a certain execution time. And, as I will talk about later, it can also be the case that instructions are reordered. I briefly want to demonstrate the impact of this kind of modification. Where's my cursor? Okay.
Starting point is 00:15:08 So here we are in Godbolt. You should have seen this already in one or the other lecture. It's a very nice tool, which basically gives us the assembly instructions for given code. And in the top left corner, we have our code without any source instrumentation.
Starting point is 00:15:26 And in the middle here, we see the instructions that are executed, or that our compiler generates, when we compile with Clang 16 in this case. And at the bottom, this is the same code with our additional instrumentation code. So we have some includes that we need to add for capturing our metrics. We need this unordered map, which is our counter for the number of baz calls, but which also stores the execution time of the functions in foo.
Starting point is 00:16:05 And yeah, in the very right, you can just to show you that this is significantly longer than the solution in the middle, or than the assemblies instructions in the middle. So here we have a list of about 21 instructions. And here, if we just add these rather simple code fragments, then we have actually a much, much longer set of instructions here with about 110 lines.
Starting point is 00:16:29 The details do not matter that much. Just as a demonstration, it can be significantly longer just when adding some time measurement and counters. Okay, so there are some pitfalls with source instrumentation that you should definitely be aware of. So one thing is for very short measurements that the CPU, so that the compiler and also the CPU can reorder instructions. Some people are not aware of that fact that these two components can reorder the instructions,
Starting point is 00:17:09 and especially when you want to capture or measure only a very small sequence of instructions, maybe you only want to measure one read memory access or another very short instruction. If you have a timer before and after this instruction, it could be the case that your compiler or CPU reorders the instructions. So then in the end, the resulting instruction order is measuring something that you actually don't want to measure. And it doesn't measure anything reasonable. You should definitely be aware of that and if you're yeah just Google online about high precision time measurements you might find this counter so this read timestamp counter here, so this allows you to measure the number of cycles that the CPU spent on executing and it's increased per core. So you have this counter for each individual core.
Starting point is 00:18:17 But yes, so it is good for very precise measurements. That's at least what you might think in the first place, but it's actually very hard to get right. So you will find a lot of stack overflows post about this that is quite tricky because first, these are not consistent with regards to clock frequency. So if you have two different cores with different frequencies, then this might already break. Or at least you should be aware of that. So applications and threads could also
Starting point is 00:18:49 migrate to other cores. And if you have different frequencies, then the meaning of your counter could be different, especially when you translate the number of cycles to execution time. When you only focus on the number of cycles, then sure, your measurements are still fine. You can still use the counter.
Starting point is 00:19:08 But if you want to translate the number of cycles into execution time, you need to be aware of the different frequencies of the cores, which can be very tricky when processes and threads are moved around. They're also not synchronized between cores, so you might need to individually track the counters of different cores. And these are also, so if you want to read the counter,
Starting point is 00:19:39 it's also a non-serializing instruction. This means that the read itself can be reordered. So I personally have stayed away from it, because there are many voices online saying it's dangerous, it's difficult, but just so that you're also aware of the potential option here. Another thing that can easily go wrong is that you use the wrong clock. For example, the C++ standard library has this high_resolution_clock. But the documentation also says, and I think it's rather at the bottom in the notes section,
Starting point is 00:20:15 that it's ambiguous. It's often an alias either for steady_clock or for system_clock. And high_resolution_clock and also system_clock do not guarantee that the time is monotonic, so theoretically the time could also go backwards, and then your measurement is incorrect. So if you want to do duration measurements, cppreference recommends using the steady_clock, because here we have the guarantee that the clock is monotonic.
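A small sketch of what such a duration measurement could look like with steady_clock (only an illustration; the is_steady check just makes the guarantee visible):

```cpp
#include <chrono>
#include <iostream>

int main() {
    using namespace std::chrono;
    // steady_clock guarantees monotonic time; high_resolution_clock may just be
    // an alias for system_clock, whose time can jump backwards (e.g., NTP).
    static_assert(steady_clock::is_steady);
    std::cout << "high_resolution_clock is steady: "
              << high_resolution_clock::is_steady << '\n';

    const auto start = steady_clock::now();
    volatile long sum = 0;
    for (long i = 0; i < 1'000'000; ++i) sum += i;  // code under measurement
    const auto end = steady_clock::now();
    std::cout << duration_cast<microseconds>(end - start).count() << " us\n";
}
```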
Starting point is 00:21:04 There are some further shortcomings with source instrumentation. We saw the modified code before, and this is just additional code in our application, meaning the code that we add here also has the same privileges as our application code. And depending on the privileges of the user with which the application is executed, you might have limited access; you might not be able to access some kernel information here. Just be aware of that. And the second aspect is, as I mentioned before, it can be very time consuming. First, the runtime overhead can be high depending on how much code you add, but the profiling process for the developer is also quite time consuming, since you have to modify the code, then you need to recompile it.
Starting point is 00:21:52 You need to execute it to get your additional information that you added with your source instrumentation. And then you might modify it again, recompile and so on. Especially when we profile we are often interested in hot Paths. We are looking into tight loops. If we add instrumentation code in our hot paths, then this Might even more increase the overhead of our instrumentation code. And there's another approach. It's called sampling.
Starting point is 00:22:31 And so this is one approach to address this higher overhead. There are two types of sampling. Basically, as you can see in the example, we're not measuring a certain metric for every execution of a certain instruction or function call or whatever we want to measure. We only measure it, for example, event-driven, meaning only every tenth, hundredth, or thousandth event.
Starting point is 00:23:08 Or we can also do it time-driven, that every 10 milliseconds, 100 milliseconds, we want to track that specific metric. It is less accurate than the full instrumentation, but it also has less runtime overhead, because we're not doing it every time. And in this example, just imagine that you have some counter that is stored and increased somewhere. And then in this case, every thousandth iteration here,
Starting point is 00:23:38 we want to track the execution time of the two functions, bar and baz, that are executed in our foo function. And also, okay, we don't have the additional counter here anymore. So while this in general reduces the execution overhead, we still need to modify our code manually.
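As a quick illustration, the sampled variant of the earlier instrumentation sketch might look like this (again only a sketch with made-up names; only every thousandth call is measured):

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

static uint64_t call_count = 0;
static std::chrono::nanoseconds sampled_time{0};

void bar() { /* some work */ }
void baz() { /* some work */ }

void foo() {
    // Event-driven sampling: take timestamps only on every 1000th invocation,
    // so the measurement overhead is paid rarely instead of on every call.
    if (++call_count % 1000 == 0) {
        const auto start = std::chrono::steady_clock::now();
        bar();
        baz();
        sampled_time += std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start);
    } else {
        bar();
        baz();
    }
}

int main() {
    for (int i = 0; i < 100'000; ++i) foo();
    std::cout << "sampled time in bar+baz: " << sampled_time.count() << " ns\n";
}
```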
Starting point is 00:24:20 Okay, one very helpful aspect, or something that you should definitely have in mind when profiling applications, are hardware performance counters. Hardware performance counters are special registers to track hardware events. They're called performance monitoring counters (PMCs), and they can have different semantics. There are fixed counters that have a fixed semantic that you cannot change, for example the number of core cycles, but you also have general-purpose programmable counters that you can configure the way you want. There is a specific set of events that can be tracked, but this really depends on the platform and the CPU you're using. So if you're interested in that kind of hardware performance counter measurement, you should definitely look it up in the documentation,
Starting point is 00:24:59 what your platform, what your CPU supports here. And it is very efficient compared to, for example, source code instrumentation, because it's hardware supported. Since these are registers — the CPU usually has a performance monitoring unit (PMU) where these performance counters are located — it has basically no overhead for the application to increase them. It has a certain overhead, or it takes some time, to read them, but it's rather small. But maintaining these counters,
Starting point is 00:25:35 these metrics, it does not add additional overhead, since it's managed by the hardware. And on x86 platforms, for Intel and AMD for example, you have this RDPMC instruction (read performance monitoring counter), which allows you to read the current value of a given performance counter. Just to give you some examples: on Intel Skylake or Sapphire Rapids CPUs, you have four programmable registers per core, or eight registers if you disable hyper-threading. On Intel Ice Lake, you have eight with some restrictions, and for AMD, for example for the Zen 2, Zen 3, and Zen 4 architectures, you have six programmable registers per core. And just to give you some example events here: there could be the number of level one cache misses that you want to track, the number
Starting point is 00:26:40 of cycles that have been executed by the core, the number of branch mispredictions. So these are very low-level metrics which these performance counters allow you to track. If you want to track, for example, 10 events, which would exceed the number of registers in all of these examples, then what applications or
Starting point is 00:27:05 profilers using these metrics usually do is time multiplexing: you have iterations of a given duration, and for each iteration you measure a different set of hardware counters. So in one iteration, for example, you measure the level one cache misses, and in a different iteration you measure the number of branch mispredictions or another event of your choice. This allows time multiplexing in a round-robin fashion, but it is also a little bit less accurate. But I would say for most use cases,
Starting point is 00:27:49 it's still sufficient and detailed enough to get insights about where your application spends performance. There are two different ways how you can use these performance Counters. One easy aspect or one easy Approach is that you simply count these. So you have an application and starting from the beginning of Your application, you increase these counters whenever the Recording or the corresponding events happen.
Starting point is 00:28:23 And then at the end of the Application you get an overview of the values of your specified Counters and this basically allows you to characterize the Application with a set of specified events and another Approach is that you can do event-based sampling. In this approach, let's say the profiling tool has a certain counter, which increases for every event that you want to track. And at some point, oh no, sorry, you
Starting point is 00:29:02 have a register with the number of cycles, and this is increased over time. When this register overflows at some point, each overflow triggers an interrupt, and when this interrupt is processed, your profiling tool saves the instruction pointer to memory; it might also save some other information such as some stack frames or maybe the entire stack. And then it starts again: it resets the counter and starts increasing it again, and when it overflows again, it again basically stores a snapshot of your instruction pointer and of your stack.
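As an illustration of the simpler "counting" use of these registers (not the sampling mode just described), here is a hedged, Linux-only sketch using the perf_event_open system call, which is roughly what perf stat builds on; the chosen event and attributes are just an example:

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cstdint>
#include <iostream>

int main() {
    // Configure one hardware counter: CPU cycles in user space only.
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  // could also be cache misses, branches, ...
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    // perf_event_open has no libc wrapper; monitor the calling thread on any CPU.
    const int fd = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    if (fd < 0) { std::cerr << "perf_event_open failed\n"; return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < 10'000'000; ++i) sum += i;  // work to characterize

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    std::uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) == sizeof(cycles)) {
        std::cout << "cycles: " << cycles << '\n';
    }
    close(fd);
}
```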
Starting point is 00:29:48 This event-based sampling is often used in profiling tools to find hotspots. This is also what perf supports and what we will also see later on. With this event-based sampling, there is one issue that is called skid — I will also talk about it later on in more detail. Skid basically means that there can be a delay between when the actual overflow happens and when the interrupt is processed. So let's say you have a certain interrupt in which you want to store some context for your counters — you want to save some part of your stack — but the program continued.
Starting point is 00:30:39 So the instruction pointer is pointing to another instruction that was executed after the actual instruction that triggered the overflow. And then, when we are now recording the data, collecting the data, the instruction pointer might actually point to the incorrect instruction. This is called skid. It is actually a widely known issue for performance analysis tools or for profiling, and it can be reduced by, for example, precise event-based sampling (PEBS). This is a feature of Intel CPUs. It allows storing instruction pointers in a memory buffer,
Starting point is 00:31:21 so there's no interrupt issued for each sample. But it is also not supported for all events. AMD and ARM have some similar approaches: for AMD it's called instruction-based sampling (IBS), and ARM has something called CoreSight. I'm not super familiar with CoreSight and instruction-based sampling. And this is also well documented, so I really encourage you, if you want to look deeper into hardware counters, to look up the documentation. There's also one quite huge documentation
Starting point is 00:32:00 that I have listed later on, which also explains precise event-based sampling in more detail. But what I basically just want to show you or tell you is that this kind of approach allows reducing the skid and having even more precise measurements. And this is also often supported by profiling tools. For example, perf also allows increasing the precision.
Starting point is 00:32:29 And for Intel CPUs, precise event-based sampling is then used. Okay. Now let's talk about some profiling tools and some companion tools, which are basically not profiling tools, but are also very helpful when it comes to performance analysis. First, as I mentioned, perf. Perf is the basic Linux profiling tool. So if your program is running on Linux, then this is the tool to go for.
Starting point is 00:32:59 It abstracts away all the hardware and software events. So you do not need to be aware of the CPU you're using. So it has a specific predefined set of events that you can use. But if you, for example, are very interested in a specific event of a given CPU, let's say there's a new Intel AMD CPU that you want to run your application on and it has some very new counters that might not have been integrated into perf, then you can also look up the ID of this event and also add it to perf, but then there's no semantic name for it.
Starting point is 00:33:40 It gives you information about cache misses, branch mispredictions, so it also utilizes these hardware performance counters, but there's also a set of other kernel information. That's why I want to show you this scary picture here for short. So here you see the Linux kernel with its stack and different aspects that you can track and that also perf has access to. So for example, you can get information about syscall, you can get information about what do we have page faults about the file system. You can also get some networking information. So Perf basically has access to the performance monitoring counters that you configured and also to these kernel metrics.
Starting point is 00:34:41 So there are a lot of events. I recommend you to just type perf list once, and there you will see, I think it's multiple hundred events listed there. And just for your information, by default, if you don't change it, perf samples with a rate of 4,000 hertz. But you, of course, can also modify this frequency. And I briefly talked about these cycle registers that overflow.
Starting point is 00:35:12 And when they overflow, the profiling tool then stores some snapshots. And for Intel, this is, for example, unhalted core cycles. And for AMD, it's CPU clock unhalted core cycles, and for AMD it's CPU clock unhalted. I don't know exactly how ARM or PowerPC are doing it, but I would assume here that they have also similar approaches for that. For perf, it's well documented. So there's in the kernel wiki,
Starting point is 00:35:42 you can find a perf tutorial. Feel free to check it out and to get more information about it. And there's also a guy called Brendan Gregg, and I also got this figure from his website. He is very into performance analysis and does a lot with perf, and he has very nice resources, videos, book recommendations, and blog posts about performance analysis and also specifically about perf. There's also a nice video listed on his website about performance analysis at Netflix, in which he shows many different command line tools, basically one-liners that you can just execute and get either information about your system or also some approaches how you can easily improve the performance of your system or an application.
Starting point is 00:36:35 If you only type in perf, then you will see this long list of commands, but this is only a list of commonly used commands. So there are way more commands than just shown here. And the most interesting ones, or what many people are using, are perf record, perf report, and also perf stat. This is also what I'm going to show you later on in our hands-on session. You can either attach perf to a running program,
Starting point is 00:37:12 but it's only allowed to attach it to your program. So you cannot attach it to a programs running by other users. You can also measure system-wide if you're not interested in one specific application, you can just start your command of choice here and collect data for the entire system. But of course you also need certain privileges for that and one aspect or one approach that's also often used which I will show you in a second is using perf for only a single application in which you basically run command for example perf record and then followed by the binary that you want to
Starting point is 00:37:57 execute. If you use perf stat, perf stat will basically just count hardware performance counters, as shown before. Where the metrics shown require it, it can of course also access the kernel information shown before, but it basically gives you, in the end, a set of counts for the events that you're interested in. So if I just execute perf stat for the command ls, I will have this overview. It gives me the time that it took to execute, the number of context switches, the number of CPU migrations — so how often a program was moved to another core —
Starting point is 00:38:49 the number of page faults, the number of cycles, the number of instructions. With the number of instructions, it also calculates the instructions per cycle. This is also a quite nice metric, for example in the context of memory accesses. So this can be an indicator if you're, for example, bound by memory — and there's, referencing Brendan Gregg again, also a nice
Starting point is 00:39:13 blog post about that on his website. The number of branches are given here and also the number of branch mispredictions. And this is the default set of events. So we have not specified a specific set of events that we're interested in. So therefore, PERF will fall back to this default set. But we can also specify a certain set of events that we're interested in. So for example, here for better measurements, we also want PERF to repeat the measurements 10 times and we specify two events that we're interested in here. It's level one data cache loads and level one data
Starting point is 00:39:54 cache stores. Okay. The second tool is Intel VTune. This is Intel's visual profiling tool. There's a desktop app, and there's also a web UI. The desktop app allows you to run on your local machine: you can profile the local machine, but you can also connect to a VTune backend that is located and running on a remote server, which you can then control with your desktop app.
Starting point is 00:40:27 And the web UI you have to connect to a VTune backend running on another server. And it also comes with the command line tool that I personally haven't used that much yet. There were some changes in terms of operating systems that are supported. So before it supported Linux, Windows and Mac OS. But this year Intel also dropped the support for Mac OS. So right now it only supports desktop applications for Linux and windows and yeah it only works for intel x86 but it has a lot of information and different modes so i really recommend you looking into it and if you ever worked with one of the servers in the data lab you can also find the vtune back end there which you can then use for the web ui and let me briefly very briefly show you um how vtune looks like i recently played around with it in a oops
Starting point is 00:41:43 over here. So I don't remember. All right. So the application that we're also profiling later on, I briefly wanted to do a hotspot analysis here. So basically I wanted to identify where most of my time is spent in this application. And VTune comes here with a summary, which now says something like 32.6% of my execution time was spent in that function.
Starting point is 00:42:12 It's somewhat related to an unordered set. There are also string allocations. So we will look into that later on, but it just gives me an overview of the functions that the application spent the most time in. And it also shows me something like a call graph. So I have different functions calls and the percentage of CPU time that the application spent on it.
Starting point is 00:42:39 And here, for example, we have a function called countTuplesWithJoinPartners, and then here I can also see the call graph for this and also see what other functions are called within this function. And what's also quite nice: it also gives me a flame graph. If you are a rather visual person, you can also have a look at it and see in what function time is spent. Okay. But as I mentioned, they dropped the support for macOS, and it does not, for example,
Starting point is 00:43:27 support ARM CPUs. For your macOS machine, if you have one, you could use Apple Instruments. It's also a profiling tool. It's Apple's visual profiling tool that is designed for macOS and iOS applications. And so you could, on an older machine, you could also still use it for Intel CPUs as I'm using it, but you can also, it also supports ARM CPUs. And it comes with Apple's Xcode environment.
Starting point is 00:44:00 And if you have a CMake project, it's actually quite nice that you can just simply generate this Xcode project via CMake commands if you're not familiar with it. And then if you have installed Xcode, you also have Apple Instruments, the profiler in there, which you can also then use without Xcode. And just to very briefly also show you that here I have the same or I profile the same application as before and it gives you some information over here about your CPU usage you also have you can it also annotates your source code with certain weights.
Starting point is 00:44:45 It gives you assembly code. And you can also, if we close this view, see also a call graph in which we see the count tuples with joint partners that we saw before. Yeah, just for completeness, if you want to work on Mac OS, you might want to look into Apple Instruments. Another tool, which is not a profiling tool, but also very helpful is Godbolt. I showed it to you before, and we also used it in some other lectures.
Starting point is 00:45:20 It's a very nice tool, which officially is called Compiler Explorer, but it's developed by Matt Godbolt, and the URL is also godbolt.org, so many people just call it Godbolt. And the idea here is that, as shown before, it allows you to see what assembly instructions a compiler would generate based on your source code. And it supports many, many, many languages. It supports many, many, many compilers.
Starting point is 00:45:49 So you have different versions, or all the versions, of GCC and Clang compilers in there for x86-64. You can even look into PowerPC compilers. So I really encourage you to look into Godbolt. It's quite nice. And it also allows you to include public headers from, for example, a GitHub repository.
Starting point is 00:46:15 So if you want to look into some assembly instructions, but you need a certain header, you can simply click or open the raw file in GitHub and then include it similar as you would include, for example, a system library or third party library you can just include and then the link to the raw file, which is quite nice. And what I also just figured out recently, you can also link to common libraries. So here at the picture on the right, you see this library button. And if you click that, there is a long list of commonly used libraries.
Starting point is 00:46:54 So for example, you can also link to Google Benchmark or Google Test. Another tool that can be helpful, but it's very specific, is uops.info (pronounced "micro-ops info"). It's maintained and provided by some folks from Saarland University. And it gives you information about x86 instructions, for example the throughput, the latency, and also the number of micro-operations. Some CPU architectures, for example Intel CPUs, internally use micro-operations: one assembly instruction is translated into a set of micro-operations.
Starting point is 00:47:37 And this, where our website can give you some information about the latency throughput and how many micro instructions one instruction corresponds to. I do not have experience with it that much, but recently at our research chair, we also did a research project on SIMD instructions and during that project,
Starting point is 00:47:59 we also looked multiple times into this website and into the tables. It was actually quite helpful if you want to dig deep into low-level details of your assembly instructions. Another companion tool is Google Benchmark. Google Benchmark is probably, I think, the most widely used C++ microbenchmark framework. We also use it for several projects, and it's also popular. It has something called DoNotOptimize, which is also quite popular.
Starting point is 00:48:36 So DoNotOptimize prevents the compiler from doing things that you don't want it to do. In certain scenarios, your compiler might optimize something out. For example, if you're in a benchmark scenario and you only want to measure read performance — a read instruction or a set of similar read instructions — then this might usually be optimized out, because the result is not further used in our application. And in this case, you can use DoNotOptimize, which applies some tricks so that the compiler will not optimize it out.
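A small, hedged sketch of how this typically looks in a Google Benchmark microbenchmark (the benchmark name and workload are made up for illustration):

```cpp
#include <benchmark/benchmark.h>

#include <numeric>
#include <vector>

static void BM_SumVector(benchmark::State& state) {
    std::vector<int> data(state.range(0), 1);
    for (auto _ : state) {
        int sum = std::accumulate(data.begin(), data.end(), 0);
        // Without this, the compiler may drop the whole computation,
        // because 'sum' is never used afterwards.
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_SumVector)->Arg(1 << 16);

BENCHMARK_MAIN();
```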
Starting point is 00:49:18 There's also a nice video that we linked here. Feel free to check it out in which a compiler developer basically explains what tricks you can apply so that certain things or that some code, some instructions do not get optimized out, even though the values are not further used in your application. There's also a very small online version. It's called QuickBench, which allows you to set up
Starting point is 00:49:49 benchmark scenario online. You can use it to briefly check if, for example, one data structure would be better for a certain workload than another one, or if a function of one data structure might be more efficient than another one. But if you are interested in very low-level performance measurements in which you also want to take the used hardware into account, then this might not be the ideal tool,
Starting point is 00:50:17 because you cannot be sure about the hardware that's used. Then there's another list of some tools that we haven't mentioned yet. We saw VTune before — this is from Intel, and AMD also has a similar tool. It's called uProf. It's AMD's visual profiling tool, similar to VTune. Then there's Valgrind. I mentioned it before.
Starting point is 00:50:41 Briefly, this actually performs dynamic binary instrumentation for Linux. It's a very powerful tool, which also has a sub-feature called Callgrind that allows you to create a detailed call graph, and it has been out there for — I'm not sure — more than 10, more than 15 years, I think. Then there are also static code analysis tools. In the example that I will show you just in a minute, there are quite obvious things that can be improved. And if you're familiar with C++, you will probably directly identify what's the issue here and what might be improved.
Starting point is 00:51:23 And many of such issues can already be identified with static code analysis tools, for example clang-tidy or cpplint — so some linter tools. Then there are address and thread sanitizers, which allow you to basically find bugs, to identify memory-addressing and threading issues. And as we saw before in VTune,
Starting point is 00:51:48 flame graphs can also be helpful if you're a visual person. If you want to visualize your profiling results, for example, you can also use flame graphs. There's also a tool that you can use for your perf output to visualize, for example, a call graph. If you want to play around with hardware counters, or if you also want to write your own application that can access hardware performance counters,
Starting point is 00:52:16 you might want to look into PAPI. It stands for Performance Application Programming Interface. It allows you to gather performance counter information. And there are many, many, many more. As I mentioned, this should not be an exhaustive list here, just a way to show you some tools that we have been using at the research chair and that we like for performance analysis. And with that, I will briefly start with our second part, the hands-on profiling session. Okay, in our scenario we have a C++ application which performs a semi-join.
Starting point is 00:52:58 The join is executed on two string columns, and we basically want to count how many tuples of the right side have a join partner on the left side. It basically executes two loops: first, we iterate over the tuples of the left side and create a hash map on them, and then we iterate over the tuples of the right side and check if each tuple has a matching partner on the other side. So basically we do a hash map lookup, and if we have a partner, or if we have a value stored in the hash map, then we just increase our counter.
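The actual repository code is shown on screen later; roughly, the initial version looks something like the following sketch (function and variable names are assumptions here). Note the by-value parameters and by-value loop variables — exactly the copies that the session will track down with perf below:

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Rough sketch of the initial implementation discussed below (names assumed).
uint64_t count_tuples_with_join_partners(std::vector<std::string> left_side,
                                         std::vector<std::string> right_side) {
    std::unordered_set<std::string> left_names;
    left_names.reserve(left_side.size());

    uint64_t num_join_partners = 0;
    // Build phase: insert all values of the left join column.
    for (auto value : left_side) {
        left_names.insert(value);
    }
    // Probe phase: count right-side tuples that have a partner on the left.
    for (auto value : right_side) {
        if (left_names.contains(value)) {
            ++num_join_partners;
        }
    }
    return num_join_partners;
}
```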
Starting point is 00:53:43 There are some things that you need to set up before you can profile with perf. First, you need to compile your binary with debug symbols, so with the -g compile flag. And you also should use -fno-omit-frame-pointer, because this ensures that the frame pointer is still kept in its register. If you do not do this, this register might be used as a general-purpose register. So therefore, this just ensures that you still
Starting point is 00:54:20 have the frame pointer in the register. And there are also some system configurations that you just need to set up on Linux. One is kernel.kptr_restrict, I think it is, and this allows accessing kernel symbols. And the second is kernel.perf_event_paranoid. You need to set this to zero or to minus one; with this configuration, you can access kernel
Starting point is 00:54:51 performance events at the user level also by unprivileged users. If you want to run this example by yourself, then you can SSH into your development container. There's also perf installed or you can ofH into your development container. There's also perf installed. Or you can, of course, also run it locally if you have a Linux machine. There is one important thing I will just post. I will post an announcement later on in Moodle. In your containers right now, there's perf installed.
Starting point is 00:55:22 But it's unfortunately not linked in one of the binary directories that is included in your path. So therefore you cannot just simply type perf and use it. But I will post a quick tutorial how you can either link it or add it to your path so that you can also just type perf and use perf. Otherwise, if you don't link it somewhere or add it to your path, you have to use the entire absolute path. But yeah, it's installed. You can use it. You will get some more details about it later on.
Starting point is 00:55:59 And yeah, on the machine, either on your development container or your local machine, first clone the repository and then navigate into the project's HTTP profiling session. With that, I simply need to mirror my screen so that I can see something here. Sometimes it takes a while with the screen here. We need to be a little bit patient. There it is. All right. So the folks in the back, is the font size large enough? Perfect.
Starting point is 00:57:00 Thanks. Okay. So what we basically first want to do — I have some printouts here so that I'm not getting confused — and which might be quite obvious for most of you: first, of course, we need to build it so that we can also run and profile something. So we create a build directory, okay, and CMake has this predefined build type RelWithDebInfo. This basically allows you to build your binary in release
Starting point is 00:57:51 mode, but it also keeps the debug symbols. So if you didn't have debug symbols and then tried to profile or to create a call graph, then you would just see very cryptic names that are not helpful or meaningful at all for us. So therefore, it's important to also have the debug information. And yeah, after we just build it, we can run it.
Starting point is 00:58:24 Of course, building it first. And now this gives us our joint benchmark binary. And this is also based on Google Benchmark. So this output might already be familiar to you if you have used it before. And now we see the total times here but we also have a metric that gives us the latency per tuple and it says 539 nanoseconds what you also can do since it's based on Google benchmark you can open oops that doesn't work you can add minus minus
Starting point is 00:59:02 help to get further information and if if you want to have, for example, median or mean measurements, you can also use minus minus benchmark repetitions. Let's say we want to have three repetitions of this benchmark execution. It takes a little bit longer of course. And then we see the three different iterations with their execution time and also the mean median and standard deviation. Okay. But for us right now for the demonstration, a single run is fine.
Starting point is 00:59:45 Okay, we have our implementation of our query and we have a certain number now that we can start with. And now what we could do is we simply look into our code. Oh yeah, just as a brief explanation, I will not go into details here, but here we have our main function in which we do some setup. So we generate data and then specify our benchmarks and run it.
Starting point is 01:00:14 Not so super important here is the benchmark function in which we execute our function that we want to measure. Some setup, some data generation, not super important. So this is the benchmarking code. Feel free to look into it if you want to, but the important aspect here is our count tuples with join partner function. Again, the people of you that are already quite familiar
Starting point is 01:00:37 with C++ might immediately see some flaws here that we might want to improve. But this is just for demonstration purposes to also see how we can use perf to find these issues and to improve the performance then. So we basically have our function here which gets the vector left side and vector right side with the values. These are the, if you assume a columnar database or a database with the columnar storage format, we have now the column stored in the vectors and we have an unordered set here,
Starting point is 01:01:13 left names, we allocate some memory. Here we initialize. I can also add the line numbers. So in line 16 we have our number of join partners, so we want to count how many tuples of the right side have a join partner in the left side. And we first iterate over our left side and add all the values to our unordered set, and then we iterate over the right side. And if we have a match, or if we have this value in our hash map, then we just simply increase our counter, and in the end we return it. And now, having attended the lectures of this course, we know that branching can be a performance bottleneck,
Starting point is 01:02:03 so we just go through the code and think, oh, what could we improve? Maybe branching here is not so good. So we decide that we go for predication, because that's something that we learned about performance improvement. What we basically do here is: instead of checking if the set contains the value, we take our counter of right tuples with a partner and increase it by the evaluation result of our contains statement.
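In code, the change just described is essentially the following (a fragment of the sketch from above, matching the state of the code at this point):

```cpp
// Before: explicit branch in the probe loop.
for (auto value : right_side) {
    if (left_names.contains(value)) {
        ++num_join_partners;
    }
}

// After ("predication"): add the boolean result of contains() directly,
// so there is no branch in the source code anymore.
for (auto value : right_side) {
    num_join_partners += left_names.contains(value);
}
```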
Starting point is 01:03:13 Now we just need to build again and run it. And we can see this didn't help much. So this was just guessing. We just randomly tried to find something in our code where we thought, okay, we might know something about performance optimization, so we want to apply it. And this is often what people mean by "premature optimization is the root of all evil". You just look through the code and you think there might be something
Starting point is 01:03:42 that I can improve, but it often does not lead into any improvement. In this case, the code is still not too bad to read, so the readability is still okay, but you also might add some complex code constructs here that maybe in the end actually do not give you the performance benefit that you're looking for. So i'm just removing it here. Going to the base version. And what we want to do instead is that we use perf to profile. So we want to do the event-based profiling. And this perf allows us to do it with perf record. And we then add the binary that we want to profile.
Starting point is 01:04:38 So this takes some time. And then we can... So now if we check our working directory, we see this entry or this file perf dot data and so this contains the profiling data now and we can view it with perf report and Now we have this view that we can navigate through with our arrow keys. And we see some entries and the time they spend in the very left column in percentages.
Starting point is 01:05:17 And we see that count tuples with join partners is at about 35.7%. And there is something malloc-specific, unlink chunk. So this does not give me so much information right now to understand what's going on. Therefore, perf also allows creating a call graph: you can add -g to perf record. So we need to execute perf record -g to tell perf that it should generate a call graph, and then also add the binary again. What it does now is that it uses the frame pointer when we create our snapshot, as shown before, to figure out what functions we are coming from and what function we are in right now, and then tries to generate a call graph based on the frame pointer.
Starting point is 01:06:26 And now we can call perf report again, and now we have a quite different view. Again, you can navigate with the arrow keys if you want to try it at home. And you can also see the plus signs at the very left now. So basically, it allows us now to expand each function. In main, for example, or maybe the more interesting one, we can go to benchmark join, bm join, and now we can expand it. I actually use the plus sign to expand it. You can also press H and get a help overview that tells you how you can do certain things in
Starting point is 01:07:07 perf. So instead of plus, you can also press E to expand it and collapse it. And yeah, here in our benchmark join function, we have something libc start call main. This is probably not in our control. So let us expand benchmark join. And here we have our function count tuples with join partners. And if we expand it, there is some 5% for memory movement and 4.6 for basic string operations. So this still doesn't look really helpful here. So I would not be sure how to continue
Starting point is 01:07:43 based on this information, or what aspect to improve in my code. So what we also can do is perf record --call-graph, and then in the dwarf mode. Before, we used the frame pointer to try to get some context information; what it's doing now is that at every sample it takes the stack and basically stores just a snapshot of it. And we could see in the command here, before, the 1024 — this indicates how many bytes of my stack should be stored for each snapshot. By default this is eight kibibytes, and we just want to store one kibibyte here, because this is sufficient for this example. And now if we look into our report again — this now takes some time, because now
Starting point is 01:08:46 PERF goes through all the samples that we store during report time. Before it figured out the context information based on the frame pointer during record time, but now during record time it only stores all the samples. And then during the report time, it goes through the samples and generates the context information so that it can create the call graph. And this takes some time. But now we are there. And now we also have a quite different view here. If we go again into our benchmark join function and expand it, but there's some additional time spent on basic string allocators related to a vector so this is not really clear to me right now why we have it's about nine percent a little bit more
Starting point is 01:10:01 than nine percent spent on these string allocations. And if we go back to our function: does anyone see where we do some string allocation that is somehow related to a vector? Any ideas? Maybe in the highlighted line 15 right now? Any guesses what the issue here is? Yes? "Maybe because the vectors containing the strings are passed by value." Right, exactly. So we're passing left side and right side by value.
Starting point is 01:10:47 So therefore, every time we call the function, we would copy the entire vectors and their values. And that's actually not necessary here. We don't need to copy all the values of our join columns. So therefore, we just want to switch them to references. And our assumption here would be that we then do not have that much string copy overhead anymore, so we just modify our code accordingly.
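The fix is just a change of the parameter types; sketched with the assumed names from above:

```cpp
// Before: both join columns are copied (including every std::string) on each call.
uint64_t count_tuples_with_join_partners(std::vector<std::string> left_side,
                                         std::vector<std::string> right_side);

// After: pass the columns by const reference -- no copies of the vectors
// and their strings anymore when the function is called.
uint64_t count_tuples_with_join_partners(const std::vector<std::string>& left_side,
                                         const std::vector<std::string>& right_side);
```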
Starting point is 01:11:23 Then we go back, build it again, and profile again with our --call-graph dwarf option. And now again it takes some time. But it should also get faster over time, since we're ideally improving the runtime of our application: a shorter execution time means we should also collect fewer samples of the stack. But keep in mind that this is a rather short application. If you do this for a big application and you track longer execution times, then you can wait quite a long time. In a recent research project we also profiled with VTune, and it sometimes took over 45 minutes to actually
Starting point is 01:12:18 generate the results so that we can properly browse through it. But back to this example, now we want to look into BM join again. We expand it, expand it further, and we see 93% spent in count tuples with join partners, very nice. So we do not have the additional string allocations that we saw before. We further expand, and what we're seeing here is std unordered set. This should be fine. We learned from our database lecture that hash joins are expensive.
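For orientation, here is a hypothetical sketch of what a counting routine like this could look like: a build phase that fills an std::unordered_set from one join column and a probe phase over the other. The names and the exact logic are assumptions for illustration, not the actual course code; note that the loop variables are deliberately written by value here, which becomes relevant in a moment.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical hash-join-style counter: build a hash set from the left column,
// then probe it with the right column and count the matching tuples.
uint64_t count_tuples_with_join_partners(const std::vector<std::string>& left,
                                         const std::vector<std::string>& right) {
  std::unordered_set<std::string> build_side;

  // Build phase: insert every key of the left column into the set.
  for (auto value : left) {            // by value: one extra string copy per key
    build_side.emplace(value);
  }

  // Probe phase: look up every key of the right column.
  uint64_t count = 0;
  for (auto value : right) {           // by value again
    count += build_side.contains(value);  // C++20; use count(value) before that
  }
  return count;
}
```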
Starting point is 01:12:53 I mean, they're fast, but a lot of time is also spent on accessing the hash map, first building it, later accessing it during probe time. But there are also two additional entries, some basic_string here and also at the bottom. If we look further into these lines, we can see there is some time spent on string construction. So this is something I think we should look into right now. We already eliminated the string copies before. Do you see any other issue with strings, where we might add or create strings
Starting point is 01:13:50 that are not related, and this is important, that are not related to the unordered set? That's something we will look at later. Yes? Yes, very good. So here in these for loops, we have auto value. These are values, so we create copies all the time. Here as well, we can change it and instead use references, as in the sketch below.
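As a sketch of that change, under the same assumptions as the hypothetical routine above: switching the loop variables from copies to const references binds them to the existing strings instead of copying each one.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Same hypothetical loops as before, but with const references:
// 'const auto&' avoids the per-element std::string copy.
uint64_t count_tuples_with_join_partners(const std::vector<std::string>& left,
                                         const std::vector<std::string>& right) {
  std::unordered_set<std::string> build_side;
  for (const auto& value : left) {     // was: for (auto value : left)
    build_side.emplace(value);
  }
  uint64_t count = 0;
  for (const auto& value : right) {    // was: for (auto value : right)
    count += build_side.contains(value);
  }
  return count;
}
```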
Starting point is 01:14:31 Oh, and one aspect that I actually haven't mentioned: in our previous iteration, just by changing our vectors to vector references, we could already reduce our per-tuple latency from 530 to 480 nanoseconds. Now we changed our for loops to use references instead of values, and if we run our benchmark again, we see a further improvement: before we were at 480, now it's 429. Very good. And now let's profile again. You might play around with the sampling frequency a little bit, and sometimes also with the number of bytes that you collect from each stack; for example, right now we have 1024. You can try to reduce it, but this could lead to perf not having enough information to generate proper call graphs. You can also try to reduce the frequency to reduce the number of samples, and therefore also the preparation
Starting point is 01:16:06 time for perf report. But for this example I'm just using the default frequency. Okay, now we have quite a different view here: we see that 51% is spent on count tuples with join partners. You can also see something like 0xffff...; I think these are anonymous functions without any naming information, for example lambda functions. And if we look further into our count tuples with join partners, we now see that all the time reported here is mapped to our unordered set. That's great. If we go back now to our application: imagine your colleague comes to you and mentions,
Starting point is 01:17:02 hey, I saw a very nice blog post about benchmarking hash maps, and there is this one hash map provided by a developer called Ankerl. He recommends that you improve the hash map, because now most of our time goes into our unordered set, and we want to check whether the hash set our colleague suggested is much better. Fortunately, we already included it in our example, so we can basically replace our std::unordered_set by this set. And this was the wrong include, give me one second, it should be unordered_dense. Okay, now we just changed the hash set, roughly as sketched below.
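As a hedged sketch, assuming the single-header unordered_dense library by Martin Ankerl; the JoinKeySet alias is just a hypothetical way to keep the change in one place.

```cpp
#include <string>
// #include <unordered_set>            // previously used for the build side
#include <ankerl/unordered_dense.h>    // third-party: martinus/unordered_dense

// Before: using JoinKeySet = std::unordered_set<std::string>;
// After: a dense, open-addressing hash set that is a near drop-in replacement
// for this use case (but test your code, the guarantees are not identical).
using JoinKeySet = ankerl::unordered_dense::set<std::string>;
```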
Starting point is 01:19:27 Do you have any guesses how much performance improvement we would get now? Just random guesses, there's no wrong answer. Factor 1, factor 10? Martin, what do you think? 30%? Okay. Any other guesses? 50%? Okay, let's see. So before we had 429 nanoseconds, and now 146.
Starting point is 01:20:01 So a factor of 3, actually. With that, we have reduced the latency from 530 in the beginning to about 146, but I believe we're still not done yet, so we want to profile again and check the report. Now you can also see that there are significantly fewer events, around 8 million; this is basically because, as I mentioned, the runtime is shorter now and we collect fewer snapshots. Here is our count tuples with join partners again. We can also see that page faults are now becoming more relevant: before, they did not have enough relative execution time to be shown in perf, but with the reduced execution time these page faults are now also visible here. But we focus on count tuples. We still have the four entries for our hash set, and if we look into these, we again see string
Starting point is 01:21:28 construction. So let's have a look at our code and see whether we still do some string construction that we could reduce. Does any of you have an idea? Based on our profiling data, it is now related to the unordered_dense set, so there's something around the unordered_dense set we can hopefully do to avoid the string allocations. Yes? I would say right where your cursor is right now; the emplace would construct a string there, I think. Right. Do you have a suggestion for how I could change it?
Starting point is 01:22:12 We could change the type of our set to just take pointers to strings instead of actual strings. Okay, right. Do you have a concrete suggestion for those pointers to strings? Okay. Maybe; I think this would also work, but there is also a nice C++ feature called string views. So instead of strings, we could basically create string views here. A string view essentially takes a pointer to my string and stores the length of my string; a small sketch follows below.
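A minimal sketch of that idea, shown with std::unordered_set for brevity (the same change applies to the dense set used in the demo); the column name left is a hypothetical placeholder. The important caveat is that the views only borrow the data, so the underlying strings must outlive the set.

```cpp
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Sketch: store std::string_view keys that point into the column's strings,
// so inserting into the set no longer allocates or copies string data.
// The views are only valid while 'left' is alive and not modified.
void build_set_of_views(const std::vector<std::string>& left) {
  std::unordered_set<std::string_view> build_side;
  for (const auto& value : left) {
    build_side.emplace(value.data(), value.size());  // a view, not a copy
  }
  // ... probe with std::string_view keys in the same way ...
}
```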
Starting point is 01:22:49 And with that, we do not copy the string data anymore. Now we are at roughly 38.1 nanoseconds per tuple, so we dropped our latency from 530 to 38. Again, this was a rather obvious example, but it demonstrates how you can use perf in an iterative manner to figure out where time is spent and how you can improve it. Okay, our display needs some time again, so we need to be patient. But basically we are not at the end here, so if you want to try this at home, feel free to dig deeper; there are still some optimizations you can do. What we did so far was rather a physical optimization: we looked at the physical level, at how we can improve our code,
Starting point is 01:24:13 maybe how we can also change the data structures. But, for example, this change of a hash set is nothing that you would just do spontaneously. If you do this in a bigger application, be sure that you have proper test suites that check that your functionality is still the way you want it to be; different data structures give you different guarantees. This was just to demonstrate how a different data structure can also improve the performance here. Feel free to dig deeper, especially in the database
Starting point is 01:24:44 context. You can also do logical optimizations, and often logical optimizations result in bigger performance improvements compared to physical optimizations. There is one more aspect that I wanted to show you about perf; it's called annotate. So what we did before was, one second, I don't have it mirrored, I cannot see much. So, what we did before was perf report. Actually, we can simply profile it again, because the execution time is quite low, and then report also only needs to process 3 million events, so we should be able to see it much faster. Somewhat.
Starting point is 01:26:00 Before, we just checked these entries and expanded them, but what we also can do is press Enter and then annotate certain functions. Now we see the different instructions with their corresponding source code above them. And we also see here, for this move instruction, that 42% of our time was spent on move instructions. We are here in the context of the unordered_dense hash set, and the source code above tells us that we are dealing with iterators and that we are in
Starting point is 01:26:47 this while line. So we have certain accesses here in our hash set where move instructions take more than 40% of our runtime. Does anyone have an idea why move instructions should have such a large fraction of our execution time? Yes? Caches and memory? That somewhat matches. So basically, what we are doing here is accessing memory, so we have to wait for memory. And while we are waiting for memory, we cannot do many other things here.
Starting point is 01:27:35 We also cannot execute the following instruction, which is a compare instruction here: we are moving data from memory, from the address in RAX into EDX, and then we need EDX for the comparison. So we cannot do much here; we just have to wait for our memory access to complete. That's basically what can happen in applications where, of course, most of the time not everything fits into the cache, where you might access memory that is not cached, and therefore you have to wait for memory.
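To connect this back to code: a dependent load whose result feeds the next instruction is exactly the pattern where samples pile up on a mov. Here is a tiny, purely illustrative example (not the course code) of such a memory-latency-bound loop.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: each iteration loads next[i] from memory, and the loop
// cannot continue until that load completes. On cache misses, the sampled
// cycles typically land on the load (or, with skid, on the instruction after it).
uint64_t chase(const std::vector<uint32_t>& next, uint32_t start, uint64_t steps) {
  uint32_t i = start;
  uint64_t sum = 0;
  for (uint64_t s = 0; s < steps; ++s) {
    i = next[i];  // dependent load: the next index is only known after the load
    sum += i;
  }
  return sum;
}
```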
Starting point is 01:28:08 Okay, I will not maximize it to full screen, otherwise we have to wait for the display again, but I'm also almost done here. One thing that I've already talked about is skid. This is the delay between the actual interrupt, at which our profiler or an application that uses hardware counters would ideally store its data, and the time when the data is actually collected. And when it is actually collected, our instruction pointer could already be at a different instruction, at the following
Starting point is 01:28:47 instruction. So what often happens, and this is quite common when you use profilers such as perf, is that the 40% we see here for the move instruction is mapped to this compare instruction, and this is only because of the skid. It's important for you to know this, because it could be very misleading if you see the 42% attributed to the compare instruction and have no idea about this quite common profiling issue, this delay. So always keep this in mind when you see some time spent on an instruction that you assume is
Starting point is 01:29:35 actually quite fast, because a compare instruction really is fast here. As I also mentioned before, different CPU vendors have certain options for more precise sampling, for example Intel's PEBS, as mentioned before. When you use perf and list your events, you can add the suffix :p, :pp, or :ppp to increase the precision. There is also, as briefly mentioned before, a huge documentation by Intel; it's very long, about 5,000 pages, but it
Starting point is 01:30:10 should contain a lot of the information that you need for Intel development. And with that, we are at the end of the session. In summary, we learned today about profiling and about different types of profiling. I talked about instrumentation and about sampling-based profiling tools. Sampling often reduces the overhead significantly, and sampling-based tools also do not require you to repeatedly modify your code, compile it, run it, and modify it again. I also introduced you to hardware counters. Hardware counters are your
Starting point is 01:30:48 friends. You can check them out when there are new CPUs that you want to work with, that you want to run your application on. It can be very insightful to see if there might be new counters for maybe a nice new feature of the CPU that you can use. And we did a short hands-on session with perf. I showed you how you can use it to identify hotspots, performance issues, or certain code executions that you did not expect
Starting point is 01:31:19 or that you did not expect to be so expensive, and how to improve it. We also briefly talked about a few setup configurations for profiling with C++, and keep in mind the RelWithDebInfo build type that CMake provides. Are there any questions related to this profiling session, or to profiling tools or perf in general? All right. If not, I already mentioned there's quite a lot of documentation, especially about perf; feel free to check it out. Our next session is tomorrow. Tomorrow, the lecture will be about multi-core parallelism
Starting point is 01:32:06 and then next week on Tuesday we will discuss the solution of the first task and also introduce you to the second task. Please also keep in mind that the deadline is next week on Monday. And with that,
Starting point is 01:32:22 thanks and goodbye
