Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Profiling
Episode Date: May 24, 2023...
Transcript
Okay, hi everyone, let's get started.
So today we're going to, well I'm going to speak a bit about profiling.
Microphone's on, yes, okay good.
So this is going to be split in two parts basically.
I'll give a short introduction to some of the background behind profiling,
some of the terminology, just a bit of light background.
And then the second half is going to be more of a hands-on part.
I just sent around a notification in Moodle.
If you want to follow along, that's how you can do it.
The slides are in Moodle, and they contain all the commands
that I'm running and everything and some screenshots.
So this should hopefully be OK for you to follow along.
If you'd rather pay attention to what I'm doing
and not follow along, then this is going to be recorded.
So if you're not here and watching this as a video,
feel free just to run the commands
and pause as you please.
Yeah, I'll just get started.
And let's see how this goes.
So just a brief overview again of where we are.
So yesterday, we had the data structure lecture.
Today, we're going to look a bit at profiling.
So this is still mainly in the CPU and DRAM domain.
So also the profiling we're going to do is going to be in that domain.
There's some other things that you can do to profile GPUs and network and disk and that stuff.
But we're not going to cover that today.
Just basic timeline.
I think this is still correct.
So we're here right now.
Next Tuesday we'll have the introduction to the second task and the discussion of the solution of the first task.
So reminder, Monday evening is the deadline for task number one.
And then yeah, this is still the tentative stuff for the second part of the course.
So, yes, in this lecture, again, we're going to have two parts.
In the first part, I'm going to basically just share some concepts behind profiling and also some tools that I use.
So this is by no means an extensive list of everything that exists out there.
This is some of the setup that I've been using or we've been using at the chair that I'm a bit familiar with.
So again, this is mainly based in the Unix environment because this is what I work with.
So this is going to be a Linux server and I'm briefly going to just show something on Mac OS.
But the focus is going to be on Linux server today.
And this has a strong focus on C++,
just because I don't have that much experience
in profiling other languages.
But I guess most of this also applies to other things.
So the tools, for example, perf that we'll be showing,
you can use this with anything.
So this also works with Rust.
I guess if you work with JVM or Java,
there's going to be some different profiling tools there.
If you have any questions regarding that, you can also ask me.
I have a bit of experience in that, but I'm not going to cover this here in the lecture
because for the lecture, we're mainly programming C++.
So yeah, if you're watching this at home, feel free just to watch the video,
pause, run the commands, see what happens, play around with this a bit.
Just to start off basically by saying,
everything I'm showing you today
is just gonna be a very light introduction to the topic,
just to show you some of the tools
or the main tools I've been using for this.
If you're a perf expert already,
you're probably not gonna learn that much today.
But in general, this is,
using these kind of tools
is also just a skill that you need to practice.
So for most cases, you've probably not run into this
because it's not been super relevant.
But maybe down the line, it'll be relevant to see
where performance is being spent or where you're
losing performance, or whether things are doing what you expect them to do, and so on and so forth.
So maybe just kind of run the commands, see what they do,
get a feeling for what you can do with them,
because especially perf is an incredibly powerful tool
and we're only going to scratch the surface of what it does
because you could probably do an entire semester of perf lectures to show what this stuff can do.
Okay, so profiling.
Let's start with the mandatory kind of random definition here.
I picked this off Wikipedia just because I think it's quite nice.
So basically, if we look at profiling,
we want to analyze something about the program.
Key part here is that it's a dynamic analysis,
so it's not something we do ahead of time,
we run the program or an application, and we kind of do something to inspect what's happening
and usually we're interested in something like the space or the time complexity: so how long is it running, how much memory does it use, which instructions are used, how often are they used, or maybe how often are certain functions called, how long the functions take, and these kind of things. So it's a very, very wide spectrum of what we can inspect in the program.
And then depending on what your goal is,
you might want to use different tools or different approaches
to cover certain things.
So yeah, basically, why do we profile?
Why do we want to do this?
And there's actually two reasons to do this.
So first of all is to understand performance.
So for example, here when we do some research,
we're writing a paper, it's really important for us
to be able to explain why our approach, for example,
is better than another one.
And for this, it's really important to profile
and to see, OK, we're spending maybe less time here,
more time there.
And this is not necessary to optimize it,
but just to understand why is something behaving
like it's behaving.
Or if you're just running some benchmarks,
it's important to know why am I only achieving this performance?
Because I'm spending a lot of time in this function.
It's very slow, so I can't be faster than that.
And also, we might want to profile to find bugs,
or something like a memory leak, or something like that.
I'm profiling.
I'm seeing my memory is increasing, increasing,
increasing, so maybe I found a memory leak. We can also do this through profiling.
And then the second step, after I've understood performance,
then it can be optimizing or tuning your application.
So let's see, I found a method and it's very slow.
Then I'll spend some time on optimizing this.
And this is something we'll kind of go through in the example later.
We're trying to see, okay, where are we spending time?
Or what are we seeing that's unexpected?
And then how can we apply a few basic fixes to do this?
And the optimization can be automatic or manual.
So often it will be manual.
You'll run some commands,
you'll see where time is being spent,
and then you'll optimize your code accordingly.
But this can also be done automatically.
So basically, if you think of the JVM, for example,
this will profile your code while it's running and then
decide based on some characteristics
if it wants to JIT a function or not.
So this can also be hidden from you.
But it's basically the same thing.
And there are mainly two types of profiling tools: instrumenting profilers and sampling profilers.
And I'll get into both of those in a second.
OK, so the first one is tracing or instrumentation.
And they're kind of intertwined a bit.
So the terminology is not always 100% strict or clear.
So I'm just trying to give an overview of what they generally
mean and how they stand to each other.
In general with tracing, we mean that we want to observe the execution of an application
via traces.
Traces are basically just logs.
The main difference is if I do logging or normal logging, so most of you might have
written an application where I want to log something. Then this is often some semantic event.
So let's say user logged in successfully or there was an error that returned 500 codes, so I'm going to log this.
And this is often like an important semantic event.
While traces are usually used more to capture the flow of something.
So if I want to really dig into something in more detail, I want to see, OK, so first this method executed,
then this method executed, and this one.
And maybe I have a request ID in all of these logs
so I can kind of see how this went through the program.
This is usually a lot more detailed than general logging.
And if you've used logging frameworks,
you'll often see the log levels you have.
Like you have the fatal and error at the top,
and you have warning info debug,
and usually under or below debug you have tracing,
which usually means this is something
that is gonna be very, it's gonna be expensive to do
because I'm logging this all the time,
but it gives you the most details,
the most fine-grained information that you can get.
And then instrumentation basically is just the process of adding traces.
Again, kind of un-technical, but we modify the code in some way
so that we can observe or monitor this behavior that we're seeing.
So yeah, we're basically just adding traces.
And then there's two main ways to do this.
First one is source instrumentation.
So this basically just means we're going to modify the application code.
Most of you have probably done this in some shape, form, or whatever.
It's just like I can add a timer and say, okay, start, end.
This is basically tracing already.
And this is a lot more common than binary instrumentation for users because it's a lot easier to do.
You're most likely not going to modify your code on a binary level.
But there is also binary instrumentation, and this basically actually really modifies the binary code.
And again, this is used, for example, for jitting in VMs,
or there are certain profiling tools like Valgrind that actually do this for you.
But yeah, most tools do this.
If you do it manually, it's going to be usually source instrumentation, and then existing
profiling tools like perf, for example, also do binary instrumentation.
OK.
So let's say we have this very, very basic method,
and we wanted to analyze some performance characteristics
of it.
Let's say we wanted to see how long it
takes to run the bar function in this context,
and we wanted to understand how many times we call Baz.
Then we can do something very simple,
and we can just say, OK, in the foo method,
let's add a start timer.
Let's run the calls.
We do an end timer.
And then we track somewhere along the way
this took this much time.
And then for Baz, we can say, oh, OK, I
have a counter somewhere.
Just add plus 1, because then I know that this was executed.
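For reference, a minimal sketch of what this kind of source instrumentation could look like; the function bodies and the bookkeeping variables here are made up for illustration, not the exact code on the slide:

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

// Hypothetical bookkeeping for the measurements.
static int64_t bar_total_ns = 0;
static int64_t baz_calls = 0;

int baz(int x) {
  ++baz_calls;     // instrumentation: count every invocation
  return x / 10;   // original work: a single division
}

int bar(int x) { return baz(x) + 1; }  // assumed body, just for illustration

int foo(int x) {
  auto start = std::chrono::steady_clock::now();  // start timer
  int result = bar(x);                            // the call we want to time
  auto end = std::chrono::steady_clock::now();    // end timer
  bar_total_ns +=
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
  return result;
}

int main() {
  int sum = 0;
  for (int i = 0; i < 1000; ++i) sum += foo(i);
  std::cout << "sum=" << sum << " bar_total_ns=" << bar_total_ns
            << " baz_calls=" << baz_calls << "\n";
}
```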
I just prepared something.
Actually, I'll do this first and then.
Okay, so the big, big advantage of this
is that it gives you really full control of what you're doing.
If you're interested in something, I can just say run this.
If I'm not interested, I can say don't run this.
If I'm not interested in the baz counter anymore, I can just remove it.
But if I'm interested in even more detail,
then I can add even more details here.
So this really gives you full control.
But generally, this has a very high,
or can have a very high overhead,
depending on the granularity that you use.
So if we look at the baz function, for example,
it just does a division by 10.
And now suddenly, we have some map access that we need to do.
We need to increment something.
So this will suddenly make the code a lot bigger, which I'll show you in a second how much this changes stuff.
There's an overhead of timer calls.
I'll get to that a bit later.
And this may even impact compiler optimizations.
So let's say we have a very small function and the compiler decides to inline this.
Maybe we've modified it now in a way that the compiler decides,
oh no, it's too big now, I'm not going to inline this.
So then suddenly you're changing the characteristics of the application.
And just to show you a very small example of,
okay, this is, how can I get this to go across?
Okay, we can see now, so we have this, where's my mouse?
Okay, so we have this small code example here, which is the same, very, very basic.
And now we have the code here at the bottom that adds the timers, which is just a clock, and we add this here.
So this is the left of the code on the top.
We can see we have the division, which
is converted to some fancy multiplication here.
And we can see this is all inlined here.
And on the right, which is very hard for me to read now,
we can see that there's suddenly within Baz,
instead of just doing the division,
there's suddenly all these move and call instructions.
We need to get something from the map.
And then we call another something to delete something.
And suddenly, this has made our code a lot more complex.
So this obviously heavily depends
on how you implement it, just to kind of showcase that.
I think everything is still inlined here, actually.
But you can see there's just a lot more going on
because we have different components flying around.
So you just need to be aware that this might be expensive.
How do I get back to my PowerPoint?
OK, perfect.
Yes.
OK, so there are some pitfalls that you can run into
when you're doing this.
I'm just going to mention a few very briefly.
So this is generally a problem if I want to do very, very short measurements.
If something's running for five minutes and I add a timer before and afterwards,
it doesn't really matter because, well, this is going to be very quick.
But especially if you're interested in very, very tight loops.
So, for example, in the first task, the loops we're working on, the instructions, there's maybe only a few dozen instructions.
And modifying this can and will have an impact on performance.
So some things that people kind of tend to ignore or may ignore is that the compiler and the CPU can reorder instructions. So if I'm interested, for example, just in the duration of one single instruction,
and then maybe the compiler decides to reorder something,
then I'm not actually measuring anything meaningful.
So this means that the timers can start and end
in the wrong place, and you kind of need to pay attention
to how this is done.
If you go down this rabbit hole and you're really
interested in very, very, very accurate performance
measurements of cycles, then at some point
you'll probably read some blog post or Stack Overflow answer
that says, use this read timestamp counter instruction
on Intel, for example.
But that's incredibly hard to get right.
So this is basically a hardware register
that tracks the number of cycles that are processed.
So if I'm really interested in how many cycles have passed,
I can just read this.
But it's not consistent with regard to the clock frequency.
So if on different cores I have different frequencies,
then all of this doesn't really work.
And yeah, so it's not synced between cores.
So for example, if my application switches from one core to another,
then kind of this all breaks.
And again, this is a non-serializing instruction,
so it can be reordered and very likely will be reordered.
So if you're really looking at a very, very, very low level of performance measurements,
then this can actually go wrong and you're just measuring something incorrect. And then there's one
thing that again applies mainly to very short measurements. For example in C++
there's something called high resolution clock which people like to use because
it says it's high resolution, but this is actually by the standard not
guaranteed to be monotonic.
So while this might work, it technically doesn't have to work.
And by not monotonic it means technically if the system clock adjusts,
it can go back in time and then weird things happen.
So also like if you check CPP reference or something,
it says it's recommended to use steady clock if you want to check durations
and use system clock if you're actually interested in wall time that has passed. So yeah, there's a few things here, and obviously, again, if you're looking at very, very short measurements, then getting the time actually has a cost. So it really depends on the system and what you're running on, but getting the timestamp now will cost you maybe 10 to 30, 40 nanoseconds.
It really depends on what you're running on.
Sometimes it's very fast, sometimes it's a bit slower.
And if you're looking at measurements at this frequency or this granularity, then adding a timer on every instruction will probably really, really mess up your measurements.
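As a small, made-up illustration of that clock recommendation: steady_clock for measuring a duration, system_clock when you actually want the wall-clock time.

```cpp
#include <chrono>
#include <ctime>
#include <iostream>

int main() {
  // steady_clock is guaranteed monotonic: safe for measuring durations.
  auto t0 = std::chrono::steady_clock::now();
  volatile long sink = 0;
  for (long i = 0; i < 1'000'000; ++i) sink += i;  // some work to time
  auto t1 = std::chrono::steady_clock::now();
  std::cout << "duration: "
            << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
            << " us\n";

  // system_clock follows the wall clock (and may jump when it is adjusted):
  // use it when you want an actual point in time, not a duration.
  auto now = std::chrono::system_clock::now();
  std::time_t wall = std::chrono::system_clock::to_time_t(now);
  std::cout << "wall time: " << std::ctime(&wall);
}
```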
Yes?
One quick question regarding all the TSC.
So it does count cycles, right?
It would still tell you, like, let's say two cores are at different speeds, and your code consumes this many cycles. It would consume an equal amount of cycles on each core, just the quicker core would be quicker to process it.
So it's still useful information.
Yeah, so this is definitely useful.
The question is just what do you want to do with this?
So often, so if you're purely interested in cycles,
then this is probably okay and probably good to do.
But if you want to convert this into time, for example,
then that doesn't work.
But if you're purely interested in cycles,
yes, this probably works.
If you also consider that it's non-serializing and that you need to do certain things to guarantee that it will be executed at the right point in time with regards to stores and loads and whatnot.
There's a lot of Stack Overflow answers on this and a lot of discussion on how to do
this and how to do this correctly.
And most people say try to avoid it if you don't really know what you're doing because
this is probably one of these things that you run into, and you think, ah, it looks like I'm doing the right thing, but maybe I'm doing the wrong thing. So yeah, but in general, just cycles, I think this works. Okay, but I've kind of stayed away from it, also because a lot of people say it's dangerous.
Yes, like I just mentioned, tracing everything can be very expensive.
So if we just go quickly back to this example, I'm doing this every single time.
We've just seen this is very short.
These measurements are very short.
The instructions that I execute are very few.
So doing this on every iteration is actually very expensive.
So instead, let's just sample this.
So let's say we'll do this every 10th instruction,
every 1,000th instruction, or maybe do this every 10 milliseconds,
every 100 milliseconds, or something like that.
So what we can do is we can just add, again, manually,
we can just say, well, if there's some counter somewhere,
and that is now a multiple of 1,000, then we'll track this.
So we sample every 1,000 events.
And so obviously, this is less accurate
than the full instrumentation, because we still
have the modification of the source code,
so we're kind of still messing with what the compiler will do.
And we're not tracking every event, but only a sample of it.
But generally, this has a lot less overhead during runtime,
because we're not doing the expensive calls every time, but only occasionally, gated by a check that's fairly cheap to compute.
So here, probably also the modulo: if you really wanted to do it at this level, doing something modulo 1,000 is probably a bad idea because it still involves multiplication and everything. If you do something that's a power of two, then this will just be a shift, and it's a lot quicker to check, and these kind of things. So again, if you're really looking at low-level or high-performance measurements, then there's a lot of things to pay attention to. And yeah, so one main downside, let's say, is that this still requires manual code modification, so writing this in all of your code is going to be very, very annoying. So if you're generally just saying, oh, where am I spending time?
What's going on?
Then I don't want to write this for every single function.
Sure, if I only have five functions I'm interested in, then this is OK.
If I have a code base with a million lines, then this is probably going to be very messy,
especially if we think of something like I need to keep track of some counters somewhere
and these kind of things.
That gets very messy very quickly.
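Just to sketch the sampling idea with a power-of-two interval (names and numbers made up):

```cpp
#include <cstdint>
#include <iostream>

static uint64_t event_count = 0;
static uint64_t sample_count = 0;

// Sample every 1024th event: 1024 is a power of two, so the check is a
// cheap bitwise AND instead of a modulo/division.
constexpr uint64_t kSampleMask = 1024 - 1;

inline void on_event() {
  if ((++event_count & kSampleMask) == 0) {
    // Only every 1024th call pays for the bookkeeping; in real
    // instrumentation you would grab a timestamp or counters here.
    ++sample_count;
  }
}

int main() {
  for (int i = 0; i < (1 << 20); ++i) on_event();
  std::cout << "events: " << event_count
            << ", samples taken: " << sample_count << "\n";
}
```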
OK, so let's kind of switch now from the source modification that we did.
And let's look at something that hardware provides for us
that we can do to inspect some of the code we're running.
And the first thing we kind of look at there
is hardware performance counters.
So these are special registers to track hardware events.
So this also means they run in hardware,
and this also means they're very fast.
And you don't really need to care about them.
Because for example, if you have an Intel chip or power chip,
I think basically all of them do this.
And every clock cycle, some counters
are incremented somewhere in hardware registers
that at some point I can read.
So there is a small cost in reading these registers,
but they're constantly tracked, and there's
no cost for the system to really do this.
So again, on Intel, for example, there's this read performance
monitor counter instruction, which was introduced,
I don't know, probably 10, 15 years ago.
It's been there for a while, to extract information
from these counters.
And Skylake, so this is now, I don't know, what, seven-ish
years old?
I don't know.
Not super important.
Skylake has four programmable registers, or eight if you disable hyperthreading, because then you'd have two logical cores that share this. And now Ice Lake, for example, has eight programmable registers per core, and they have a wider range, but now they have a bit more restraints or restrictions on which instructions or which events I can track in there. And again, this is very hard to generalize, because it really depends on the system you're running and the CPU generation you're running, which kind of events are supported and which aren't.
But generally, this will be something like L1 misses, number of cycles, branch mispredictions. So very low-level metrics that you want to extract that are basically impossible to track otherwise with this level of detail.
And I said there's four or eight registers that I can just program and say, please, in
this register, track this counter.
In this register, please track this counter.
Let's say I wanted to look at 10 counters.
Then I don't have enough registers.
So what most tools will do, or systems will do,
is to multiplex this.
So for one cycle, for one iteration in whatever time period I have,
I'll put this value in the register,
and once that's done, I'll take the other value and put that in the register,
and then I can kind of time multiplex and scale the values.
This is obviously a little less accurate, but this is how most tools will do this.
And for the majority of things you're looking at, this is probably also fine.
Okay, so like I said, we're looking at hardware events, and we want them to be very precise, because we're actually interested in very low-level performance metrics.
So basically, all major CPU vendors
now have something that lets us do
some form of precise event-based sampling.
So on Intel, this is called PEBS. On AMD, it's called instruction-based sampling. On ARM, it's kind of called CoreSight. It's not 100% the same, but it kind of covers the same topics.
And the main idea behind this is that I can set a register,
and I can say when this register overflows.
So I'm saying basically this is a limit.
Let's just say 100.
If my register now has counted 100 branch mispredictions,
then it will trigger an interrupt.
So then the CPU will say, OK, this register's overflowed.
Interrupt, and then the application monitoring this can say, OK, let me read the values that are there now. And then I can just store this information. And this is, for example, what many profiling tools do. So I'm interested, OK, when this register now has overflowed, I get an interrupt. I'll check, okay, what is the current instruction pointer, where is that pointing to, what are the registers, what are the values here. And I can collect all of this: what's the stack, what's my call graph right now. And I can just dump this somewhere or store this somewhere, and then I can just continue executing normally.
So this is what basically all tools do; mainly, let's say, the focus here is on perf.
So this is definitely what perf does and Intel VTune for example also does this.
This is all hidden from you.
You don't really need to care about this.
I'm not really sure.
I think on the next slide I have some more information on perf for this in a second.
So this is basically some of the basic,
the fundamental background on how these tools operate.
So if something like perf,
I'll get to in a second, we'll use these hardware counters,
we'll use this precise event-based sampling to monitor what's going on,
and then give you this information.
Another question.
So nowadays, lots of code is executed in virtualized cloud environments. Would you say that the virtualization just prevents that, or are there different levels, where you could say, like, Linux-based containers, you get everything, get the permissions, but general hardware-based virtualization breaks this and you can only go to software?
I'm not 100% sure.
There is a lot of discussion.
So, obviously, people doing virtualization are still interested in these things.
But I say if you think of it on, like, an abstract level,
then probably virtualization will say, I don't really care about this.
So, it's not necessarily going to pass this on.
And also, for example, if you run something in Docker,
then you need to run everything in privileged mode
so you can actually even get this information.
Because by default, you're saying this is important information
and can be used for security reasons or whatnot.
So there are ways to expose this, and it does work.
I'll show an example.
This does work in Docker,
but it really depends on if you want to expose this or not.
But I wouldn't say that virtualization in general
prevents this, because also in virtualization,
you physically need to run somewhere.
And then it just depends on the tooling around it
if it can stitch everything together again.
I guess I don't mean whether it prevents it, but let's say we two run something on the same physical box. Yeah.
I look at the hardware counter, I do nothing.
I look at it again, and the counters increased because you did something.
Okay, so that's an interesting question.
So the question was basically,
what if we have two applications running,
one is mine, one is his,
and now we're reading counters and they change.
So first of all, this is only per core.
So you can monitor system wide.
But you need privilege access to do this for security reasons.
So in a truly virtualized environment,
this won't happen.
And you'll generally just read the counters that
are assigned to your program.
So that generally shouldn't happen.
But again, this is kind of tooling around it. You, A, shouldn't really care about this. If you're building virtualization yourself, you probably need to deal with this. But in general, I'd say tooling or the kernel or the CPU will take care of this for you, so that you don't read values that someone else has produced.
Yeah.
As long as the cores are not shared, you're fine.
If the cores are not shared, then this is a zero problem, because the counters are per core.
If they are shared, then it's basically: if I'm scheduled off the core, then you would need to manage this somehow.
And I assume there's probably, I'm not 100% sure, but there's probably an interrupt for that, too.
Then you can probably track something and say, okay, I've been scheduled off this core, store everything.
And then when I go back on, wherever I'm scheduled to next, I can start.
OK. Let's dig into some tools and some friends.
So this is just some basic profiling tools
that I've been using a bit, and also some tools that are not
directly profiling tools, but that help you with profiling.
So profiling in itself is often a bit of an art form,
let's say, because it's usually not super obvious what's
going on.
So the example we'll show later, it's a bit more obvious,
but often it's not that obvious.
So you need to dig into a few things
to understand where time is going.
So the first one, the first big one I'm going to share is perf.
So this is just the standard Linux kernel profiling tool.
So basically, if you have Linux, you probably want to
use perf.
And it abstracts all these hardware and software events
for you.
It doesn't matter if you're running on Intel or AMD or
ARM, there's a set of things that it just fully abstracts
for you.
Obviously, if you're interested in CPU-specific
counters or events, then you need to know which hardware you're running on.
But something like cache misses and branch mispredictions
and these kind of things, this will just abstract for you.
And it doesn't matter where I'm running.
I can just say, give me cache misses, and it will do that.
So these are like hardware events.
But also, it has a lot of software events
that you can kind of look into.
And that's why I'm going to start with this scary picture here.
And you can see in the kernel there's a lot of different things going on and I
can
get information about syscalls or networking
or file systems or page faults.
For example, I think there's
core migration which would be I've been switched to another core
and these kind of information I can all get that and perf does all of this for you, which is quite nice.
If you run perf list, there's a list of all these events.
And I didn't fully check because there's
multiple lines per output.
But they're in the order of thousands of events
that you can use in perf to track.
Most of them will probably never be relevant for you.
But some might be.
By default, currently, perf samples at about 4,000 hertz, just FYI.
You can obviously change this too,
but that's just what you usually get.
And now just to cycle back to these hardware counters.
So on Intel, it tries to use this unhalted core cycles counter. So basically, when that reaches an overflow with precise sampling, then it'll take a snapshot of everything and continue.
On AMD it's a different instruction, on ARM again different instructions. You really need to dig into some of the documentation to figure out what's going on underneath, but in most cases it's also not very relevant.
And yes, so there's some really good resources on how to use perf. One is just the official wiki from the Linux kernel, and the second one is where I also got this picture from, from a guy called Brendan Gregg, who does a lot of perf things. There's a very nice talk about how they use perf at Netflix on this website.
There's a lot of one-liners that you can use just
to check different things.
So it's a very, very good resource
if you want to get into Perf.
So yeah, Perf, like I said, is a very, very mighty tool.
And if you just run Perf without any arguments,
then this is the list of commands you can run.
And this says this is the list of the most commonly used
commands.
So there are even more.
It's just not showing all of them.
Again, most of them are never going to be relevant for you.
This was originally also designed for kernel profiling.
So a lot of these things are probably only relevant
if you're actually in the kernel or if you're
doing very specific things.
And the most commonly used ones, don't take my word for it.
Obviously, it depends on what you're doing.
It's just going to be perf record and report,
which says record something and later on, once I've recorded it,
show me the output.
Or perf stat, which just counts the events and outputs them
directly.
And some nice things.
So you can attach this to a running program in Linux,
as long as it's your own.
So you can't attach this to someone else's program.
You can do system-wide monitoring.
So I can just say, please track all page faults system-wide.
You need to have administration privileges for this, but yeah, it's possible. Most often, you'll probably just run a single application, so you'll say perf record my application, then my application finishes, and then I can see what's going on afterwards, which is also what we're going to be doing later.
OK, so just a very brief output here.
So if we have perf stat, for example, for the command ls: so you use perf, then the command you want to do, and then this is usually the program.
For this very basic example, I'm just showing ls.
And this is just the default set of events that perf will emit.
So we can see how much time it took,
how many context switches there were.
This is the CPU migrations I was talking about.
So how often was I switched from one core to another?
It will track the number of page faults we had,
how many cycles have passed, how many instructions
were executed.
From this, we can also then see how many instructions
per cycle there were.
We can see how many branches were taken, I guess, or how many branches
there were, and how many branch mispredictions there are in here.
So these will just give you a general idea, and depending on the output, sometimes this will also give information like: oh, you're spending a really, really high amount of time somewhere here, you're bound by something. It really depends on what you're running. And then, just a bit more specific: let's say I wanted to run this 10 times, just to make sure I'm actually getting good results, and I'm only interested in L1 data cache loads and stores. Then I can just specify the events I'm interested in, and then I get only these events as output. So I can kind of mix and match here what I want to do.
OK, next one is Intel VTune.
So as the name says, this only works on Intel.
It's an Intel tool for x86.
They have a desktop app that you can run,
which you need to run on it.
So you need to have an Intel CPU to actually even
run this locally.
But you can then connect it to a remote server.
There's a web UI, which means it has an application on the server and you can connect through the web, which I'll show you in a second. And they have a command line tool. I usually use the web UI because I have a MacBook; I don't have Intel anymore. But I can use it if I want to profile Intel servers, and I'll show you that in a second.
The command line tool, I don't use that often,
because actually the one main advantage for me personally here,
this is a visual tool, so I can kind of click through it,
and if this helps you, then that's good.
Like I said, it only works on Intel x86,
but then it has a lot of more specific information that might be only available on those systems. So just one example, all the arrows are a bit messed up, so one example here I've picked is 4K aliasing, which is like a cache effect: when I'm accessing memory in four-kilobyte strides, then usually this is really bad for caching, because depending on how your cache is designed,
I'm always evicting the same cache lines just by how the addresses are calculated.
So this can give me information on that, which again is very specific to the actual CPU and how the caches are designed.
If you ever run anything on the data lab servers, this is installed on them.
You can opt in to oneAPI somewhere.
There's an application called VTune Backend,
and you can run that to actually get this in the web UI.
And then this looks like this.
I have no idea what I actually profiled here,
so this is something old.
Let's see.
And then you can see there's some, this is probably not going to be super, oh this is
some, okay.
And then you can see some things here, and you have like a call graph and you can see
where time is being spent and I can't show the assembly probably right now.
OK, I can't show that right now because the path is all gone. But then you can see the assembly and the code
next to it in which instructions are being executed.
And all these things, it's very visual.
And you can see there's a lot of information here
on where this came from and I don't know which function.
And there's a flame graph here of how much time you're spending.
So these are all very nice visual features all in one.
If you're profiling Intel, then it probably
makes sense at some point to look into this.
And this also gives you some nice insights into memory consumption and memory bandwidth and on which memory channels this stuff is running, if you're interested in these kind of things.
Question? So this, yeah, this is generally, I'll get to that in a second: if you're profiling, you always basically want to build with debug symbols. It doesn't mean you need to have a debug build.
You can do release with debug symbols, which can be -O2 or -O3, just with a -g to add this.
And then there's the debug information
that then the tools can pick up.
Otherwise, you'll just get cryptic names here, which is like 0x2499 underscore something. Whatever it can pick up: if it has information, you get it, and if it doesn't, then you won't get anything.
OK.
Yeah, so this is a nice tool.
If you want, play around with it and see what can be done.
Another one is Apple Instruments, just
for the sake of completeness here.
This is Apple's visual profiling tool.
This is mainly designed for macOS and iOS applications, so it kind of lives in this Xcode environment. You can use it without, and I have used it without, which I'll show you in a second. This only works for macOS, but it supports ARM if you have an M1 or M2 MacBook, which is quite nice, because if you have, like, Intel on the one side, x86, it's a very, very massive tool. Tooling specifically for ARM, compared to that, is very low. There are some commercial ARM profilers, but this one is free. So that's usually a problem if you just want to play around with it.
And I have something here.
I can switch that across.
So you can see here there's some code I ran for some benchmarks a while back.
Without digging too much into this,
you can see here how much time we're
spending in different instructions.
And we can see it as a call graph here
and seeing how much time we spend.
So you'll kind of see the structure again later in Perf.
These tools kind of all have a similar structure.
But without spending too much time on it, we can see how much time we're spending, and basically all of it we're spending here in this
hash bucket get function. Again, Apple Instruments, feel free to play around with it if you have a
MacBook and if you want to profile some of your code there locally. Okay, some friend tools. So
you've probably seen this in the lecture already. It's Godbolt. It's officially
called Compiler Explorer, but the guy who invented it is called Matt Godbolt, so everyone just calls
it like that. There's the link. I'll also show this later in the hands-on part.
Yeah, so basically what it does, it just takes the code you have and it shows the generated
assembly for it. So this is incredibly powerful if you want to discuss what the compiler is actually doing. The main motivation for this actually was that two people were arguing about whether the compiler will optimize something or not, and with this tool you can actually say the compiler will or will not do this.
It works for many, many languages. It works for many, many compilers: if you want to look at GCC and all its versions, if you want to look at Clang and all its versions, it runs on ARM, it runs on x86, you can compile stuff for POWER on there, you can use super esoteric compilers that no one really uses, but they have it. So definitely check it out. And just one fun fact that I didn't know for a long time actually: you can also include public headers from GitHub. So if you go onto the raw view on GitHub, I can just take that, include it in C++, for example, and then I can just use public code there, let's say if I have a dependency, and that's the easiest way to do that.
Okay, the next one, and this is a bit more specific, but I've been using this a bit recently for some things.
I've been digging into SIMD for example, and there's this website called uops.info.
This is run by some people from Saarland University. And on this website, you can get, for x86 servers, so this is AMD and Intel, information on certain instructions, for example this VDIVPD instruction, which is probably some division of some floating point, not 100% sure, double, yes, okay. And this is really helpful if you really want to dig deep into something.
So most of this is probably not relevant.
There's just a link here.
Again, if you're looking at assembly generated code,
you're looking at some profiling output,
and you kind of want to see why instructions are behaving like they are, this might be helpful.
The third one, which we're also going to use in the example,
is Google Benchmark.
So this is a widely used, if not the most widely used, C++ microbenchmark framework.
Something that it's also kind of well known for is this do not optimize method.
There's a nice video explanation of what it does.
It's basically a compiler developer telling people how to trick the compiler into not doing certain things.
Because if you're benchmarking something and the compiler
sees, oh, you're actually never doing anything with this
value you're reading, then I can just remove the entire
function call.
And then you're profiling nothing.
This usually happens when you profile something or
benchmark something, and you see I'm getting 10 trillion
events per second, then this is probably happening.
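For reference, a minimal sketch of the Google Benchmark pattern with DoNotOptimize; the benchmarked function here is made up:

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>

// Made-up workload: without DoNotOptimize, the compiler could see that the
// result is never used and remove the whole computation, so we would
// "measure" nothing.
static uint64_t sum_to(uint64_t n) {
  uint64_t s = 0;
  for (uint64_t i = 0; i < n; ++i) s += i;
  return s;
}

static void BM_SumTo(benchmark::State& state) {
  for (auto _ : state) {
    uint64_t s = sum_to(1024);
    benchmark::DoNotOptimize(s);  // keep the result observable for the compiler
  }
}
BENCHMARK(BM_SumTo);

BENCHMARK_MAIN();
```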
And there's an online version of this, QuickBench. So this is kind of similar to Godbolt.
I'm not going to open it now for the sake of time.
But you can add some code here in this Google Benchmark format.
And then you can just run it, and it'll give you a brief overview.
This is not going to be super precise,
because you have no idea which server it's running on.
But you can see conceptually which of the two might be faster or not.
So if you want to look at really, really low-level optimizations for specific hardware, this is probably a bad fit.
But if you're looking at general things, this is actually quite a nice tool.
And you can then just share a link and show someone, hey, I ran this somewhere without needing your own server.
Okay, just a list of other things.
Again, I said profiling is usually not done in isolation.
Actually, the first one is still in isolation.
So, Valgrind, there's an entire tool set around this.
This does dynamic binary instrumentation.
So, it kind of says, I'm simulating the CPU instructions,
adding something, and then actually executing it on the real CPU.
So, this is an incredibly powerful tool. I personally don't have that much experience with
it, so I can't share that much information here right now. But again, a lot of people use it.
It's very mighty. It's been around for 20 years. There's other tools you should definitely use,
like static analysis, static code analysis. So I think for the example I'm going to show you in a second,
most of this will actually be found by a static code
analysis tool.
Something like address sanitizers,
thread sanitizers will help you find bugs.
So then it kind of also guides you in where you need to profile and where not. If it's telling you that you have a problem there, then you can kind of look into it some deeper. Use flame graphs. Like I showed you, it might be very
helpful to get a visual representation of where time is being spent. If you are a visual person,
there's a lot of tools out there for it. VTune can do it. There's a plugin or like a script for
perf to do this. And there are many, many others. So if you ever need to profile something,
just Google and maybe there's a tool out there that actually does it for you.
OK.
So I'm going to switch over to the hands-on profiling part now.
This is the first time we're doing this.
So let's see how this works.
Oh, actually, just a very brief setup before I get into it.
So the scenario we're running is a very, very, very basic,
simplified semi-join
on strings with a count afterwards. So we're just saying, please count how many elements in my right
relation have found a join partner in the left relation. So I mean, basically, it's just two
for loops here. We add stuff to a hash set. If we found a partner, we increment a counter, and at the end, we return that counter.
Very small, very basic thing, because I can show this
on my computer here so you can actually see the full code.
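For reference, a rough reconstruction of what such a method might look like; the actual code shown in the lecture may differ in names and details. Note the by-value parameters, which become relevant later in the walkthrough:

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Rough reconstruction of the profiled method. Note: both relations are taken
// by value here, so every call copies both vectors of strings.
uint64_t count_tuples_with_join_partners(std::vector<std::string> left,
                                         std::vector<std::string> right) {
  std::unordered_set<std::string> build_side;
  for (const auto& l : left) {
    build_side.insert(l);  // build a hash set on the left relation
  }

  uint64_t count = 0;
  for (const auto& r : right) {
    if (build_side.find(r) != build_side.end()) {  // probe: found a join partner?
      ++count;
    }
  }
  return count;
}
```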
Again, for doing this like these toy examples,
there's always a trade-off between showing purely toy
examples or actually showing something
where there's a bit of progression.
But usually, you'll do this in code bases that are a lot larger or you're looking at very, very,
very specific things that are not very easy to do, both of which are hard to showcase in a short
example here. Yeah, this is just general information, like the question we had up here
before. If you want to profile something, always use the -g flag for C++.
This will add debug symbols to your build, because otherwise the compiler will have no idea,
the profiler will have no idea which function you're actually in or what the name is or where you are in the code,
because you're explicitly saying, I don't care about this.
And one small random detail: you need to have, well, you should generally use this -fno-omit-frame-pointer flag here to actually keep the frame pointer in the frame pointer register. The thing is that, back in the day, Intel didn't have that many general purpose registers, so someone decided to say, ah, let's use the frame pointer register also as a general purpose register. Then you lose this information. So if you don't do this, that might happen. And then if you want to do call graphs and see where I'm jumping through the code, this will probably break if I don't do that.
And for perf specifically, you need to have this small command here at the bottom, which actually allows you to profile kernel information.
Otherwise, Linux will just say you have no rights to do this.
OK, so let's see if I can kind of switch to this here now.
And I want to move.
And now I actually want to stop.
I want to mirror something so I can actually see what I'm doing.
OK.
I've prepared something.
My printout, so I actually know what I'm doing here.
Because otherwise, it's easy to kind of lose track of where we're at.
OK.
So the very first thing we want to do, I just like I
said, this is going to be full run-through. So first of all we
probably want to actually this is probably... is this okay? Okay good. So very basic setup.
Build type.
So generally, what we want to do is, first of all, we want to build something to run.
That shouldn't be a surprise for most of you.
So here, again, we're doing, in CMake, there's this predefined RelWithDebInfo, which says: this is release, but build it with debug information.
If you want to follow along, feel free to follow along.
For the sake of time, it might be tricky to keep up,
because some commands will take a while.
I'll switch back will take a while.
I'll switch back and forth a bit.
This is all recorded.
The slides contain all the commands, all the information.
If you want to do this later, feel free to do this.
So, yeah, first of all, I just want to show you the code we're running.
So at the bottom here, we just have some random setup code.
We're doing some initialization.
We're doing the Google Benchmark code here,
which is not super interesting.
We're generating some data.
Again, feel free to dig into this.
But what we're actually looking at
is we're looking at this count tuples with join partners
method here.
So if you have some experience writing C++,
you'll probably also immediately detect some issues with this code, which we'll kind of go through step by step.
But let's say this is the code we want to profile now.
Now we are a database engine that compiles its queries, and we have this super fancy generated code here, and we see it's kind of slow.
So this is kind of what we want to do for the first thing.
So first of all, we kind of want to build everything.
While this is building, let me just quickly go into here.
I think your tabs have different font sizes, right?
So like the first tab has bigger font size than the others?
Oh, okay.
Okay.
This kind of...
Okay.
Okay.
As long as people can read stuff.
Okay, so what we do, we have this small benchmark here that we can run.
Let me just run this.
This will give us some output.
Hopefully, this works.
Okay, it's a bit slow actually right now.
Maybe someone else is working on the server.
So first of all, let's just rerun this.
But it's kind of hard to...
Yeah.
It's weird if you usually type with a different keyboard and then you use this one.
Oh, it's repetition, isn't it?
Okay.
So, generally, we want to do things a few more times to actually see if the numbers
we're seeing are good or not because if they're not, then it's kind of hard to see if we're
making progress if they're very random.
And for larger code snippets, they usually get a lot more noisy.
So this will hopefully show us that the performance is somewhat the same.
We can see it's somewhere between 590 and 600ish nanoseconds per tuple.
So if we go back to the code now, this kind of shows, we learned in the lecture a while back,
that there's something like branch mispredictions on ifs,
and so we can use predication to solve this.
So, this would be like, we learned this in the lecture.
So, let's actually fix this and say that we want to count.
So let's say we want to do this. Sorry for the weird indentation here.
I'm not very good with Vim.
And it's a bit slow.
OK.
OK, and let's say we have this.
Actually, can I set number?
Can I unset it like that?
No, OK.
Huh?
It's like no number.
Nice.
So let me just copy.
Oh, it doesn't have anything.
Okay, never mind.
I wanted to show you in Godbolt.
We have vector, string.
These can all go.
We don't care about that. Yes, then we want to, yeah, I'm on that. So we've been running stuff with GCC 11 and with O3, so let's see if this hopefully does what, well, it won't, but what we want it to do. So yeah, let's quickly go back to the other code.
And we can see here, if we use Reveal linked code,
we can see there's a jump here.
So if this value is not equal to whatever,
I don't know whether to check, the comparison's right here.
If it's 0, then we jump, which is somewhere up there.
Oh, well, it doesn't really matter right now,
especially if we have a branch, and then we add something,
and then another branch somewhere along the way.
We want to get rid of that branch because it's slow.
So we do this.
We compile it.
And then we see that exactly the same code appears.
So again, we've just randomly guessed at something, and it doesn't make any difference. I can compile this and I can guarantee you the performance will be exactly the same, because it's the same assembly code. So this is one of these things. Well, we learned something. This is generally what people usually talk about when they say premature optimization is the root of all evil.
So I'd argue that this code is still OK to read.
But people making random guesses about performance,
optimizing something and saying it's going to be faster, well,
it's usually not.
Interestingly enough, if I switch this to Clang 16,
just for the sake of comparison.
This has a predicated version in it, one second.
And actually, if I switch to the other version, it also uses predication.
So Clang apparently says I always do predication, while GCC says I don't.
Very hard to generalize.
Let's see.
At least it did yesterday.
Let's see if I can.
Yeah, same code.
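For reference, a sketch of the two variants being compared here (names made up); as just shown in Compiler Explorer, GCC 11 at -O3 produced the same branching assembly for both, while Clang predicated both, so the source-level rewrite changes nothing:

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Variant 1: explicit branch in the probe loop.
uint64_t count_branching(const std::unordered_set<std::string>& build_side,
                         const std::vector<std::string>& probe_side) {
  uint64_t count = 0;
  for (const auto& r : probe_side) {
    if (build_side.find(r) != build_side.end()) {
      ++count;
    }
  }
  return count;
}

// Variant 2: "predicated" formulation that unconditionally adds 0 or 1.
uint64_t count_predicated(const std::unordered_set<std::string>& build_side,
                          const std::vector<std::string>& probe_side) {
  uint64_t count = 0;
  for (const auto& r : probe_side) {
    count += (build_side.find(r) != build_side.end()) ? 1 : 0;
  }
  return count;
}
```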
The question up there, yeah?
Is there any difference in the compilation time when you try to compile what we're doing?
No, no.
This is tiny compared to whatever's going on.
This is completely irrelevant.
This is a very, very, very low level performance decision that the compiler makes.
So just generally to show you, this is actually
completely irrelevant.
So actually, for the sake of time, this took a bit longer than I expected to.
Let's just get rid of all of these.
Just make, again, just to make sure we have everything.
Okay, so what we actually want to do is now, if we want to, oops.
We actually want to record this, which, like I said, is the perf record command. So this will run now for a bit and show the output. So you're running the normal command, but in the background you can see it's woken up five times to
write data. I'll show you in a second how much data has been written. The first
time for some reason it always takes a while.
Okay and then we can use the perf report command.
And then we can look at this and we can see, well, we're spending some time here in some
random instruction.
We're spending some other time in another random instruction.
Another time here in node pointer static cast something.
What was this?
This is some range hash.
This is some hash table internal move instruction.
So it's very hard for us to actually understand
what's going on here.
Personally, for me, it would be very, very tricky to do this.
Yes?
You were using tab there just to go through all the instructions.
Oh, yes.
Oh, yes, okay.
Yes, so I'm using tab, the tab,
so I can switch between the hottest instructions.
So you can see I'm here at 21.
If I go backwards, then I'm at the lowest one.
And this is just cycle down through the, yes, thanks for the pointer.
So this is not really helpful.
So what we can do is we can do perf record minus g, which says we're interested in the call graph.
So we do the same thing.
Now I want to see where the function call is coming from.
What am I spending time on?
So this will run.
I can perf report again.
And now we can see there's these small pluses on the left hand
side there somewhere.
And we're interested in our join something here.
And we can see there's our join and some libc start call main,
which probably is not in our control.
And then we have our count tuples with join partners,
and there's std hashbyte.
So I'm using the plus character symbol to go into these.
We can see there's some glibc call, hash bytes something, a memmove. So for me, it's not really clear what's going on.
So there's another way that we can go into this,
which is, execute this.
Okay. So, what we can do is we can say,
currently what it just did,
it used the frame pointer to detect.
So, if I just do minus g, it detects via the frame pointer: where am I, where have I come from? And it didn't have enough information for us.
So another thing we can do is we can use the dwarf format.
So what it does, at every sample, it takes a snapshot of the stack.
And then now when I run this to report it,
now it will go through all of these snapshots
in the background and try to figure out
what was the information that I've stored previously.
And it uses the DWARF debug information because, like we said, we have -O3, -g.
So we have the debug information and we're using this to actually get some insight.
So this will take a while at the beginning.
And this will now hopefully be helpful.
So this is all done at report time now; it takes a bit longer. The other one actually did it at recording time.
So when this is done in a second, we can see here, this is now a bit cleaner.
We can see we're interested in our benchmark join function here.
Okay, so this is the count tuples with join partners; this is the method we're interested in, because it's a hot loop. But interestingly enough, we see that about, what was this, nine percent, eight, nine percent of the time we're spending in some vector logic. So this is kind of unexpected. Let me just jump to the code. Can someone maybe tell me why you think this is the case?
Why are we spending so much time, or like 8% of our benchmarking
loop in some vector code, even though we're
working with an unordered set?
Any guesses? Why would we be spending time in vector-specific methods in this?
Cool, yeah.
Yes, very good.
Because we are copying the vector.
Like I said, any static analysis tool will probably detect this for you.
We're only ever reading left side and right side here.
But we're copying them on each invocation of the method.
So this profile here kind of told us: hey, we're spending some time here.
And unfortunately, I can try to zoom out.
But you can see there's a vector construction and destruction
right at the right.
This is always a bit shitty to view in perf, unfortunately.
But yeah, so basically the very, very simple fix here: right now we're copying these, so we stop copying them and take them by reference instead.
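A minimal sketch of that change, with the function name and types assumed (the actual task code may differ):

#include <cstdint>
#include <string>
#include <vector>

// Before: both sides are passed by value, so every call copies two vectors of strings.
// uint64_t count_tuples_with_join_partners(std::vector<std::string> left,
//                                          std::vector<std::string> right);

// After: we only ever read the inputs, so take them by const reference.
uint64_t count_tuples_with_join_partners(const std::vector<std::string>& left,
                                         const std::vector<std::string>& right);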
And this actually helps: we were at somewhere around 580 nanoseconds before, and now we're at 530.
So this gave us about a, let me just
see if this aligns with the numbers I had.
Yeah, this gives us about 40 to 50 nanoseconds improvement. So now again, next step, we're
still not convinced that this is good. So we just run the same record command again.
We're going to stick with this dwarf,1024 setting. By default, dwarf captures 8 kilobytes, so 8,192 bytes, of the stack per sample. We're telling it to only capture 1,024 because, A, in this example we don't need more, and B, if I record more data, the report step here takes... oops.
Yeah, clear didn't work. Oh, I had pressed Q, that's why.
So if I capture the full 8 kilobytes, then this will have, I don't know, a few million more entries to chew through, and the report will take like two minutes to run, so we'll try to avoid that if possible.
So yeah, the dwarf one takes a bit longer, but it gives us more information.
And so generally, you need to be a bit patient.
But now we are there.
We're still interested in our benchmark join information.
So, just also FYI, there's some stuff like this 0xFFFFF here.
This, I think, is the way perf represents anonymous functions it has no information about, so for example lambdas.
We can also see there's some libc stuff in here somewhere, libc.so.6.
So there's a lot of stuff in here.
But currently for now, again, I really
encourage you to play around with this if you haven't yet,
just to get a feeling for what information is there.
What's the difference between opening up
the children, so pressing the plus button,
and all the different lines right now?
So if you go into the join, you get all the functions
that are in the join function, right?
Yeah, yeah, yeah.
So what do the different lines show, then?
So basically, I'm not 100% sure if this aligns fully, but this line that's yellow right now should also be somewhere down here, somewhere below.
If I'm not mistaken, it's probably this one here.
By the looks of it, no.
So this one here... where's the other set?
So yeah, this yellow one is probably this one here.
We can see it's 44% and 44%, so this probably aligns somewhat.
You need to pay attention here that it depends on whether you have relative or absolute percentages.
I think here they're all absolute percentages.
Yeah.
Yeah, they're ordered by the time spent in them.
Exactly. So now I'm just stepping in to say: okay, within this count tuples with join partners, the one we're interested in, these are the methods that are being called.
So you can also invert this.
I'm not going to show this for the sake of time, but there's a --no-children option, I think, that you can pass to perf report, and it will invert this.
So then, for example, if I'm interested in exactly this yellow method here, it'll show
me all the places it was actually called from to see, for example, if that's correct.
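In command form, that would be something like:

perf report --no-children

Then expanding an entry shows the call chains leading into it rather than what it calls.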
And the next thing we're kind of seeing in here again, so we have count tuples, we have
our unordered set, which is expected.
So we know from our database two lecture that hash joins will take some time.
They're fast, but they're not incredibly fast compared to other things.
So we're seeing an unordered set here.
We're doing some stuff.
But we can also see that we're spending some time in a basic string constructor and a basic string destructor, which, again, is about eight, nine, ten percent. And that's again something we're a bit confused about, because we shouldn't be creating strings, should we? So can someone maybe tell me where we are creating strings that is not within the hash set itself?
Distinction here, important distinction.
We'll get to that later.
Maybe focusing on the line that is currently
highlighted in white-ish.
Yes?
Perfect, exactly.
So we have just moved to C++ from the Java land.
We've never taken care of these things.
And suddenly we see, ah, this again makes a difference.
Before, for every element on the left side, which is the vector, I was creating a string copy of the value. Now I'm saying: please just use the one that's already in there.
Same here.
So this is, again, a very, very, very common C++ beginner
mistake.
Also, not so beginner mistake if you're not thinking about it.
This happens all the time.
Again, this is why you probably want some tool to tell you,
are you sure you want to take a copy here?
Because you're only using it as a reference or a const ref,
actually.
And that's not super important right now.
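A minimal sketch of that loop, with assumed names; the whole fix is the const reference in the range-for:

#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

uint64_t count_matches(const std::vector<std::string>& left,
                       const std::unordered_set<std::string>& build_side) {
  uint64_t count = 0;
  // Before: for (std::string value : left) constructs and destroys a copy of
  // every element, which is the basic_string time we saw in perf.
  // After: bind a const reference to the string that already lives in the vector.
  for (const std::string& value : left) {
    count += build_side.count(value);
  }
  return count;
}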
OK, so we've written this.
Let's make our code again.
And then I can join benchmark.
And hopefully we can see another speed up.
We are now down from 540 nanoseconds
to 460 nanoseconds per tuple.
Yay.
So again, we want to go to the same thing.
I'm still not convinced we're at the end of what's going on here.
So we can again do the call graph and get some information on this.
Perf report.
Again, every time this will be a bit quicker, luckily,
because we're collecting less data because everything's quicker.
But also in VTune, for example, this sometimes, if you do a big benchmark on VTune, I've waited for an hour for this to happen.
It's also interesting that the local copy of the string was not removed by the compiler, because you would guess that for a very small for loop the compiler could actually remove the copy.
Yeah, so the compiler didn't optimize anything out. Maybe the compiler says: I can't do that, I'm not sure it's legal C++ to do that. Maybe it could; in any case, it didn't, and we saw in profiling that we're spending time in string construction that we don't want.
Yeah, again, here we can see a lot of time is spent in some anonymous function within Google Benchmark.
So that's not super interesting, because we know we're in the count tuples with join partners.
And now we can see we're spending all of our time in the unordered set.
So that's kind of good.
We're kind of at a spot where we're saying, well, OK, this is how fast we can be.
We're not wasting a lot of time on other things.
So in our code, the performance bottleneck now seems to be that we're spending all our time in unordered_set methods, which is good in general, because that's exactly what we want: we know that the hash join is probably going to be the most expensive part here.
So now our colleague comes along and says: hey, I found this super, super cool new hash set that we can use, which is way quicker.
Luckily, we've already included it in our example.
For some reason, I can never spell unordered correctly.
So there's this unordered dense set.
So we can actually replace this.
And so what we want to do here is, what was it?
Dense set.
So we're just changing that.
We're just swapping the one implementation here.
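Assuming the dense set in the example is ankerl's unordered_dense library (the idea is the same for any drop-in replacement), the swap is essentially a one-liner:

#include <string>
#include <unordered_set>
// Assuming the "dense set" here is ankerl's unordered_dense library; if it is a
// different one, the idea is the same.
#include <ankerl/unordered_dense.h>

// Before:
// using JoinHashSet = std::unordered_set<std::string>;
// After: a (mostly) drop-in replacement with the same basic interface.
using JoinHashSet = ankerl::unordered_dense::set<std::string>;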
Out of curiosity, how much do you think this will speed things up?
Random guesses, please.
Twice.
Twice? Okay. Do we have higher, do we have lower?
Sorry?
How much time did we spend on that operation?
Before we had, what was it, basically all the time in our count tuples with join partners, around 50% of the entire run. And if you annotate the instructions, I would guess that you would see that even the stuff in there is also hashing and hash-table-based work.
Basically, I would expect that you'd spend 95% plus in hash table logic.
Okay, so we have about 2x.
That's a good guess.
Let's see what happens if I make this and run it.
We actually get a 3x improvement.
So this is nice; 2x would not have been a bad guess either.
And also, Thomas commented here about how much time we're seeing spent where.
This is one of the fun parts about profiling: these numbers are always kind of hard to interpret.
So because we're saying we're spending 94% of the time
in the join benchmark function, and then we're
having 50% of this in the method we're at,
but we're also spending 50% roughly in whatever this is.
So there's, again, some other things.
And we're not really sure.
Again, it's a bit tricky.
It's a bit convoluted.
So actually, yeah, this is one of the takeaways
I think that's important from this lecture
is that it's actually quite tricky to fully understand
what's going on.
It might take a few iterations.
You might need to combine this with different tools.
You might need to use a flame graph to visualize something,
just to kind of go step by step, understand what's going on.
And again, this is a very, very basic example.
If you're profiling something bigger, for example the VTune trace I just showed you was from Velox, Facebook's new query engine that we looked at for a research project, then profiling that is very, very hard. It's a big mess.
First of all, you don't understand the code.
You don't know where time is being spent.
It's doing some random things you don't understand.
And then kind of digging through this, stepping through this.
So this is often paired with debugging, trying to see, okay, I set a breakpoint in some method.
Where did I stop? Where did I come from?
One specific problem I had is actually I was taking two completely different paths, but
I wasn't sure, I couldn't see that in profiling.
I only figured that out while debugging.
So this is always kind of paired together in these kind of things.
So for example, with the string constructor we were spending a lot of time in, I kind of told you where that was coming from.
Another approach would be to actually set a breakpoint on the string constructor, run your code, and see where you're coming from.
Then you can see: ah, okay, I'm spending time constructing strings while calling my method, which is kind of unexpected.
Okay, so we saw we're actually a lot quicker, but I'm convinced we can still be quicker.
So let's just profile this again.
Let's report.
And we can see we've gone down from 25 megabytes to 9 megabytes.
Or was it? I think it's... no, it's megabytes. Before, I think it was 25.
I don't think that's true. I think it's events.
It says here 9 megabytes, and it said 25 megabytes before, so that matches what we had.
Okay, so we're in here again. Oh, there are page faults now.
Also, now that everything is getting a bit quicker, the output here starts to get a bit more noisy, because some other things suddenly become important.
And we can see we're spending all the time in the hash set again, which is good. But now I'm convinced that we can actually dig into the hash map and look into this a bit deeper. And surprisingly, we can again see some time being spent in some basic string constructor.
And I think there's a destructor somewhere here.
Probably somewhere down here, maybe, or not.
It says, again, we're spending time in constructing strings,
but we don't really want to construct strings, do we?
So usually I would maybe debug and see where we're getting this from.
Can someone tell me why we are seeing a lot of string constructions in this example?
We've constantly been talking about how we've been copying strings.
Where are we still creating strings in this code, within the hash set? Yes?
Yes, very good.
So we're emplacing a value here.
We've taken something by reference, but we're actually storing a std::string in this hash set here.
So what we want to do here is use a nice C++ feature called string_view, which is basically just a pointer and a length.
So I have my string somewhere in memory, and now instead of copying the entire string with everything that belongs to it, with the short string optimization and its 24 bytes of overhead and whatnot, and a memory allocation on the heap, I can just say: here's a pointer, and please read this many bytes.
And that's my string view.
And that's what we're doing here.
So again, we'd only need to change this one very nice line.
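A sketch of what building the set over views could look like, with assumed names; the important caveat is that the strings being viewed must outlive the set:

#include <string>
#include <string_view>
#include <vector>
#include <ankerl/unordered_dense.h>  // same assumption as before

// Build the set over views instead of owned strings. A std::string_view is just
// a pointer plus a length, so inserting one copies 16 bytes, not the characters,
// and never allocates. Caveat: the std::strings in `right` must outlive the set.
ankerl::unordered_dense::set<std::string_view>
build_side(const std::vector<std::string>& right) {
  ankerl::unordered_dense::set<std::string_view> set;
  set.reserve(right.size());
  for (const std::string& value : right) {
    set.insert(value);  // implicit conversion std::string -> std::string_view
  }
  return set;
}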
Then we can make everything again.
How much speedup?
This is less obvious.
So we're seeing we're somewhere in here in some vector, because that's probably what it uses underneath to store something.
There's a basic string here.
This is all string construction.
How much speedup?
This is more of a question.
Do you know why it's not showing any number anymore?
Like, on the line?
This here?
Probably because it doesn't have any.
I honestly don't know.
So here we can see there's, for the create,
there's somehow numbers again here.
You can see there's a lot of time being spent here in allocations.
Just random guess.
There's no wrong answer.
Ricardo.
3x.
OK.
Why do you think 3x?
OK.
That's a good answer.
OK.
Let me just run this join benchmark.
And we have 4x.
So it's somewhere between 3 and 4, depending on how much load is on the server.
But yeah, so actually, again, creating these copies was very expensive.
And if I kind of dig into this, again, the example is a bit oversimplified, but as we dig through it, I can see there are a lot of methods here around basic string construction.
This one here is going to be a killer.
We kind of know that memory allocations are expensive.
We don't want to do memory allocations.
So getting rid of these will have a huge impact on performance, which is what we did.
Just about your general workflow.
So I guess every optimization step, you should really go back and run your unit test suites
or whatever again, because everything you did,
like, let's exchange this data structure.
Now you have different guarantees.
Yes, yes.
Let's now not copy.
So what about ownership?
Yeah, so this is a very, very, very good point.
Obviously, you need to test things.
While we were actually developing this, I think Richard had some weird cases where he started making things references or not copying things, and stuff broke.
So obviously, you need to have a very good test suite for this.
Probably, you're not going to exchange your hash set
in the running system.
That's a big decision.
So a lot of these things try to be drop-in replacements for std::unordered_map or std::unordered_set.
But I think the thing that kills performance most in std::unordered_map and set is that they guarantee pointer stability if you do some weird things.
And basically, that limits your design so fundamentally
that you can't really apply that to other things.
Because then you need to have pointers to everything
and have indirections.
Because otherwise, you couldn't keep the pointer stability. That's why these sets are as slow as they are: the C++ standard requires this. So if you care about performance, you probably don't want to use them.
So yes, we are actually at the end of the example here, which is good. Good timing.
I'll show you one or two more things.
For those who want to try this, we're still not at the end here.
There are some things that can be done.
You can still optimize this, which is a very nice example.
So these were kind of, let's say, physical changes; we changed some physical properties.
The other fix is actually a logical fix,
which kind of also hints that in database systems,
if you do logical optimizations, this
will make huge difference compared
to physical optimizations.
So for this example, consider, if you're looking at this, how we are building the hash set, and which events go into the hash set.
So yes, with that, I will only show you one more thing in the slides, because I kind of
showed you the, actually, did I?
Now I can see it's only 3 megabytes, which is coincidentally also exactly what it says here, Richard.
Yeah, so you're collecting exactly 1,000 megabytes.
OK, just some fun.
OK, let's look at this.
Actually, which one did I want to annotate?
OK, let's just annotate the entire thing here.
So here, if I'm in this overview,
so this is usually like perf report is usually
the overview we're going to be in.
Well, let's say we're interested in this line.
If I press Enter here on the count tuples,
it says annotate count tuples with join partners and some other things.
But let's just press Enter on this again.
And then this will give us the instructions that we have, the x86 instructions with the
code lines above it.
So in this case, we're seeing this mov instruction here, and the C++ line above it is the one that generated it.
And now we can actually see that 40% of the entire execution time is spent on this single mov instruction, which is quite interesting.
Can someone maybe tell me why this is the case,
looking at this line of code?
How can one move instruction be so incredibly expensive?
So generally, in a hash table, why is a hash table slow?
If it's slow.
Yes?
Okay, so that's actually a good guess.
But looking at this instruction here, or this line of code,
does this look like we're handling a collision right now?
Actually, it could be.
So again, let me rephrase.
Would collision handling be done in one move instruction?
Okay.
Still, it's a very good guess, and it's definitely not wrong.
But why are we spending so much time in one move instruction here?
Okay, I'm just going to tell you.
So basically what we're doing is we're loading data. This is a random memory lookup. We're issuing a mov instruction.
We're reading data from memory into our register.
And if I have to wait for it to be fetched from memory,
this is going to take forever.
So what we're seeing here is that this instruction will just wait.
Maybe the CPU could run one or two of the instructions below it, but actually it can't, because it needs to compare what we're reading from memory.
So we're reading from whatever is in RAX into EDX,
and then we're comparing that.
So we can't really move on.
I need to wait for this to happen.
So we're spending a lot of time here on memory loads,
which is actually also kind of a good thing
if we're thinking about the design we're using.
50% of the time we're just spending on waiting.
We're not going to be able to optimize this
unless we have a different structure in the hash table.
But if it's sufficiently large, everything
is just going to be memory bound at some point.
OK.
Actually, I can also show you something else.
I showed you how to get there that way; there's also perf annotate.
How do I spell this? Annotate.
Why was that wrong? Oh, okay.
Yes, and it just goes directly to where we were, and we can see the mov.
There's the same mov instruction again.
So it's just a different way to get there.
Okay, with this... yes, there's a question, yes?
So if I want to optimize and reduce my CPU cycles, is there a way in Google Benchmark to say it should pre-warm and fill everything into buffers first, so that I don't get any cold loads and stores?
So in Google Benchmark, first of all, you should generally do multiple iterations.
You can do multiple repetitions.
You can play around with the setup time.
And I think there is actually an explicit warm-up time, too, that you can set.
So basically, if I go here, let me see.
So in Google Benchmark, I didn't really show you this.
I glossed over this.
So this is basically the main benchmark loop.
I'm writing for auto underscore in state, because we don't care about the loop variable itself.
This is how I tell Google Benchmark: please run my benchmark.
And this will then run for x iterations, for a given amount of time or a certain number of iterations.
And this can be done a few times without it measuring anything while it warms up.
So then you're kind of executing it in a realistic setup.
Yeah.
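A self-contained sketch of such a benchmark; the helper functions are made-up stand-ins, and MinWarmUpTime needs a reasonably recent Google Benchmark version:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Tiny stand-ins for the lecture's code so this sketch compiles on its own.
static std::vector<std::string> make_side(int n) {
  std::vector<std::string> v;
  for (int i = 0; i < n; ++i) v.push_back("value_" + std::to_string(i));
  return v;
}

static uint64_t count_tuples_with_join_partners(const std::vector<std::string>& left,
                                                const std::vector<std::string>& right) {
  std::unordered_set<std::string> build(right.begin(), right.end());
  uint64_t count = 0;
  for (const std::string& value : left) count += build.count(value);
  return count;
}

static void BM_JoinBenchmark(benchmark::State& state) {
  // Build the inputs once, outside the measured loop.
  auto left = make_side(10000);
  auto right = make_side(5000);
  for (auto _ : state) {  // the main benchmark loop: for (auto _ : state)
    benchmark::DoNotOptimize(count_tuples_with_join_partners(left, right));
  }
}
// Repetitions re-runs the whole benchmark several times; MinWarmUpTime (available
// in newer Google Benchmark versions) runs the loop unmeasured first to warm up.
BENCHMARK(BM_JoinBenchmark)->Repetitions(5)->MinWarmUpTime(0.5);
BENCHMARK_MAIN();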
So again, I definitely encourage you to check this.
If you're curious on a bit more complex example of this,
feel free to also check out on our group's GitHub page
the C++ repositories that were mainly done by me.
You can find a lot of Google Benchmark setups in there, because, yeah, I use Google Benchmark a lot and it's very helpful for figuring things out.
Okay, just one tiny bit of theory at the end.
In the example I just showed you, at the end it was quite nice, because you could see we're spending 50% of the time on the mov instruction.
Sometimes, what you will actually see is that we're spending 50% of the time on the compare instruction instead.
That's not the case in this example, but it's something you may see while profiling.
And this is actually quite common because of how this information is gathered. We kind of know that it's the mov instruction: we know the compare instruction is not slow, it's a very fast instruction, and we know the mov is the one that's slow. But that's hard to guess if you don't know about it. And the reason for this is something called skid.
So this is the problem that there's a delay, like a physical delay, between the time there's an interrupt and the time the information is actually stored.
So there's an interrupt, and by that time, my instruction pointer might already be on the next instruction.
So in this case, my mov instruction is executing, but my instruction pointer has already moved on to the compare.
And then it says: oh, I'm spending time on this compare, while the interrupt was actually triggered while still in the mov instruction.
So it does happen that the wrong instruction gets recorded.
If you're looking at it at this level, then first of all, it's very helpful just to have a general understanding of how these things work.
But second of all, maybe something like the uops.info instruction table page can help.
Actually, I might just be able to open this, just for the sake of showing it.
So this is the website here. I don't know where my mouse is.
So we have this nice table here.
Let's just look at compare.
I mean, there are a lot of entries here, and we can kind of see the throughput and the latency of these instructions.
And then there are the mov instructions. It's not showing the latency for memory moves, is it?
Damn it.
I was hoping to show you something else.
Never mind.
So this might be helpful if you're looking at instructions
and how long they take and to see if this actually aligns.
Again, it's always kind of a figuring out and playing
together with this.
And if you're interested in very specific instructions,
I said at the beginning that perf uses the clock cycles
to trigger overflows and then get the information.
If you're interested in very specific events triggering
things, so if I'm very interested in which
instructions generate cache misses or branch mispredictions or something like that.
I can use this colon P in perf behind the event.
So I showed you earlier that I can do -e, give me the L1 dcache stores.
And I can put a :p behind that, or :pp or :ppp, which gives us higher precision.
That basically says: okay, explicitly track not the clock cycle on which something happened, but the exact instruction that gave me the cache miss.
And this will give you more precise output for these things.
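For example, something along these lines; the exact event names and whether precise sampling is supported depend on your CPU and kernel, and perf list shows what's available:

perf list                                            # list the events this machine exposes
perf record -e L1-dcache-stores:p ./join_benchmark   # sample that event with precise attribution
perf report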
By default, this happens for clock cycles.
And there's always a bit of timing issues in there.
But for this, again, at some point,
you need to start googling or checking the documentation.
So there's this 5,000 page Intel manual that has all the information you will ever need
about developing stuff for Intel.
And they have a long list of things there.
And that's why I also didn't have anything on the slides here, because it very, very
much depends on the system you're running.
Okay.
So just to wrap up, perfectly in time, I guess.
Today we looked at profiling.
We kind of looked at instrumentation and how to do it manually and which tools do that, at sampling, and at what hardware counters are.
These are the three things that are important to remember.
Hardware counters, if you're looking at this, are your friend.
And in the hands-on part, we looked at using perf: how to play around with it, what we can see, what we maybe cannot see, and how to navigate it a bit.
Plus one or two small hints on the setup for profiling in C++ that are important to remember.
Just in case you haven't heard of it: there's something in CMake, which a lot of C++ programs are built with today, called RelWithDebInfo, which says exactly that: please build in release mode, but give me debug symbols so that I can do this. That's very helpful.
Because otherwise, there's no way for you
to know what's going on.
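Concretely, that's just the build type when configuring, for example:

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -S . -B build   # optimized build that keeps debug symbols
cmake --build build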
If I had more time, I would show you
what the output is without debug symbols.
It's very messy.
OK, so next session will be on Tuesday,
and we will discuss the solution of task number one.
And we will introduce task number two.
This session will not be recorded,
because we're discussing the solution.
And if I have a recording of the solution online,
then next year we cannot use this task again.
Next Wednesday, there will be a lecture
on multi-core parallelism, not given by me.
And with that, are there any questions, comments
on profiling, anything we covered today,
or any other lecture, I guess, also, in regard to profiling?
OK, no questions.
Perfect.
I wrote in Moodle, there's a setup.
You can all use this.
You can all use perf there.
All of the commands I ran here, this is all also in the Docker.
So this is not a different setup.
This is exactly what you can do.
You can SSH into it. You can run these things.
For this to work, I need to start Docker in privileged mode, so I'm not a hundred percent sure how much you can break.
Please, if you play around with this a bit, be a bit careful in general.
It's not too bad, but this is also the server everyone's developing on for the programming task, so try not to break anything or make something super broken.
You don't have admin rights in the Docker container, so you should be fine.
But I'm just saying, if you see something like this is a very destructive instruction or command,
maybe don't run it or check with me beforehand.
I'll leave them online for a bit so you can play around.
I got a request yesterday about perf in the containers for development.
I'll see if I can set this up for the next exercise, that you can also just have perf
while developing in them, which might be helpful.
We didn't have that in the setup yet.
I figured out now how to do this.
It's a bit hacky, because Docker usually doesn't want you to run perf inside of it; there are kernel mismatches and those kinds of things.
But if you either need it for the current task or want it for the next one: for the next one, I'll do it.
If you need it for the current one, let me know.
Write me an email, and we can do a quick restart of your container to actually get perf into it, if you're curious.
With that, thank you very much, and bye.