CppCast - Performance Analysis and Optimization
Episode Date: December 6, 2018

Rob and Jason are joined by Denis Bakhvalov from Intel to discuss C++ performance analysis and optimization techniques. Denis is a C++ developer with almost 10 years of experience. He started his journey as a developer of desktop applications, then moved to embedded, and now works at Intel doing C++ compiler development. He enjoys writing the fastest-possible code and staring at the assembly. Denis is a father of two, and he likes to play soccer and chess.

News
- Meeting C++ / Meeting Embedded Conan trip report
- Introducing the C++ Lambda Runtime
- SIMD Visualiser
- Announcing Live Share for C++: Real-Time Sharing

Denis Bakhvalov
@dendibakh

Links
- emBO++ 2018: Denis Bakhvalov on dealing with performance analysis in C and C++
- Code alignment issues
- Basics of profiling with perf
- Performance analysis vocabulary

Sponsors
- Backtrace
- JetBrains

Hosts
@robwirving @lefticus
Transcript
Episode 178 of CppCast with guest Denis Bakhvalov, recorded December 5th, 2018.
This episode of CppCast is sponsored by Backtrace, the turnkey debugging platform that helps you spend less time debugging and more time building.
Get to the root cause quickly with detailed information at your fingertips.
Start your free trial at backtrace.io slash cppcast.
And by JetBrains, maker of intelligent development tools to simplify your challenging tasks and automate the routine ones.
JetBrains is offering a 25% discount for an individual license on the C++ tool of your choice.
CLion, ReSharper C++, or AppCode.
Use the coupon code JETBRAINS for CppCast during checkout at JetBrains.com.
In this episode, we discuss a new feature in Visual Studio 2019.
Then we talk to Denis Bakhvalov from Intel.
Denis talks to us about performance analysis and optimization. Welcome to episode 178 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today? I'm doing all right. Rob, how are you doing?
I'm doing okay. No real news on my side. How about you?
Not much, I guess. Well, next year is starting to get busy with training activities for me.
I've gotten a lot of contacts at the end of the year, but that's about it.
It's always exciting. Yeah, we're approaching the end of the year very quickly. November went by real fast. The next thing I have coming up is the first week of February for C++ on Sea, where I'll be teaching that one-day constexpr class, so I have all of January to make sure everything's fully prepared and ready to go.
That's good. Very good.
Okay, well, at the top of every episode I'd like to read a piece
of feedback. This week
we got a tweet from
Barney Deller and
he was replying to our episode
with Lenny last
week saying, it's nice to hear that I'm not the only
one mob programming in C++.
Thanks CppCast and Lenny.
And yeah, mob programming sounded really interesting.
I've heard of pair programming, but never mob programming.
Yeah.
Well, I've only heard of it from Lenny.
Yeah.
Well, we'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter, or email us at cppcast.com.
And don't forget to leave us a review on iTunes.
Joining us today is Denis
Bakhvalov. Denis is a C++
developer with almost 10 years of experience.
He started his journey as a developer of
desktop applications, then moved to embedded,
and now he works at Intel doing
C++ compiler development. He enjoys
writing the fastest possible code and
staring at the assembly. Denis is
a father of two.
He likes to play soccer and chess.
Denis, welcome to the show.
Hi, Rob.
Hi, Jason.
Nice to meet you.
I'm excited to be here.
You know, I should probably say that I've been a regular listener since episode number one.
Oh, wow.
Yeah, so I'm really happy to be here as a guest, not just a listener.
Yeah.
So you were actually listening from the beginning,
or did you go and catch up at some point?
No, I actually started right at... okay, maybe I'm lying a little bit. It was not episode number one. It was probably episode number three. It was the episode with Manu Sanchez. He was talking about biicode.
So it was kind of really in the beginning.
I'm kind of with you guys.
Yeah, I think that was episode three, right?
That's awesome. Thanks for being with us for so long.
Yeah.
Okay, well, Denis, we've got a couple of news articles
to discuss. Feel free to comment
on any of these, and then we'll start talking about
more of the work you're doing at Intel
with performance analysis, okay? Sure.
Okay, so this first one is a Meeting C++ and Meeting Embedded
trip report from the Conan team. And I think this is the
first year with Meeting Embedded, right? Yes.
Yeah, so they went to both conferences, and they said the Meeting Embedded conference was quite good. They presented a talk there, and I'm happy to see that they're going forward with this new conference. I hope it does well for them.
Yeah. I still want to listen to that "stop teaching C" talk; that's still on my list to do, but I haven't. Have you looked to see how well the videos are going up?
I haven't.
I meant to, because I was interested in watching Andrei's keynote from Meeting C++, which is also covered in this trip report.
Right.
Yeah.
So "what is the next big paradigm" was Andrei's keynote. And I'm a little curious as to what it was really about, because in this trip report, he says Andrei first explored how the next big things for C++ programming and the general world were perceived at the very beginning:
Threads, online voting, NLP, privacy, and ranges.
I'm not sure what they mean by online voting.
Like, do they mean political voting?
I don't have any idea.
Yeah, so I was curious about that.
I need to go and watch his talk.
Yeah.
Denis, did you go to Meeting C++ or Meeting Embedded?
Unfortunately, not yet.
Yeah, but actually, I also saw that there were a number of talks about performance and about data-driven development. So I think this is becoming the trend. Well, I'm not sure, that's just my gut feeling. I might be wrong, but probably everyone sees it the way they want it to be.
Right.
Well, I just checked their YouTube page and I don't see any conference videos from Meeting C++ 2018 yet.
I agree, I did not find any. Were you going to say something else, Jason?
No, that's what I was going to say.
Okay. Were there any other highlights either of you wanted to make about the Meeting C++ trip report?
No, I don't think anything else jumped out at me on that one.
Although it does look like there's a lot of interesting things that went down here.
Yeah, I'm looking forward to some of these talks, maybe online.
Yeah.
Okay, the next one is introducing the C++ Lambda runtime.
And this is from the AWS blog.
I have never heard of AWS Lambda before.
Have you, Jason?
I have.
I have friends who use AWS for all the things all the time because of work.
Right.
So I guess AWS Lambda just allows you to run, like, Ruby or C# or other languages, and the code kind of runs without you having to worry too much about the server configuration. That's my understanding of it.
Yeah, it tends to be like a simple snippet: you have this thing that you want to do, you just do it, and then it gives you the artifact or results back, or whatever comes from it.
Right. So now you can do that with C++. They have an SDK, and they have a pretty thorough example here, both doing a hello world and then doing a more complex AWS Lambda application with C++.
Yeah, and the example they give here is basically, I think, kicking off a transcoding of something. I think that's the final example, the C++ encoder.
Yeah.
Yeah.
So, yeah, seems pretty exciting
if you're interested in doing more in the server and web world.
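For a flavor of what that looks like, here is a minimal handler sketch using the open-source aws-lambda-cpp runtime library; the handler body and return values are illustrative, and the exact API should be checked against the AWS blog post and SDK documentation.

    #include <aws/lambda-runtime/runtime.h>

    using namespace aws::lambda_runtime;

    // Called once per invocation; the incoming event payload is in req.payload.
    static invocation_response my_handler(invocation_request const& req)
    {
        return invocation_response::success("Hello from C++ on Lambda!",
                                            "application/json");
    }

    int main()
    {
        // Hands control to the Lambda runtime loop, which polls for events
        // and dispatches each one to the handler above.
        run_handler(my_handler);
        return 0;
    }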
Mm-hmm.
Okay.
And then the next thing we have is the SIMD Visualiser project. We were talking with Jeff Amstutz two weeks ago about SIMD, so I thought this was a pretty interesting tool. You can run it just by going to this page in your browser. It basically has a bunch of SIMD code, and you can kind of see what it does line by line in a nice visualization.
Did either of you play with this on the website?
Yeah, I did.
Go ahead.
Yeah, sure.
So I played a little bit with it, so it looks really cool.
Yeah.
I think it's just another way we can leverage Clang and its tools, because I assume it is built on Clang. It's really cool. I myself still use pen and paper to visualize how the vector code works. So this tool is really handy, for beginners at least.
Right.
I was kind of hoping that the example would not rely on intrinsics, but have something that showed what the compiler would generate. But it doesn't. It's almost like I want a melding of the output from Compiler Explorer piped into this, so I can see what the compiler actually did. Actually, in a class I was teaching last week, we were looking at some SIMD stuff that was being generated by the compiler, and I was reasoning my way through about three-quarters of it and kind of wishing there was a way to easily visualize it.
Right.
Yeah, it'd be nice if we could kind of take the type of code
Jeff was talking about with his SIMD wrappers
and be able to visualize that.
Right.
Okay. And so...
Go ahead, Denis. Sorry.
So probably, like, with assembly instructions, it's also not obvious, right?
I mean, just even from their names, it's not obvious what they are doing sometimes.
Oh, yeah.
So just to have this kind of tool that will tell you,
okay, now I'm doing addition with two vectors.
Now I'm doing subtraction.
So it's pretty handy.
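For readers who have not used intrinsics, here is a tiny SSE example of the kind of code such a tool visualizes (a minimal sketch, independent of the visualiser's own samples):

    #include <immintrin.h>
    #include <cstdio>

    int main()
    {
        // _mm_set_ps takes its arguments from the highest lane to the lowest.
        __m128 a   = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);   // lanes: 1, 2, 3, 4
        __m128 b   = _mm_set_ps(40.f, 30.f, 20.f, 10.f);   // lanes: 10, 20, 30, 40
        __m128 sum = _mm_add_ps(a, b);                      // per-lane addition

        float out[4];
        _mm_storeu_ps(out, sum);                            // unaligned store to memory
        std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 11 22 33 44
    }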
Yeah.
Yeah.
Okay, and then the last article I have was just announced the other day: Visual Studio 2019, which there is now a preview out for, is going to have the Live Share feature.
And I think they first announced this feature.
I'm not sure if they mentioned C++ when they first announced it.
But this is going to allow you to, from one Visual Studio instance,
kind of send what you're working on
and let someone collaborate with you
who's, you know, maybe at some other remote location.
And as long as they're using Visual Studio
or if they're using Visual Studio Code,
the two of you can then work together.
You can see what you're debugging together,
see the call stack, see, you know, different variables.
So it seems like it's a pretty powerful feature and should really help remote developers.
Yeah. In the example they say Visual Studio or Visual Studio Code. The thing that's not obvious to me would be whether it could somehow be a mixed environment, which I would assume not, because that sounds like it'd be extremely difficult to get right. But no, actually, it does look that way, doesn't it?
Yeah, yeah. One person using Visual Studio could share to someone using Visual Studio Code.
That is my reading. That's what it says. Yes.
All right. Well, that's kind of awesome.
Yeah, yeah. And this is definitely something I see myself using, because I work with several developers who are in other locations.
And, you know, we talked so much about pair programming and mob programming last week,
but being able to do pair programming with someone who doesn't, you know, sit right next to you would be nice with this.
Well, and it says there is a host and guests, so you could do mob programming with it as well.
Okay.
Okay.
Multiple people watching you and then have like a Skype session or something to talk to them?
I guess.
Yeah.
Very cool.
The guest even gets IntelliSense from the host.
Yeah, that could be pretty impressive.
So, Denis, why don't you start off by telling us a little bit more about the work you do at Intel?
Yeah, sure. So my team and I, we mostly do development of ICC, which is also known as the Intel Compiler, but we're not limited to that. We're also contributing to the LLVM open-source project and to GCC. We basically make sure that we, I mean we, the compiler, generate high-quality optimized code for the x86 platform, for Intel architecture specifically. We're also enabling new instruction set architectures for new CPU generations. For example, when a new CPU is coming out, we need to have support for it in the compiler, so the compiler will generate new instructions for new types of hardware. That's basically what we're doing.
So when you say you contribute to LLVM and GCC, those are the kinds of things you contribute as well, to their optimizer and code generator?
Yeah.
Okay, that's cool. So, building on what you just said: when a new processor architecture is going to come out, you make sure that it's supported well. If I understand you correctly, you're going to be making sure that GCC and LLVM are ready when the CPU goes live?
Yeah, correct.
Okay, very cool. So what type of work do you do with optimizations, exactly?
Yeah. So basically we try to find new optimizations in the compiler, and we also try to tune existing ones. Just to give you an idea, let me maybe first explain the subtle difference between what we are doing and, let me call it, standard development. We have a fixed set of benchmarks, and we are not touching their source code. So the benchmarks are fixed, but each new day we will have a new compiler built from sources, say a new Clang built from the latest revision. Then we take this compiler and we build all these benchmarks, and then we run them. And it can happen, and it usually happens, that the new version of the compiler will generate different assembly, different code, for the same sources.
Yeah.
And that can cause a performance regression or gain. So that's how we are tracking the compiler's performance. And, of course, if there is a regression, we should go and fix that, well, if it's fixable, let's say. And then, of course, we are also trying to find new optimizations, like what we can do, for example, to improve this benchmark. I should say that those benchmarks we're working on are not contrived. They are real-world applications, just limited, or let's say cut down, to resemble a real benchmark. So, for example, the Perl interpreter, or, for example, the GCC compiler. We use our new version of the compiler to build GCC, and then this GCC will compile some sources, and we will benchmark that.
Oh, wow.
Yeah, so that's kind of what we are doing.
Speaking of optimizations, well, the most typical optimization... I probably can't speak about my real work, because it's kind of proprietary, but just to give you maybe a taste. For example, we tune the inlining: should we inline this particular function or not? Then we also look at, for example, loop unrolling and vectorization. Should we vectorize this loop? Should we unroll this loop? If we should unroll it, then by how much, and so on. But also, for example, most of the optimizations are trying to deduce something from the source code. For example, if you have a virtual function call, but then you can see the whole program, and in this whole program there is only one instantiation of the base class, so there is essentially only one client that implements this interface, then you can be safe in just converting this virtual function call into a direct call. So those are kind of the basic optimizations, what we're doing, what we're tuning.
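A minimal sketch of the devirtualization pattern Denis describes, with made-up class names:

    struct Decoder {
        virtual int decode(int x) const = 0;
        virtual ~Decoder() = default;
    };

    // Suppose whole-program analysis proves this is the only class that ever
    // implements Decoder...
    struct Mp3Decoder : Decoder {
        int decode(int x) const override { return x * 2; }
    };

    int run(const Decoder& d, int x)
    {
        // ...then the compiler can safely turn this indirect (virtual) call
        // into a direct call to Mp3Decoder::decode, and potentially inline it.
        return d.decode(x);
    }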
So you said...
Go ahead.
Please go ahead.
You said a new version of the compiler
will often change things.
And if I understood, you said you're doing
nightly builds of compilers.
Do you see on a
daily basis that your performance
characteristics
change from
nightly builds of LLVM? Are you
monitoring it that closely?
Yeah, frequently.
Okay.
Yeah, frequently.
Wow, that's...
I must say that, of course,
there is always some noise there.
So we kind of try to filter this noise.
So, for example, if the benchmark regressed by 0.3%, we probably will not even look into this.
But if the benchmark regressed, I don't know, 50%, then, I don't know, it's
kind of a red flag for us.
Right.
Well, if it also increased
by, like, got better by 50%,
is that also a red flag? Do you assume
something got broken in some weird way?
Well,
it's not always broken, yeah?
Because, for example,
let's imagine the benchmark which consists only of one hot loop.
And, for example, yesterday the compiler was not able to vectorize it.
But today, the compiler suddenly starts vectorizing this loop.
And, wow, we have 2x performance.
So, that's possible.
I mean, it's not probably common for, let's say, it's not happening every day.
And it's not happening for the, let's say, for mature benchmarks.
Okay.
Because, let's say, most of the benchmarks have multiple hotspots, not just a single loop, right?
Right.
So it's really rare that we'll see such jumps in performance on the nightly builds.
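The single-hot-loop scenario is easy to picture with something like this; whether it actually gets vectorized depends on the compiler version and flags:

    // A benchmark whose entire runtime is one hot loop. If the compiler learns
    // to vectorize it overnight, the whole benchmark can speed up by a large
    // factor; if it stops vectorizing it, the benchmark regresses just as hard.
    void scale(float* __restrict dst, const float* __restrict src, int n)
    {
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * 3.0f + 1.0f;
    }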
Okay.
How do you go about finding new optimizations?
Yeah, so this is the most complex part of our work.
Sure.
So basically we're doing performance analysis. What is performance analysis? Well, usually we start with profiling the benchmark. And what is profiling? We try to find the hot places in the benchmark.
Then, if you look into the profile, it will show you the hot source lines or maybe assembly instructions. If you go into this assembly view, then you can probably spot some inefficiencies there, or you will come up with some idea of how you can make it better. This of course also requires you to have some knowledge of reading assembly. And I know that not a lot of people these days are doing this, I mean, looking at and reading assembly. But still, this knowledge is really essential for doing good performance analysis. So what you can try to do next is, for example, you can try to just hack the assembly, I mean, if you can do this, to do quick experiments.
Okay.
So take the binary and say, well, what happens
if I replace this instruction?
So you cannot
modify the binary, right?
Okay.
It's not trivial to do this,
but you can
emit the assembly listing from your program
and kind of go from there.
Or actually, so there are maybe also the higher level tools that you can use.
For example, the compiler has something that is called optimization reports.
And it's also actually integrated into the Compiler Explorer.
So there is a separate
kind of window that you can
put on the side
along
with your source code and assembly.
And those optimization reports will show you what the compiler did for you with your loop, for example, or whether a function was inlined or not. So, even without looking into the assembly, you can know what to expect when you do look into the assembly, right?
So, for example, if you see that the compiler inlined your function, well, okay, you will probably not find it in the binary, right? Because it was inlined.
Right.
And the same goes for loop unrolling: it shows you the factor by which the loop was unrolled. It also shows you the vectorization report: was the loop vectorized, and if yes, what was the vectorization factor? And so on.
So this is kind of the higher-level view.
For example, if we see that the loop was not vectorized,
well, then we probably will think,
like, okay, is it possible at all to vectorize?
If yes, then probably it's a matter of cost model,
and we can probably tune it.
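If you want to see these reports yourself, the mainstream compilers expose them through flags; a sketch (the flag spellings below are typical for recent Clang, GCC, and classic ICC, so verify against your compiler's documentation):

    // sum.cpp
    float sum(const float* a, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)   // the report says whether this loop was
            s += a[i];                // vectorized or unrolled, and if not, why
        return s;
    }

    // Typical ways to request an optimization report:
    //   clang++ -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c sum.cpp
    //   g++     -O3 -fopt-info-vec-all -c sum.cpp
    //   icc     -O3 -qopt-report=5 -c sum.cpp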
Yeah.
Okay.
So when you make one of these determinations, you say, okay, this loop could probably be vectorized, you prove it out, and then you implement some changes in the compiler. How often do you end up causing a regression in some other bit of code?
Oh, yeah, that's actually the question that I was expecting.
Yeah.
So, and it happens all the time, really.
I, well, yeah, unfortunately it happens all the time.
I actually have a number of great examples that you will like.
Yeah, so.
Okay.
Yeah.
So usually, when you have a really big suite of benchmarks, it's not really possible that you will optimize one benchmark without regressing the others. It's just, let's say, a reality. We should all agree with this. But actually, you might be doing a really good thing, I mean, you can do a really good optimization without harming anything else, but still have a regression on some benchmarks. Like, for example, your optimization removes a couple of unnecessary assembly instructions.
That sounds like a really good thing, yeah?
Right.
And there is no way it can affect, let's say, other benchmarks in a negative way, right?
But for some reason you see that, oh, you have a regression on some other benchmarks.
So, and the reason for this is actually quite interesting.
And now let me maybe give you an example.
So, imagine you have a benchmark with only one hot function.
And it's just a simple function, let's say: it takes one array, just iterates over this array, and increments each element by one, for example. And life is good, you have some numbers for your benchmark, you are tracking it, and then one day, someone inserts another function that is completely cold. No one is calling this function, but it happens to be just before the function that you are measuring.
Yeah?
Okay.
So it kind of just stays in the binary.
It was not eliminated from the final binary by the compiler.
It's just there.
But what actually happens is that your hot function was moved down in the binary, and now it has a different offset.
And just by doing this, I saw the cases where performance goes
up or down by
25%.
25% is just huge.
It's just enormous
for us. Most of our optimizations are within 2%, 2-3%. If we implemented some optimization which gives, I don't know, 5%, we can celebrate right now, because it's just huge money for Intel.
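A toy sketch of the scenario Denis describes, with made-up function names:

    // benchmark.cpp

    // One day someone adds this function. Nobody ever calls it and it is never
    // removed from the binary, but because it is emitted ahead of the hot
    // function it pushes hot_loop() to a different offset, and possibly a
    // different alignment, in the final binary.
    void never_called()
    {
        volatile int sink = 0;
        for (int i = 0; i < 100; ++i)
            sink += i;
    }

    // The hot function being measured. Its source is untouched, yet its measured
    // performance can move noticeably just because its placement changed.
    void hot_loop(int* a, int n)
    {
        for (int i = 0; i < n; ++i)
            a[i] += 1;
    }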
Yeah.
So, yeah.
So I actually wrote a blog post about exactly this. This problem is called code placement, or code alignment, depending on who you ask.
Yeah.
And actually, if we look into this, the only thing that changed is the layout of this function in memory, right? Nothing else was touched. And it's not only limited to the question of whether my function now occupies multiple cache lines or not. It's usually...
Go ahead.
I was just thinking through what you're saying.
Maybe it was in one cache line, but code pushed it and now it's across two cache lines; or it was across two cache lines, and now it's in one cache line.
Yeah, so we should probably say that it's instruction cache line, right?
Yeah, but it's not limited only to iCache, to instruction cache.
There are also a number of structures in the CPU front end which can be the bottleneck in those kinds of cases.
Okay.
So what do you do about it?
Once this has happened
do you have any way of fixing this?
It seems like it would make
the binaries get very large if you tried to
always ensure that every function
started on a cache alignment or something.
Correct, yeah. So the first problem is that the binary size goes up. But the second problem is that if you try to insert NOPs into the binary, okay, you can probably insert them before the function. That will probably cost you nothing, because they will not be executed. But actually, function alignment is not the end of the story.
We can also align the loops. And if we misalign them, or choose a bad alignment for our loops, we can also cause damage to, let's say, our performance.
Right.
And if we try to align loops, we insert NOPs that will be executed. That's probably still cheap, but not, let's say, cost-free, because we still need to fetch and decode those NOPs, right? And then we'll probably just throw them away. But still, we need to fetch and decode them.
So, and actually...
Go ahead.
It seems like the NOPs are also taking up space
in the iCache.
Yeah, absolutely.
Okay, just making sure.
In the iCache, and also there is a dedicated uop cache.
It's another structure in the front-end, which kind of caches the instructions after they were decoded.
Okay, I didn't know that existed.
It's kind of: when we have already fetched the instructions and now we try to fetch them again, we will not re-fetch them, because we already have them decoded in our uop cache, the so-called uop cache, or decoded cache.
Yeah.
Okay, so I'm going to risk making the questions we're talking about here more complicated. So a function gets added, it changes the cache alignment, you do something to tweak things so that you get back the performance loss, whatever, maybe you come up with a NOP that is worth the cost, and then, I don't know, hypothetically, you're testing on an eighth-generation Core processor.
And then do you go and see what the impact was
on adding that NOP to a fourth-generation processor?
Or do you test regressions backward?
Or are you always on the latest hardware?
Well, yeah, I mean, we do this. We also track the previous generations, but let's say to some extent. Because, again, there is always noise, and we have to somehow defend against this noise. So what we can do, and we are probably doing this, is make the noise threshold for the older platforms a little bit bigger than for the newest generations. So we look closely at the CPUs from the new generation, and a little more loosely at the previous generations. It also, of course, depends on how big the regression or gain is.
So if the gain is, let's say, 10%, well, that's big enough for us to start an investigation. But I should say that the 25% I mentioned is probably somewhere on the extreme. Usually we see jumps of around 1% to 5%.
It's probable that some benchmark will go up or down by that amount
just from the code alignment issues.
Wow.
Yeah.
And now I'm curious, because I know that GCC and Clang at least have this option, and I have not spent a lot of time with ICC, but I believe it tries to at least be command-line compatible with GCC, if I have that correct. Where do these things come into play if I do, like, -mtune and say I want to tune it to this specific CPU or something? Do you do these kinds of tiny little details of tuning, taking into account these things between processor architectures?
Yeah, so I'm actually... So speaking of code alignment, I'm not sure if there is a... Well, okay.
Well, code alignment or other similar kinds of things.
Oh, yeah, sure, sure. Yeah, of course. So actually, in general, it's a good idea to pass those flags. If you know that your application will run on a specific type of hardware, let's say on Skylake CPUs, the sixth generation of Intel Core architecture, then of course you should just go for it and pass all the flags needed, the special flags for targeting a specific generation of the CPU.
Yeah, sure.
Of course, yeah. But if you want to be a little bit conservative about this, you can probably go for a minimal version of the CPU that you know for sure is the minimum, where no one will try to use older CPUs to run your application. You can just choose the more appropriate flags.
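In GCC and Clang terms, that advice corresponds to the target flags; an illustrative sketch (exact spellings and supported values vary by compiler version, so check your toolchain's documentation):

    // app.cpp -- hot numeric code that benefits from targeting the real hardware
    void axpy(float* y, const float* x, float a, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    // Typical invocations:
    //   g++ -O3 -march=skylake   app.cpp -c   // use everything Skylake offers;
    //                                         // the binary may not run on older CPUs
    //   g++ -O3 -march=x86-64-v2 app.cpp -c   // a conservative "minimum CPU" baseline
    //                                         // (supported by newer GCC/Clang)
    //   g++ -O3 -mtune=skylake   app.cpp -c   // tune scheduling for Skylake while
    //                                         // keeping the generic instruction set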
Mm-hmm.
Okay.
I want to interrupt this discussion for just a moment
to bring you a word from our sponsors.
Backtrace is a debugging platform
that improves software quality, reliability, and support
by bringing deep introspection and automation
throughout the software error lifecycle.
Spend less time debugging
and reduce your mean time to resolution
by using the first and only platform to combine symbolic debugging, error aggregation, and state analysis. At the time of error, Backtrace jumps into action, capturing detailed dumps of application and environmental state. Backtrace then performs automated analysis on process memory and executable code to classify errors and highlight important signals, such as heap corruption, malware, and much more. This data is aggregated and archived in a centralized object store,
providing your team a single system to investigate errors across your environments.
Join industry leaders like Fastly, Message Systems, and AppNexus
that use Backtrace to modernize their debugging infrastructure.
It's free to try, minutes to set up, fully featured with no commitment necessary.
Check them out at backtrace.io/cppcast.
So we've talked a lot about benchmarking.
Can you tell us a little bit about what types of tools you use for performance analysis?
Oh yeah, sure.
So I think the go-to tool for profiling is Linux perf. This tool is capable of doing most of the things an engineer needs to do performance analysis and optimize an application. Usually when Linux perf is not enough, I go for Intel VTune. It has a nice GUI and a little bit more capability than perf. But besides that, I also use the good old binutils: objdump, nm, the tools that we are all familiar with. The point here is that nothing stops you from doing performance analysis right now. Most of the tools are free and they are amazing, like perf, for example. It's just amazing what perf can do.
Yeah.
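As a starting point, a typical perf session on a small hot-loop program looks roughly like this; the command lines show the common record/report workflow, and options vary between perf versions:

    // hot.cpp -- something worth profiling
    #include <vector>

    int main()
    {
        std::vector<long> v(1 << 22, 1);
        long sum = 0;
        for (int pass = 0; pass < 200; ++pass)
            for (long x : v)
                sum += x;
        return static_cast<int>(sum & 1);
    }

    // Build with optimizations plus debug info so perf can map samples back to
    // source lines, then record and inspect:
    //   g++ -O2 -g hot.cpp -o hot
    //   perf record ./hot      # sample where the cycles go
    //   perf report            # interactive view of the hot functions
    //   perf annotate          # drill down to the hot assembly instructions
    //   perf stat ./hot        # counters: cycles, instructions, branches, ...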
You've mentioned Compiler Explorer. I have to ask, how often do you use Compiler Explorer to give yourself a quick snapshot comparing what your work has been across different compilers?
Right, correct. I actually was thinking about how I can make this work for me. But unfortunately, you know, we have a brand new compiler every day, so for me it's kind of impossible to integrate a new version into Compiler Explorer. I just haven't spent time on this. Maybe it's possible.
Yeah, if you're running it locally, it would actually be pretty easy to do. Well, I don't make a build every night, but I do just use whatever my nightly build is.
Right, but then you should integrate them all into Compiler Explorer. I mean, if you have a number of nightly compiler snapshots which were built, you should somehow integrate them and then keep the history. And then, what happens is, for example, most of our work is on remote machines, so I'm not developing on my laptop. We have a number of dedicated servers that we are just SSHing to, and then we go from there: build the compilers, build the benchmarks, run them. So I think it would be pretty hard to do.
Right. Yeah, it would be a lot of work. Probably possible. We'll see if Matt's listening and he decides you can add a wildcard for your compiler search or something.
Right. Yeah, so I'm curious: as you're optimizing this code and working all of these benchmarks, what kinds of things do you see as really hard to optimize? Like, what should C++ programmers stop doing so that they get better code out of our compilers?
Yeah, so...
Well, the most obvious thing is: do not try to do the compiler's work, because the compiler is probably better than you at this. So probably the general advice is: do not unroll loops yourself, the compiler will probably do it better. Do not try to inline things yourself, the compiler will probably still do it more aggressively than you can.
What about the inline keyword itself? Or always_inline? I assume you would say don't do always_inline, because the compiler knows better.
Yeah, yeah.
Okay.
Yeah, so speaking of the inline keyword, off the top of my head, I'm not sure if this keyword actually still makes sense. Well, obviously, the noinline attribute makes sense, because it prevents the compiler from inlining.
Yeah.
Okay. I remember there was a great post by Simon Brand who actually dived deep into specifically this problem: does the inline keyword still make sense? He has a great article on this topic.
Okay.
I'll look for that.
So, yeah.
So, don't try to do the compiler's work.
Yourself.
I like that
But then, it's almost a little depressing how you said something so simple, like adding a function, can push our code around and have implications that we never expected. What do we do about that? Do we worry about that at all?
Yeah, so actually we spent some effort trying to figure out how we can solve this problem, and we haven't, I mean, gotten to any good decision about this. Whenever we choose, for example, to align all the loops to this 32-byte boundary, let's say, some of our benchmarks go up, but some of the benchmarks still go down. So in the end, the performance is still kind of the same.
So this problem is still kind of unsolved.
And this problem, I should say, is probably the most harmful for us compiler developers, because it means you cannot rigorously test your optimization. Some of the benchmarks will go up, but some go down, and you want to know why they go down, but the reason is probably just the noise and the different placement, the different layout.
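The 32-byte alignment experiment corresponds to flags you can try yourself; a sketch using GCC-style spellings (recent Clang accepts similar flags, but verify against your toolchain):

    // loops.cpp -- the same source, different code placement depending on flags
    void touch(int* a, int n)
    {
        for (int i = 0; i < n; ++i)
            a[i] ^= 1;
    }

    // Force 32-byte alignment of loop headers and function entries, at the cost
    // of extra padding NOPs and a larger binary:
    //   g++ -O2 -falign-loops=32 -falign-functions=32 -c loops.cpp
    //
    // As Denis notes, across a large benchmark suite this tends to help some
    // benchmarks and hurt others, so it is not a universal fix.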
Yeah.
And there are actually more interesting problems, too. For example, let me now ask you a question: what do you think, does debug information affect performance? Like, okay, we pass the -g option to the compiler, and it will emit all the sections with debug information in the binary.
I feel like it must have some effect
because it's making the binary larger
just because of that reason.
Then we start thinking that the debug information just stays in the binary, and at runtime it's not even loaded into main memory, right? It just stays on the disk.
I mean, well,
probably, in an ideal world, it should
not affect performance, right?
That sounds like a trick question, honestly.
Right. So, actually, just saying upfront, I saw cases, and I actually worked on cases, where you pass the -g option and... okay, so you're building the same application. You pass the -g option and build binary A. Then you build the same application without debug information, so without passing -g, and you have binary B. Then you strip the debug information from binary A, dump the assembly from both, and compare them.
Uh-huh.
And it's different.
Yeah.
Okay.
It's not that different, let's say, but there is still some difference. So for me, it's still kind of magic. I mean, there is no way it should be different. But I definitely saw the cases, and it definitely affects performance.
So, I mean, in an ideal world, the answer to my question is no, the debug information should not affect performance. But we're not living in an ideal world, so the answer is maybe, or it depends.
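The experiment Denis describes is easy to repeat; a sketch of the comparison using typical GCC and binutils invocations (adjust for your own setup):

    // app.cpp -- any nontrivial program will do for the experiment
    int work(int x) { return x * x + 3; }
    int main() { return work(7) & 0xff; }

    // Build the same source with and without -g, then compare the generated
    // machine code. In an ideal world the disassembly diff is empty.
    //   g++ -O2 -g app.cpp -o app_debug
    //   g++ -O2    app.cpp -o app_plain
    //   objdump -d app_debug > debug.asm
    //   objdump -d app_plain > plain.asm
    //   diff debug.asm plain.asm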
Is that because the compiler has to emit extra things
for the debugger to be able to know,
okay, this does have a memory address or something like that?
Is that what's going on?
I'm not sure. I'm not an expert here.
It's also still an open question for me.
But I tend to think that probably it could be just a bug or maybe some heuristic.
For some particular optimization, there is some specific heuristic that sees that, for example,
if my function is that long, I will do this.
But if my function...
And actually, all the debug information is stored, for example in LLVM, with some specific function calls, like the LLVM debug intrinsics, and metadata. So maybe some optimization just takes this debug information into account when it should not. So I'm not sure what the real answer to this is. Maybe some real compiler experts who are listening to this episode can comment on this problem.
Yeah.
Yeah.
Yeah.
So, yeah, in general, performance analysis is quite a tricky thing to do. For example, once I was working on some regression, a 5% regression, okay? And what I immediately spotted was that the good case had 50% fewer instructions retired, or let me say, just for simplicity, executed. So the good case executed 50% fewer instructions, so it should definitely be better.
And when I started looking into this benchmark, I saw the pattern. In the good case, there was just one assembly instruction that was doing, for example, incrementing the value in some memory location. So in x86 assembly it would be like inc, increment, and then some memory location. In the bad case, this exact instruction was just split into three assembly instructions, because it's still a read-modify-write operation: we are first loading the value from memory, then incrementing it, and then storing it back, right? So it's still read-modify-write. In the bad case, it was the same instruction, but just unfused, let's say: an explicit load, explicitly incrementing some temporary register, and then storing the value back.
So, yeah, when I spotted this, it was like, well, it makes no sense: there are clearly 50% more instructions executed, this should be the reason for the performance regression. And then I actually went back and consulted with one of my colleagues, who told me that, well, it's still the same amount of work for the CPU to do. So then what I actually did, I just put these assembly instructions in a tight loop and benchmarked only this tiny loop, with just the one assembly instruction in the good case and the three assembly instructions in the bad case. And it showed exactly the same performance.
So the thing here, what I wanted to say, is that you can easily be fooled by, for example, just the number of instructions that were executed.
Right.
And it's not that obvious that it can cause the performance regressions or gains.
Yeah.
And in my case, the fact that the bad case executed 50% more instructions was not the reason for the performance regression.
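The two code shapes Denis compares look roughly like this for a simple increment of a memory location; the assembly is shown as comments and is an x86-64 sketch:

    // The same C++...
    void bump(long& counter) { ++counter; }

    // ...can legitimately compile to either of these x86-64 forms:
    //
    //   "good" case, one instruction:      "bad" case, three instructions:
    //     inc  qword ptr [rdi]               mov  rax, qword ptr [rdi]
    //                                        add  rax, 1
    //                                        mov  qword ptr [rdi], rax
    //
    // Both forms do one load, one add, and one store's worth of work for the
    // CPU back end, which is why the 3x difference in instruction count alone
    // did not explain the regression.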
Yeah.
So it's always tricky. You always need to be
prepared.
You always kind of need
to
dive deep. You always need to
know how the hardware works
and stuff.
Yeah. Yeah.
All right.
Okay.
Well, Denis, I'm definitely going to put a link to your blog in the show notes, but do you want to share it with listeners who might want to read up more about some of these performance tuning examples you have?
Oh, yeah, sure. So I think you will probably just share it in the show notes.
Yeah, I mean, the link.
Sure.
So actually, I also wrote a number of beginner-friendly articles, starting from basic things like: what is profiling? What is, for example, instructions retired? What is the reference clock? How can you collect other performance counters, what can you do with those counters, how do you read them, how do you collect them? And a little bit more advanced topics, like what other capabilities the performance monitoring unit has and how you can leverage them. So, what you can do with perf, for example.
Yeah.
Well, where can people find you online, aside from your blog?
Yeah, so I think the best place to find me is on Twitter. My Twitter handle is @dendibakh, that's D-E-N-D-I-B-A-K-H.
Yeah.
Okay, cool. Well, it's been great having you on the show today, Denis. Thanks.
It was a pleasure. Thank you very much, guys.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook
and follow CppCast on Twitter.
You can also follow me @robwirving and Jason @lefticus on Twitter.
We'd also like to thank all our patrons who helped support the show through
Patreon.
If you'd like to support us on Patreon,
you can do so at patreon.com slash cppcast.
And of course you can find all that info and the show notes on the podcast
website at cppcast.com.
Theme music for this episode was provided by podcastthemes.com.