CppCast - Performance Tuning
Episode Date: August 5, 2021

Rob and Matt are joined by Denis Bakhvalov. They first talk about building Minesweeper in C++ with SFML and a paper on throughput prediction on Intel microarchitectures. Then they talk to Denis about his blog, book, and video series focusing on C++ performance, and his vision of the future tooling and techniques for writing performant C++ code.

News:
- Making the Classic Minesweeper Game using C++ and SFML
- Hot Reload support for C++ Applications
- Spdlog 1.9.1 released: support for compile-time format string validation
- Accurate throughput prediction of basic blocks on recent Intel microarchitectures

Links:
- easyperf.net
- Performance Analysis and Tuning on Modern CPUs
- Performance Ninja Course
- LLVM+CGO Performance Workshop - Performance Tuning: Future Compiler Improvements
- Proebsting's Law

Sponsors:
- PVS-Studio Learns What strlen is All About
- PVS-Studio podcast transcripts
Transcript
Thank you. In this episode, we discuss throughput prediction on Intel microarchitectures.
Then we talk to Denis Bakhvalov.
Denis talks to us about performance tuning and his vision of the future of writing performant code.
Welcome to episode 311 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined today by my guest co-host, Matt Godbolt.
Matt, welcome to the show.
Thanks for having me. It's fun to be here.
Yeah, it's been a while since we've had you on as a guest. How's everything been?
Been pretty good, thank you. Yeah, some things have changed in my life, but all for the better
and surviving the COVID thing just per normal. And, you know, I found a new way to listen to podcasts rather than the commute now: I use it to listen to while gardening. So I'm all caught up on CppCast and other things, and pretty excited to be here as a co-host.
Very cool. Yeah, that's good. I should come up with more ways to listen to podcasts. I do it while, you know, cleaning around the house and stuff like that, but there are probably more things I could do while listening to podcasts at the same time. On that note, I think we have mentioned
before that you started your own podcast as well. Do you want to tell us a little bit about that?
I did. I mean, again, yeah, these are COVID standards, really. I started baking bread, making sourdough like everyone else did. I got a dog, which hopefully you can edit out any dog sounds in the background, but hopefully we won't have any. And I started a podcast. So I think that's all the standard things one does when one's locked in one's house. So, yeah: Two's Complement. We're on about episode 10-ish now, with a colleague at work, but it's just a sort of general programming podcast rather than anything specific to C++.
Very cool. Well, I'll make sure to put the link to that in the show notes, in case listeners want to go find it. Yeah, and at the top of every episode, I like to read a piece of feedback. This was a suggestion from the #include <C++> Discord, where we were getting some ideas for different topics and things to do for a podcast episode, and Antonio wrote: doing a crossover episode with other podcast hosts might be interesting, mostly thinking about the C++-related podcasts like cpp.chat, Two's Complement, and ADSP or NDR. It can be another occasion to talk and think about C++, even if it breaks the interview format. So we're going to do an interview today, but it is a bit of a crossover.
It is a bit of a crossover, right. I'd also add tlbh.it as another sort of tech-space podcast that I think some folks you've had on the show are involved with.
So, yeah.
Here we are, the crossover.
There are plenty of CppCast guests who have gone on and started their own thing over the past year, in addition to you.
But you were first.
You were the inspiration for us all.
Well, it's been nice to be doing this for all these years.
Yeah.
Okay. Well, we'd love to hear your thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at feedback@cppcast.com, and don't forget to leave us a review on iTunes or subscribe on YouTube.
Joining us today is Denis Bakhvalov. Denis is a compiler developer passionate about software performance. He currently works at Intel. Denis is the author of the book Performance Analysis and Tuning on Modern CPUs and the creator of the Performance Ninja online course. Denis is also a writer on the easyperf.net blog, a host of performance tuning challenges, and a regular on Twitter Spaces talks about software performance. Denis, welcome to the show.
Hi, hi guys. I'm glad to be here.
And yes, seriously, I just wanted to echo Matt: you guys are really legends, like Rob and Jason, being so consistent over the years. I actually started listening to CppCast in 2016, from episode number three, I guess. That requires a ton of discipline, so really, kudos to you guys. And yeah, thanks for inviting me; it's a great pleasure for me to be here.
Yeah, well, thank you so much. It's great to have you on. And I was just going to say, Matt, you were telling me before that you've met Denis before; is that right?
We've chatted over email, I think, over some of the things that Denis is going to talk about. So I'm super excited to, well, first of all, meet him "in person", but virtually, in quotes, and then hear what he's been up to and what he's done since we last spoke.
Awesome. Okay, well, Denis, we've got a couple of news articles to discuss. Feel free to comment on any of these, and then we'll start talking more about what you're up to.
Okay, sure.
Okay, so this first one is a YouTube video: Making the Classic Minesweeper Game using C++ and SFML. This one was really fun to watch. It's only about seven minutes long, but the author kind of goes through the whole process of creating the game and the bugs he ran into during it.
And yeah, it was very fun to watch, very entertaining.
It was a huge amount of fun.
I laughed myself silly on that.
But I have to feel like, as a sort of moral, if not actual, representative of Jason: there were an awful lot of best practices being broken in the source code, if you actually looked in the background and paused the video, or went and looked at the GitHub repo. I mean, it's a great format to get people interested, to show them how relatively straightforward it is to make a game like Minesweeper, and with all the nice effects and everything. But I would question the quality of the code a little bit; it could be sharpened up a bit. But it was a lot of fun, and I definitely encourage people to go and watch the video and have a laugh.
Yeah, definitely.
Okay.
Next thing we have is on the Visual C++ blog, and this is edit your C++ code while debugging
with hot reload in Visual Studio 2022.
So this is pretty cool looking.
The C++ edit and continue feature
has been a part of Visual Studio
for I think a pretty long time,
where when you're on a break point
and the debugger has stopped,
you can change some variables or insert some more code
and then resume and actually change the program as it runs. But now you can actually just change your code without being on a breakpoint, and then just hit this hot reload button, and it instantly changes your application. It looks pretty amazing.
I was going to say: as a compiler author, you must appreciate the sheer amount of magic that's going on behind the scenes to make that even feasible, right? I can't imagine what's going on.
Yeah, yeah, it looks very impressive. They must be doing a lot of stuff under the hood.
Yeah, definitely. It's probably, well, apart from Visual Studio the editor itself, which is great as a package, as an IDE, it's probably the one thing I miss, spending my time nowadays away from Windows: the ability to have edit and continue, which, as you say, has been around for 20-odd years now. I remember using it on the original Xbox as a kind of: oh gosh, we got to the place where the game crashed; hit the breakpoint; oh, is it this? Oh yeah, we can fix it and carry on and see if we get past it. All those kinds of things make a huge, huge deal when you've just been playing something for 15 minutes.
Yeah, very exciting. Yeah, definitely.
Okay, and then the next thing we have is just a new version update for the spdlog library. They're adding support for compile-time format string validation, which I think we talked about a few weeks ago, when format was updated and we went over a blog post about how this compile-time validation works. So now spdlog can take advantage of that as well, which is a really good change: you get compile errors if you put in something wrong.
Absolutely, especially as those log lines that are least likely to be hit are probably the ones that fire in panicked situations you're not expecting. And if they happen to have format string problems, the last time you want to find out is when they're actually executing. It's like: oh, we crashed. Why did you crash? Oh, because there was a bad situation, and while logging it, we threw an exception.
Right, right.
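To make that compile-time checking concrete, here is a minimal sketch, assuming spdlog 1.9+ compiled as C++20 so the bundled {fmt} library can validate format strings at compile time:

```cpp
// Minimal sketch of spdlog's compile-time format string validation.
// Assumes spdlog 1.9+ built as C++20, where log calls take a checked
// format string type, so mismatches are rejected by the compiler.
#include <spdlog/spdlog.h>

int main() {
    spdlog::info("processed {} items in {:.2f} ms", 42, 3.5);  // OK
    // spdlog::info("processed {:d} items", "forty-two");      // would not compile:
    // "{:d}" demands an integer, but a string literal is passed
}
```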
Okay, and then, Matt, I'm going to ask you to go over this last article, because you suggested this paper and it is very large; I read through the abstract, but I could not read the whole thing.
It's very much my favorite thing to happen this year, and certainly, given the confluence of being invited to co-host this and knowing that Denis was going to be here, it seemed too rude not to bring it up as something that just came out. It's a paper that describes the new state of the art in simulating the throughput of Intel x86 processors, including all of the really unusual internals that have to be modeled in order to get decent performance figures out. And there has been an awful lot of papers and work on essentially reverse-engineering what's going on inside the chip over the last few years.
Agner Fog and some of the other folks who are listed in this paper have done some things before, but this is within a cycle or two of real-world measured performance for all the different flavors of chip. So we're close to a grand unified theory of what's really going on inside the silicon, which Denis won't be able to tell us about, for various reasons, I'm sure. But it's super exciting to me as somebody who works in performance in my day job, and it's just super cool to know and understand how complicated these CPUs are inside, and how many clever tricks they're pulling on our behalf to make our code go fast. So even just reading the abstract and having an idea of it is cool, I think.
Yeah, this is actually a great paper. So I just went through it really quickly; I spent maybe 15 minutes or so. But it seems to be much, much more precise than everything that we had before, like tools like IACA and LLVM MCA.
And then, the problem is that, first of all, it's really hard to statically predict performance, right? I mean, it's maybe even impossible to predict it perfectly; to some extent, that's really hard to do. And the tools that existed before the new tool they created in their paper largely discarded the effects of various non-deterministic CPU units. For example, any front-end effects were essentially discarded by those tools. What do I mean by that? For example, the alignment of the code was not in the model. Also, those tools always assume that everything is in the cache, so you never hit cache misses, and all the branches are predicted, pretty much 100%. Well, that's obviously not always the case, right?
So, okay, going back to the paper: I think their work is based on the work that they did previously on uops.info. So they somehow leveraged these giant tables, which actually seem to be pretty accurate in terms of latency and throughput of individual assembly instructions.
They've got it down to kind of a fine art: how to measure one instruction and see all the effects that it has, deal with the dependency chains, and all these clever things that they talk about. There's a micro-benchmarking library, nanoBench, that they talk about.
But yeah, it's very exciting. And for me, as you say, it replaces things like IACA, which hasn't really been supported for four or five years. And as you say, LLVM MCA is brilliant, but it's driven by LLVM's interpretation of what the CPU does, which is not necessarily very accurate, and having something more accurate can only make that better. And given that the model LLVM MCA uses is also what Clang uses to do the scheduling of the emitted code, it kind of will actually work its way backwards into the compiler eventually, and hopefully we'll get faster code, which is, you know, nice. Everyone loves faster code for free, right? Yeah.
All right. Well, Denis, I mentioned while reading your bio that you have a book, a blog, and an online video course where you focus on performance, and I just wanted to start off by asking how you got to have this focus on performance in C++.
Well, let's see. I mean, I was always interested in performance, in making things faster. I remember those funny days when I first tried Intel VTune. I think at that time it required that you instrument your code; it was back in maybe 2009. So you first had to wait maybe half an hour or an hour, depending on your application, while VTune instrumented your code, and then it would actually run and show you the profile. So those were the funny days, and I must say that at the time I had no idea what it was doing, how it was working, and so on and so forth. But when it showed you the list of hot functions and hot places, well, it was super cool and interesting. So, I don't know. And then, fast forward maybe six or seven years, I joined Intel. You could say it was a natural step for me. So that's how I really started working on performance on a daily basis, more or less. Yeah.
So that's roughly the story.
And then, if you're wondering about the book and how I got to that: this also came to me, let's say, organically. I started my blog roughly at the time when I joined Intel; this was around 2016, 2017 maybe. And I began documenting things that I learned about performance, about the work that I was doing at Intel, like this whole profiling thing, and optimizing code, and stuff like that. And then, well, I collected some knowledge, let's say, on my blog, and then a couple of folks reached out to me, like: hey, you have this good information on your blog, but it's scattered; it's not systemized. Maybe you should write a book. And this is actually how my book was born, I would say. Okay. And then the online course, I think, is the same thing: my book simply lacked the practice piece of it. So I decided to go beyond that; maybe I could create some place where folks can practice optimizing and tuning their code. And this is how Performance Ninja came to life.
Could you tell us a little bit more about that? I mean, do you find that people need to practice it? Is that what made you move to make a site to be able to practice these things? I know from my own experience that until you run that profiler, your intuition is 100% wrong every time. Even when you've accounted for the fact that you know your intuition is wrong, you're like: I know where it's going to be. And you run the profiler and it's like: no, it's in strcpy. What? I didn't even call strcpy! You know, those kinds of things. So is that the kind of thing that led you to go: no, people actually have to put this into practice themselves? And maybe tell us what Performance Ninja is.
Yeah, sort of, sort of, yeah. I mean, I actually think that performance optimization and tuning may become even more important than it was during the past three or four decades. And I think there are a couple of reasons for that. So, first of all, I think it is driven by the current technology. Like, if you were to ask me whether Moore's law is dead, well, there's no easy answer. But I think if you look at this question from the software perspective, from the software vendors' perspective, then I think the answer is yes. And what I mean by that is that software vendors now have to spend more resources on optimizing their code, because hardware does not provide major performance boosts anymore, unfortunately.
I mean, those transistors are going into things that aren't making things faster; they're just giving us more capabilities, or they're making caches bigger, things that aren't necessarily directly related to performance in the same way they used to be, say, 20 years ago, when you could guarantee: I'm going to get one more gigahertz in four years' time.
Right.
Right.
Yeah.
So I think, you know, I'm not saying the performance improvements are completely gone, but look at single-threaded performance; this is still very much important. Not only multi-threaded performance, but single-threaded performance is also very important for many client applications that run on your laptop. So that's the first thing. And then the second thing is that previously, in the so-called PC era, the companies that sell software, their software would usually run on the clients' machines. So software vendors did not pay the electricity bills for their customers, right? Right, right. But now the situation is changing. Everything moves to the cloud, and software vendors now pay their bills, their so-called cloud bills, themselves.
Right.
So suddenly it's their problem now.
And so suddenly they were motivated to fix it.
Whereas before they were sort of distributing it by, well, I have a million customers.
I've got a million CPUs.
Now they're like, no, I'm paying for the CPUs.
Right.
Right.
And I think that those are two fundamental reasons why we'll see more people start focusing more on performance.
And I think that those people
that program in Python and other languages,
they may be affected even more than us.
But I mean, well, we'll see.
You talked about tuning a bit, and I just wanted to go into the tooling a little bit more, the tooling that you focus on when you're talking about performance tuning and teaching performance tuning in the Performance Ninja class. Can you tell us more about your experience with tooling, what you like to use, and what you get out of it?
What I like to use? Let's see. Well, I mean, Intel VTune is my sort of go-to tool. But I actually think that more frequently I prefer just running Linux perf, because it is installed on every Linux distribution that I end up touching. So, for me, it's just simpler and faster to run Linux perf than to unleash the whole power of Intel VTune, let's say, if you will. But for the most part, Linux perf is also a great tool. It supports most of the things that Intel VTune can; it's just that it lacks a graphical interface, right? And sometimes that's actually hard to replace. In VTune, you have this timeline where you can zoom in and filter on a particular interval of when the program was executing. That's a really cool feature that you unfortunately cannot get with Linux perf. But for the most part, Linux perf is a great tool as well.
I was going to ask, actually, what kind of things do you do?
So maybe just for those who aren't as familiar with what Perf is,
could you explain the kind of things that both VTune and Perf are doing
behind the scenes and what kind of information you can glean from them?
Sure.
Yeah, so Linux perf is actually a basic performance profiler, let's say. What it does is really simple, and the best way to explain it is that everyone is familiar with the debugger, right? The debugger is essentially the simplest performance profiler, if you will; you can view it like that. If you run your program under a debugger, and you interrupt it once per, let's say, 10 seconds, and you record the place where you stopped, and if you repeat this process, say, a thousand or ten thousand times, then you will collect ten thousand samples. And if you build a histogram of those, you will be able to see where the program was interrupted the most. That will be your hottest place. And this is an oversimplified explanation of what profilers do under the hood: they are essentially capable of interrupting your program a high number of times during a small period of time, and that's how they tell you where the hottest places in your program are. That's, I think, the simplest explanation.
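As a toy illustration of that sampling idea, here is a sketch (Linux x86-64 only, deliberately simplified; real profilers like perf and VTune use hardware counters, and touching a std::map from a signal handler is acceptable only in a toy like this):

```cpp
// Toy sampling profiler: interrupt the program on a timer, record the
// instruction pointer where it stopped, and build a histogram.
#include <csignal>
#include <cstdio>
#include <map>
#include <sys/time.h>
#include <ucontext.h>

static std::map<void*, long> g_histogram;  // instruction pointer -> samples

static void on_sample(int, siginfo_t*, void* ctx) {
    auto* uc = static_cast<ucontext_t*>(ctx);
    void* ip = reinterpret_cast<void*>(uc->uc_mcontext.gregs[REG_RIP]);
    ++g_histogram[ip];  // not async-signal-safe; fine only for a toy
}

int main() {
    struct sigaction sa = {};
    sa.sa_sigaction = on_sample;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, nullptr);

    itimerval tv = {};
    tv.it_interval.tv_usec = tv.it_value.tv_usec = 1000;  // sample every 1 ms of CPU time
    setitimer(ITIMER_PROF, &tv, nullptr);

    volatile double x = 0;
    for (long i = 0; i < 500000000; ++i) x += i * 0.5;  // the "hot" work

    // The addresses with the most samples are the hottest places.
    for (const auto& [ip, n] : g_histogram)
        std::printf("%p: %ld samples\n", ip, n);
}
```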
Right, right. And I know from personal experience that it's not just time that you can interrupt on: rather than every so many milliseconds, you can say, every 10,000th cache miss, tell me what the heck was causing the cache miss. And again, you can sample and group those. So that's the kind of useful information you can glean that is very hard to get in any other way. Nothing else can tell you: where am I missing the cache the most? And you're like: oh yeah, I'm walking a linked list here. Well, of course. Okay, maybe I shouldn't do that. That kind of thing, right?
Yep, exactly, yeah.
So then VTune, on top of that, is a more graphical way of getting access to the same kind of information?
That's correct, yes. So, Matt, as you described, this is actually a little bit more advanced usage of the profiler: when you really want to know exactly which line of code, which assembly instruction, misses in the cache a lot, or, for example, which branch mispredicts. That's where you sample not on cycles, but on some other event, like cache misses, for example. So this is a more advanced usage, and actually, I wouldn't say I suggest this workflow, if you will. Let me first explain why.
It requires you to have knowledge about all the different performance counters that are available. But there are hundreds of them, and they change with every CPU generation, so you cannot, let's say, keep up. Well, I mean, of course, you can study the documentation and learn what they're doing, what they're measuring, but I wouldn't recommend that. There is, I think, a better approach here.
There is a methodology called top-down analysis. It allows you to characterize your workload from the CPU perspective: where the bottleneck in your application is, from the CPU microarchitecture perspective. For example, there are four major categories: retiring, front-end bound, back-end bound, and bad speculation. Okay, we will not dig into that, but the main point here is that it abstracts away the knowledge about performance counters. It collects all the needed counters automatically; you don't have to care about the specific meaning of any of them. You just run the tool, and it will show you where exactly your bottleneck is: for example, you miss in, let's say, the L2 cache, or you go all the way down to main memory. And only after you figure out what type of bottleneck your application has do you locate the exact line of code where this problem happens.
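As a rough, hypothetical illustration of those categories (not from the episode), here are two toy kernels that a top-down tool would typically classify very differently:

```cpp
// Two toy kernels that top-down analysis would typically classify differently:
// the pointer chase is usually back-end (memory) bound, while the arithmetic
// loop mostly retires useful work. Illustration only.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t n = 1 << 22;              // ~32 MB of indices, larger than L2
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});

    std::size_t i = 0;
    for (std::size_t step = 0; step < n; ++step)
        i = next[i];                            // dependent random loads: memory bound

    double sum = 0;
    for (std::size_t k = 0; k < n; ++k)
        sum += static_cast<double>(k) * 0.5;    // simple math: mostly "retiring"

    std::printf("%zu %f\n", i, sum);            // keep the results observable
}
```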
So this is a really powerful technique. I use this technique myself; it's pretty much the first thing that I do, actually.
Right, so it gives you an overview before you go down the rabbit hole of: which of the 150,000 different counters should I be measuring? It's like: no, the first thing you need to know is whether the problem is cache misses or the problem is bad speculation, and then you can dig and dig and dig down. And it gives you a very high-level view, relatively speaking, of the kind of performance problems you've got.
Right. Yeah, yeah.
Okay.
The sponsor of this episode of CppCast is the PVS-Studio team. They develop the PVS-Studio static code analyzer. The tool helps find typos, mistakes, and potential vulnerabilities in code. The earlier you find an error, the cheaper it is to fix, and the more reliable your product releases are. PVS-Studio is always evolving in two directions. The first is new diagnostics. The second is basic mechanics, for example, data flow analysis. The "PVS-Studio Learns What strlen is All About" article describes one of these enhancements.
You can find the link in the podcast description.
Such articles allow you to take a peek into the world of static code analysis.
You were talking earlier about what you think is going to be driving performance
in the near future.
Where do you see us going with
focusing on performance?
What can C++ developers do to
better handle the way programming is going to change as we get new hardware, new software, new compilers?
That's a complicated one. So, what can drive further improvements, right? Well, I would actually like to approach it from this perspective: there are opportunities and there are challenges. So maybe let's start with the opportunities first. For example, what can drive improvements; what we as software developers can do. So, first of all, better compilers, I mean, in the future. I actually gave a talk at the LLVM performance workshop in February this year, and I tried to answer the question: what will drive the future improvements in the compiler? I was caught by this question, so what I actually did was ask a whole bunch of compiler experts what their view on this question is, and I just summarized all that wisdom.
And so there is this Proebsting's Law, which says that advances in compiler technology make your software run faster, doubling the speed of your code every 18 years. So it means that the performance improvement of the code that is generated by the compiler doubles every 18 years. Of course, it doesn't hold exactly, for various reasons. So what it means is that if you take the Clang compiler now and, for example, the Clang compiler nine years ago, you can expect roughly 50% faster code, right? Just from the compiler, only from the compiler, if you run it on the same hardware, the same operating system, and so on and so forth, on the same platform, but with just a newer compiler.
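Taken literally, doubling every 18 years compounds geometrically, so nine years of compiler advances would buy roughly 41 percent, which rounds to the 50 percent figure mentioned here:

$$\mathrm{speedup}(t) = 2^{t/18\ \mathrm{years}}, \qquad \mathrm{speedup}(9\ \mathrm{years}) = 2^{1/2} \approx 1.41$$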
Some people actually think that it's now not 18 years, but something closer to 50 years. So, you know, that's a really pessimistic view, I would say. And then, going back to how we can make compilers generate faster code: there are actually multiple directions. The major directions are, first, machine learning. Compilers are full of heuristics and cost models, so we can replace those with machine learning models. So that's first. Then the second is that there is a lot of work going on with synthesizers and superoptimizers. Compilers sometimes miss some peephole optimizations and are not able to generate the optimal assembly sequence. For example, if we're talking about SIMD instructions, it's sometimes hard to find the best pattern of assembly instructions, with all those crazy shuffles and so on and so forth. And that's the work that John Regehr and Geoff Langdale and some other people are doing.
And oh, yeah, what I actually wanted to say is the conclusion. One of the questions that I was also asking all those experts is: what is our current headroom in the existing LLVM optimization passes? And I think you will be surprised, but most of the people think that we are chasing the tail now.
Oh, really?
Yeah, and the headroom for us is probably around 1%.
And just so I understand: you're saying that with the existing infrastructure, with the existing set of things that we know, there are only a few percentage points left without something else changing to make it go faster. Am I understanding you right?
I think that's... yeah. What I mean by that is that if you now go and try to polish all the existing LLVM optimization passes, you are not expected to gain more than, I would say, one percent on average.
Wow.
So, I mean, that's again an unfortunate and pessimistic view. But yeah.
So what do you think needs to change? Where are we going to find the next explosion of performance? Does the language need to change? Does the compiler infrastructure? Is there something we're missing somewhere in the line between the programmer writing code and optimal code being produced? Is there something that we haven't thought about? Where is it going to come from?
Right. Well, I mean, what I actually see, and some of the folks also think, is that since tooling can look at the whole program, at its syntax tree, and so on and so forth, we can build better tools that will help developers improve the performance of their code more effectively and efficiently.
When you say tools, you mean tools like VTune and Linux perf?
Right, yeah. And then, actually, one of the tools that I'm personally dreaming about is something like this.
So, imagine you are writing, let's say... for example, imagine that you are writing a matrix multiplication, right? I mean, that's a tricky one, but at the same time, it's not super easy to write super effective code, super optimal code, for every hardware platform. But okay, what I would like to see is, for example, some recommendation system that will detect that I'm writing matrix multiplication code and will say: hey, I think you're writing a matrix multiply.
You're saying that Clippy will pop up and tell me: it looks like you're writing a matrix multiply; have you considered using the Intel library to do this instead?
For example, yeah. Or, for example... okay, so let's say that there is no library function, no library implementation, for the thing that I'm writing. So this tool will tell me: hey, you're writing this algorithm; I have a version that is optimized for your hardware, with exactly the same semantic meaning, and here is the diff. Do you want to apply it?
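As a hypothetical sketch of the kind of same-semantics diff such a tool might propose for matrix multiply (loop interchange, which also happens to be one of the Performance Ninja lab topics mentioned later):

```cpp
// Sketch of a rewrite such a recommendation tool might propose. The naive
// i-j-k order strides down the columns of b, which is cache-unfriendly; the
// interchanged i-k-j order walks both b and c row by row. Same semantics,
// often several times faster. Assumes square matrices and a zero-initialized
// c; a hypothetical example, not any real tool's output.
#include <vector>

using Matrix = std::vector<std::vector<double>>;

void matmul_naive(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i][j] += a[i][k] * b[k][j];  // b accessed column-wise
}

void matmul_interchanged(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i][k];
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] += aik * b[k][j];      // b and c accessed row-wise
        }
}
```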
I mean, this would be great, but it requires, let's say, a database of golden code written by performance ninjas that you can effectively just drop in as a replacement in your code.
So this sounds sort of like GitHub Copilot. I was just going to ask: what were your thoughts on GitHub Copilot? I mean, I know that's not necessarily targeted at improving performance, more just improving programmer productivity, but it's kind of along the same lines as what you're talking about.
Right, yeah.
So this, I think, would be great. And in order to get there, we need to solve the code similarity problem, right? I mean, how can we detect whether two pieces of code have the same semantic meaning?
Right, and do they differ in ways that are intentional or accidental? Like: hey, you start at one and go to n plus one, versus naught to n, but they're near enough the same. Or is that difference important for some other reason? That's a tough problem to crack. And there's also not annoying the programmer: I know from my own personal experience that if a Clippy-like thing popped up every time, I don't know what I'd do, but it wouldn't be very pleasant for my computer. But it's interesting that you think the next wave of improvements will come from tooling, which will help programmers write things more optimally. Is that a fair characterization of what you said?
I think yes. I think yes. Because some of the transformations that can improve the performance of the code are just so hard to do from the compiler's perspective. For example, an AoS-to-SoA transformation: arrays of structures to structures of arrays. That's something, I think, that's not easy to do at the compiler level. You need to have the whole view of the program; you need to know whether someone has references into the middle of this data structure; so you need to do really complicated analysis to make that happen. This is something that can be done much more easily at the software level, by the developer himself or herself. And so if we give developers better tooling, and we are able to detect that such a transformation can be made, or should be made, then I think this will be a much more effective way to spend our time and resources as compiler developers, as tooling developers.
That makes a lot of sense to me, actually, now that you've put it in those terms.
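A minimal sketch of the transformation Denis is describing (a hypothetical example; if a hot loop only reads one field, the SoA layout streams exactly the data it needs):

```cpp
// Sketch of the AoS-to-SoA transformation. With AoS, each iteration drags a
// whole Particle through the cache to read one field; with SoA, the same loop
// reads a dense array of x values, and it vectorizes easily.
#include <vector>

// Array of Structures: each iteration loads x plus the unused y, z, mass.
struct Particle { double x, y, z, mass; };

double sum_x_aos(const std::vector<Particle>& ps) {
    double s = 0;
    for (const Particle& p : ps) s += p.x;  // 8 useful bytes out of every 32
    return s;
}

// Structure of Arrays: the same loop now reads a dense array of x values.
struct Particles { std::vector<double> x, y, z, mass; };

double sum_x_soa(const Particles& ps) {
    double s = 0;
    for (double v : ps.x) s += v;           // every fetched byte is used
    return s;
}
```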
I mean, the thing that I've always wondered is whether it would be possible for compilers to reorder the private members of class structures, to move things so that they pack better, or are more cache-effective, or whatever. But obviously there's definitely a point where you can't do that without there being changes to the behavior of the program which are visible but not important. The compiler can't make that distinction; the programmer can. And so those kinds of things are really... yeah, that's fascinating. It sounds like a sort of different thing: as you say, it's a tool, separate from the compiler. You kind of fire up your program in it, maybe get some PGO-style samples in, and say: okay, recommendation system, tell me what the heck I should do here. And it'd be like: yeah, have you considered moving these things closer together? That kind of thing, right? That's exciting. What an exciting world we live in.
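A small sketch of the layout point Matt raises: the compiler keeps data members in declaration order, but a programmer, or a tool, is free to reorder them (sizes assume a typical 64-bit ABI with 8-byte-aligned doubles):

```cpp
// Sketch of the member-reordering idea: reordering members by hand can
// eliminate padding that the compiler is not allowed to remove itself.
#include <cstdint>

struct Padded {       // laid out exactly as declared:
    std::uint8_t a;   // 1 byte + 7 bytes of padding (to align d)
    double       d;   // 8 bytes
    std::uint8_t b;   // 1 byte + 7 bytes of tail padding
};                    // sizeof(Padded) == 24 on most 64-bit platforms

struct Repacked {     // same members, reordered by hand:
    double       d;   // 8 bytes
    std::uint8_t a;   // 1 byte
    std::uint8_t b;   // 1 byte + 6 bytes of tail padding
};                    // sizeof(Repacked) == 16: one third less cache traffic

static_assert(sizeof(Repacked) < sizeof(Padded), "repacking should shrink it");
```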
i mean i guess short of having like tooling that can do that i guess the only thing we can do is
have better education right for the developers better Better education. Yeah, yeah. Well, I mean,
that's the mission that I'm on, but
yeah,
at the same time,
I don't know. I mean,
I'm not sure if we would want
to be in a world full of
performance ninjas.
I mean,
we don't want to
make this as a requirement, right?
So that's why I think we need to have better tools
and better compilers maybe.
Because, yeah, I mean, performance is hard.
Performance, yeah.
So it's a huge topic. I would say that I think it's easier, and more forward-looking, if we focus on better tools and developer productivity, and allow developers to have more insight into the performance of their code.
So, something that I always think about when I think about performance is that sometimes performance is genuinely a feature.
It's absolutely important that this critical thing is as fast as possible. I mean, I spent a decade working in the trading industry, where that kind of thing is kind of why they're interested in me. But there's always an antagonistic... sorry, I believe there to be an antagonism between performance and readable, testable, lovely, beautiful code, right? And I've spent the last five or six years trying to argue that the compiler mostly lets you do the left-hand thing of writing nice, readable code while still generating really optimal code on the right-hand side, so you get the best of all worlds. But with the kinds of things that you're talking about here, maybe I'm going to have to change my tune in the next few years, because we're going to reach the point where the compiler can't help you anymore. We've done everything we possibly can; we've inlined everything we could inline; we've pulled everything out. The last thing you're going to have to do now is take your lovely class, that looks lovely and is modeling a real-world object, and split it into three little classes that sit in three different arrays somewhere, which makes it a very different proposition. To what extent do you think that's true? Is that the case, do you think, or am I missing something here?
is that the case do you think or or am i missing something here is it um well no no i think i think
uh you know where you can have the best of those two worlds.
And I think that you only need to go for ugly, let's say,
performance optimization tricks in those places where it's only needed.
So I think there is no contradicts here.
So yeah, I mean, you, you just first, you know,
profile the code and you see where, where the,
where are your shortcomings and then you fix those.
And then, I mean, you know, sometimes even, you know,
even, even the simplest and cleanest code might be,
you know, not the best code.
Like for example, if you take a look at the,
at the std max and min functions, right?
I mean sometimes, you know, you know, I mean in and the compiler will generate will
will likely generate the branches for you
like if it will say like if if a is less than B then then then I choose B
Otherwise, I will choose a so so compiler will will you know will usually generate branches
But what if what if you know this branch mispredicts?
Then you better go for a C move instead of the branch.
Well, I mean, okay, anyway, so it's too much detail here.
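For illustration, a sketch of the two shapes being discussed; either function may be compiled to a branch or a cmov, and which one wins depends on how predictable the comparison is, so always measure:

```cpp
// Semantically, both are std::max for int. Compilers may emit a conditional
// branch or a cmov for either; the ternary form often nudges the optimizer
// toward cmov on x86-64. cmov tends to win when the comparison is
// unpredictable, and loses when the branch predictor would almost always
// guess right.
int max_branchy(int a, int b) {
    if (a < b) return b;   // typically compare + conditional jump
    return a;
}

int max_branchless(int a, int b) {
    return a < b ? b : a;  // often compiled to cmov
}
```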
I think these are important distinctions, right? These are the kinds of things where there are trade-offs that I think compilers don't necessarily know how to make. Even experts: I've seen Chandler Carruth do a live presentation at CppCon one time, and he was like: I don't know why this is happening. It was generating a cmov, and I kept going: don't generate the cmov, generate the branch, for God's sake. And he's fiddling around with the code, and eventually he profiles it and goes: no, the cmov was actually... sorry, the branch was the right thing to do in this case, because it was absolutely predictable, or whatever it was; I don't remember the specifics. But these things are hard, and they may be data-dependent. So, yeah, you were saying that std::max might be generating branches or not, and it might not be something that you want to do in all cases. And I'm sorry, I interrupted you, so please carry on.
Yeah, yeah, no, I mean, that's exactly what I meant. But again, as I said, I think that you only need to sacrifice the readability of your code in those places where it really gives you a benefit.
Right; you need to have the four-line comment above it to justify the one small weird little thing you did that wasn't obvious, because that's where you're paying for it. You have to write the apology comment that says: no, we really do need to do this funny thing here, because that's what the code needs, doesn't it?
And then, for me, it's like: and then write a benchmarking test, if you can, to try to make sure that nobody accidentally undoes that, or that a compiler revision doesn't come along and make it so that it's no longer true, or whatever it is you're relying on. You know, try to put a test around it.
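A sketch of that guard-rail idea, using Google Benchmark (the framework choice is an assumption; the episode doesn't name one):

```cpp
// Guard-rail benchmark: if a refactor or compiler upgrade regresses the
// hand-tuned hot path, this makes the regression visible in CI or locally.
#include <benchmark/benchmark.h>
#include <algorithm>
#include <random>
#include <vector>

static void BM_HotPath(benchmark::State& state) {
    std::vector<int> data(1 << 16);
    std::mt19937 rng(42);
    std::generate(data.begin(), data.end(),
                  [&] { return static_cast<int>(rng() >> 1); });
    for (auto _ : state) {
        int m = 0;
        for (int v : data) m = std::max(m, v);   // the code we hand-tuned
        benchmark::DoNotOptimize(m);             // keep the result observable
    }
}
BENCHMARK(BM_HotPath);
BENCHMARK_MAIN();
```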
Sure.
Yeah, this will be great.
Yeah.
So your idea about, you know, tooling that could suggest performance improvement techniques to developers writing code,
do you know if anyone's working on that sort of thing?
I think yes.
And I think actually it's an area of research called machine programming.
Okay.
Yeah, and I actually wrote a blog post about it, I think maybe last year; it's on easyperf.net, on my blog. So that's, let's say, a high-level vision: that someday, maybe, the machines will be able to program themselves. But hopefully after I've retired.
Yeah, don't put us all out of a job, please. Well, I was just interested: we've talked about, only in passing, some of the things that are in your online course. Is there something you can tell us, a little teaser you can give us? Is it a free course? How does the course work for performance training?
I would love to. Yeah, sure.
So this is a free online course. It is also self-paced, meaning that you can come and work on it whenever you have time. The idea is this: we actually built a set of lab assignments. At the moment, we only have seven, I guess. Those lab assignments each focus on a specific performance problem. They are small, minimalistic code samples that exhibit some performance problem that you are required to go and fix.
Right.
So like mini puzzles almost that you have to solve
using the tools and the techniques
that you're teaching people in the course.
Yeah, sort of.
So at the moment we have
lab assignments on
vectorization,
on function inlining,
on loop interchange,
on data packing
and I think compiler intrinsics.
And maybe I missed some.
Yeah, but the point here is this: so this is an online course, right? I recorded some videos. The idea is that you first go and watch the introductory video, where I give an introduction to the specific performance problem. Then you go and try to fix the code yourself. It can take from half an hour to, like, four hours per lab assignment, depending on your background and the level of complexity of the lab assignment itself. And then, after you fix it, you can actually submit it to GitHub. The course is on GitHub, it's free, and there is automated verification and benchmarking attached to it.
Sweet.
Right, yeah. So when you submit your code to GitHub, it will be automatically picked up, and it will be benchmarked. And I want to point out that this is actually good performance benchmarking. I mean, it's not benchmarking your program in some virtualized environment on ARM or any other low-end CPU. I actually offloaded all the benchmarking onto my own Linux box here at my home.
You're very brave.
I'm very brave, yeah. I created an image so that if anyone hacked me, I would just reformat everything, and in half an hour I would have a clean system running again. So I am prepared for that.
But the key here is that CI systems generally, be they GitHub's or anything else you can find out there, are multi-tenanted and virtualized. It's very hard to do performance analysis work when you're on a noisy machine, full stop, even if it's your own dedicated desktop under your desk: it's hard to make sure you're not actually also monitoring Slack or whatever other thing is causing your cache to go wrong. So you've got a system where you take the user's code and run it in a very, very controlled environment, away from virtualization, and then you can give them pretty good results about whether they've made it better or worse. That's very cool. That's very cool.
Yeah. And then, in the end, there is a summary video where I explain how it should be done, how it can be fixed, how you can measure, and so on and so forth.
Cool. Yeah, very cool. Sounds great; we'll have to check it out. Okay, well, Denis, it's been great having you on the show today. Obviously, we talked about the book, the blog, and the video course, and we'll put links to all those in the show notes. Is there anything else you want to tell our listeners about before we let you go?
Well, I think no; we covered a lot of fun topics. Yeah, it was great to be here. Thanks for inviting me.
Thank you so much for coming on today.
Sure.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
We'd also like to thank all our patrons who help support the show through Patreon.
If you'd like to support us on Patreon, you can do so at patreon.com slash cppcast.
And of course, you can find all that info and the show notes on
the podcast website at cppcast.com. Theme music for this episode was provided by podcastthemes.com.