CppCast - Performance Tuning
Episode Date: August 5, 2021

Rob and Matt are joined by Denis Bakhvalov. They first talk about building Minesweeper in C++ with SFML and a paper on throughput prediction on Intel microarchitectures. Then they talk to Denis about his blog, book, and video series focusing on C++ performance, and his vision of the future tooling and techniques for writing performant C++ code.

News:
- Making the Classic Minesweeper Game using C++ and SFML
- Hot Reload support for C++ Applications
- Spdlog 1.9.1 released: support for compile-time format string validation
- Accurate throughput prediction of basic blocks on recent Intel microarchitectures

Links:
- easyperf.net
- Performance Analysis and Tuning on Modern CPUs
- Performance Ninja Course
- LLVM+CGO Performance Workshop - Performance Tuning: Future Compiler Improvements
- Proebsting's Law

Sponsors:
- PVS-Studio Learns What strlen is All About
- PVS-Studio podcast transcripts
Transcript
Thank you. In this episode, we discuss throughput prediction on Intel microarchitectures.
Then we talk to Denis Bakhvalov.
Denis talks to us about performance tuning and his vision of the future of writing performant code.
Welcome to episode 311 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined today by my guest co-host, Matt Godbolt.
Matt, welcome to the show.
Thanks for having me. It's fun to be here.
Yeah, it's been a while since we've had you on as a guest. How's everything been?
Been pretty good, thank you. Yeah, some things have changed in my life, but all for the better
and surviving the COVID thing just per normal. And, you know, I found a new way to listen to podcasts rather than the commute now: I use it to listen to while gardening. So I'm all caught up on CppCast and other things, and pretty excited to be here as a co-host.
Very cool. Yeah, that's good. I should come up with more ways to listen to podcasts. I do it while, you know, cleaning around the house and stuff like that, but there are probably more things I could do while listening to podcasts at the same time. On that note, I think we have mentioned
before that you started your own podcast as well. Do you want to tell us a little bit about that?
I did. I mean, again, yeah, these are COVID standards, really. I started baking bread, making sourdough like everyone else did. I got a dog, which hopefully you can edit out any dog sounds in the background, but hopefully we won't have any. And I started a podcast. So I think that's all the standard things one does when one's locked in one's house. So, yeah: Two's Complement. We're on about episode 10-ish now, with a colleague at work, but it's just a sort of general programming podcast rather than anything specific to C++.
Very cool. Well, I'll make sure to put the link to that in the show notes, in case listeners want to go find it. Yeah, and at the top of every episode, I like to read a piece of feedback. This was a suggestion from the #include <C++> Discord, where we were getting some ideas for different topics and things to do for a podcast episode, and Antonio wrote: doing a crossover episode with other podcast hosts might be interesting, mostly thinking about the C++-related podcasts like cpp.chat, Two's Complement, and ADSP or NDR. It can be another occasion to talk and think about C++, even if it breaks the interview format. So we're going to do an interview today, but it is a bit of a crossover.
It is a bit of a crossover, right. I'd also add tlbh.it as another sort of tech-space podcast that I think some folks you've had on the show are involved with.
So, yeah.
Here we are, the crossover.
There are plenty of CppCast guests who have gone on and started their own thing over the past year, in addition to you.
But you were first.
You were the inspiration for us all.
Well, it's been nice to be doing this for all these years.
Yeah.
Okay. Well, we'd love to hear your thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at feedback@cppcast.com, and don't forget to leave us a review on iTunes or subscribe on YouTube.
Joining us today is Denis Bakhvalov. Denis is a compiler developer passionate about software performance. He currently works at Intel. Denis is the author of the book Performance Analysis and Tuning on Modern CPUs and the creator of the Performance Ninja online course. Denis is also a writer on the easyperf.net blog, a host of performance tuning challenges, and a regular on Twitter Spaces talks about software performance. Denis, welcome to the show.
Hi, hi guys. I'm glad to be here.
And yes, seriously, I just wanted to echo Matt: you guys are really legends, like Rob and Jason, being so consistent over the years. I actually started listening to CppCast in 2016, from episode number three, I guess. That requires a ton of discipline, so really, kudos to you guys. And yeah, thanks for inviting me; it's a great pleasure for me to be here.
Yeah, well, thank you so much. It's great to have you on. And I was just going to say, Matt, you were telling me before that you've met Denis before; is that right?
We've chatted over email, I think, over some of the things that Denis is going to talk about. So I'm super excited to, well, first of all, meet him "in person", but virtually, in quotes, and then hear what he's been up to and what he's done since we last spoke.
Awesome. Okay, well, Denis, we've got a couple of news articles to discuss. Feel free to comment on any of these, and then we'll start talking more about what you're up to.
Okay, sure.
Okay, so this first one is a YouTube video: Making the Classic Minesweeper Game using C++ and SFML. This one was really fun to watch. It's only about seven minutes long, but the author kind of goes through the whole process of creating the game and the bugs he ran into during it.
And yeah, it was very fun to watch, very entertaining.
It was a huge amount of fun.
I laughed myself silly on that.
But I have to feel like, as a sort of moral, if not actual, representative of Jason: there were an awful lot of best practices being broken in the source code, if you actually looked in the background and paused the video, or went and looked at the GitHub repo. I mean, it's a great format to get people interested, to show them how relatively straightforward it is to make a game like Minesweeper, and with all the nice effects and everything. But I would question the quality of the code a little bit; it could be sharpened up a bit. But it was a lot of fun, and I definitely encourage people to go and watch the video and have a laugh.
Yeah, definitely.
Okay.
Next thing we have is on the Visual C++ blog, and this is edit your C++ code while debugging
with hot reload in Visual Studio 2022.
So this is pretty cool looking.
The C++ edit and continue feature
has been a part of Visual Studio
for I think a pretty long time,
where when you're on a break point
and the debugger has stopped,
you can change some variables or insert some more code
and then resume and actually change the program as it runs. But now you can actually just change your code without being on a breakpoint, and then just hit this hot reload button, and it instantly changes your application. It looks pretty amazing.
I was going to say: as a compiler author, you must appreciate the sheer amount of magic that's going on behind the scenes to make that even feasible, right? I can't imagine what's going on.
Yeah, yeah, it looks very impressive. They must be doing a lot of stuff under the hood.
Yeah, definitely. It's probably, well, apart from Visual Studio the editor itself, which is great as a package, as an IDE, it's probably the one thing I miss, spending my time nowadays away from Windows: the ability to have edit and continue, which, as you say, has been around for 20-odd years now. I remember using it on the original Xbox as a kind of: oh gosh, we got to the place where the game crashed; hit the breakpoint; oh, is it this? Oh yeah, we can fix it and carry on and see if we get past it. All those kinds of things make a huge, huge deal when you've just been playing something for 15 minutes.
Yeah, very exciting. Yeah, definitely.
Okay, and then the next thing we have is just a new version update for the spdlog library. They're adding support for compile-time format string validation, which I think we talked about a few weeks ago, when format was updated and we went over a blog post about how this compile-time validation works. So now spdlog can take advantage of that as well, which is a really good change: you get compile errors if you put in something wrong.
Absolutely, especially as those log lines that are least likely to be hit are probably the ones that fire in panicked situations you're not expecting. And if they happen to have format string problems, the last time you want to find out is when they're actually executing. It's like: oh, we crashed. Why did you crash? Oh, because there was a bad situation, and while logging it, we threw an exception.
Right, right.
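To make that compile-time checking concrete, here is a minimal sketch, assuming spdlog 1.9+ compiled as C++20 so the bundled {fmt} library can validate format strings at compile time:

```cpp
// Minimal sketch of spdlog's compile-time format string validation.
// Assumes spdlog 1.9+ built as C++20, where log calls take a checked
// format string type, so mismatches are rejected by the compiler.
#include <spdlog/spdlog.h>

int main() {
    spdlog::info("processed {} items in {:.2f} ms", 42, 3.5);  // OK
    // spdlog::info("processed {:d} items", "forty-two");      // would not compile:
    // "{:d}" demands an integer, but a string literal is passed
}
```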
Okay, and then, Matt, I'm going to ask you to go over this last article, because you suggested this paper and it is very large; I read through the abstract, but I could not read the whole thing.
It's very much my favorite thing to happen this year, and certainly, given the confluence of being invited to co-host this and knowing that Denis was going to be here, it seemed too rude not to bring it up as something that just came out. It's a paper that describes the new state of the art in simulating the throughput of Intel x86 processors, including all of the really unusual internals that have to be modeled in order to get decent performance figures out. And there has been an awful lot of papers and work on essentially reverse-engineering what's going on inside the chip over the last few years.
Agner Fog and some of the other folks who are listed in this paper have done some things before, but this is within a cycle or two of real-world measured performance for all the different flavors of chip. So we're close to a grand unified theory of what's really going on inside the silicon, which Denis won't be able to tell us about, for various reasons, I'm sure. But it's super exciting to me as somebody who works in performance in my day job, and it's just super cool to know and understand how complicated these CPUs are inside, and how many clever tricks they're pulling on our behalf to make our code go fast. So even just reading the abstract and having an idea of it is cool, I think.
Yeah, this is actually a great paper. So I just went through it really quickly; I spent maybe 15 minutes or so. But it seems to be much, much more precise than everything that we had before, like tools like IACA and LLVM MCA.
And then, the problem is that, first of all, it's really hard to statically predict performance, right? I mean, it's maybe even impossible to predict it perfectly; to some extent, that's really hard to do. And the tools that existed before the new tool they created in their paper largely discarded the effects of various non-deterministic CPU units. For example, any front-end effects were essentially discarded by those tools. What do I mean by that? For example, the alignment of the code was not in the model. Also, those tools always assume that everything is in the cache, so you never hit cache misses, and all the branches are predicted, pretty much 100%. Well, that's obviously not always the case, right?
So, okay, going back to the paper: I think their work is based on the work that they did previously on uops.info. So they somehow leveraged these giant tables, which actually seem to be pretty accurate in terms of latency and throughput of individual assembly instructions.
They've got it down to kind of a fine art: how to measure one instruction and see all the effects that it has, deal with the dependency chains, and all these clever things that they talk about. There's a micro-benchmarking library, nanoBench, that they talk about.
But yeah, it's very exciting. And for me, as you say, it replaces things like IACA, which hasn't really been supported for four or five years. And as you say, LLVM MCA is brilliant, but it's driven by LLVM's interpretation of what the CPU does, which is not necessarily very accurate, and having something more accurate can only make that better. And given that the model LLVM MCA uses is also what Clang uses to do the scheduling of the emitted code, it kind of will actually work its way backwards into the compiler eventually, and hopefully we'll get faster code, which is, you know, nice. Everyone loves faster code for free, right? Yeah.
All right. Well, Denis, I mentioned while reading your bio that you have a book, a blog, and an online video course where you focus on performance, and I just wanted to start off by asking how you got to have this focus on performance in C++.
Well, let's see. I mean, I was always interested in performance, in making things faster. I remember those funny days when I first tried Intel VTune. I think at that time it required that you instrument your code; it was back in maybe 2009. So you first had to wait maybe half an hour or an hour, depending on your application, while VTune instrumented your code, and then it would actually run and show you the profile. So those were the funny days, and I must say that at the time I had no idea what it was doing, how it was working, and so on and so forth. But when it showed you the list of hot functions and hot places, well, it was super cool and interesting. So, I don't know. And then, fast forward maybe six or seven years, I joined Intel. You could say it was a natural step for me. So that's how I really started working on performance on a daily basis, more or less. Yeah.
So that's roughly the story.
And then, if you're wondering about the book and how I got to that: this also came to me, let's say, organically. I started my blog roughly at the time when I joined Intel; this was around 2016, 2017 maybe. And I began documenting things that I learned about performance, about the work that I was doing at Intel, like this whole profiling thing, and optimizing code, and stuff like that. And then, well, I collected some knowledge, let's say, on my blog, and then a couple of folks reached out to me, like: hey, you have this good information on your blog, but it's scattered; it's not systemized. Maybe you should write a book. And this is actually how my book was born, I would say. Okay. And then the online course, I think, is the same thing: my book simply lacked the practice piece of it. So I decided to go beyond that; maybe I could create some place where folks can practice optimizing and tuning their code. And this is how Performance Ninja came to life.
Could you tell us a little bit more about that? I mean, do you find that people need to practice it? Is that what made you move to make a site to be able to practice these things? I know from my own experience that until you run that profiler, your intuition is 100% wrong every time. Even when you've accounted for the fact that you know your intuition is wrong, you're like: I know where it's going to be. And you run the profiler and it's like: no, it's in strcpy. What? I didn't even call strcpy! You know, those kinds of things. So is that the kind of thing that led you to go: no, people actually have to put this into practice themselves? And maybe tell us what Performance Ninja is.
Yeah, sort of, sort of, yeah. I mean, I actually think that performance optimization and tuning may become even more important than it was during the past three or four decades. And I think there are a couple of reasons for that. So, first of all, I think it is driven by the current technology. Like, if you were to ask me whether Moore's law is dead, well, there's no easy answer. But I think if you look at this question from the software perspective, from the software vendors' perspective, then I think the answer is yes. And what I mean by that is that software vendors now have to spend more resources on optimizing their code, because hardware does not provide major performance boosts anymore, unfortunately.
I mean, those transistors are going into things that aren't making things faster; they're just giving us more capabilities, or they're making caches bigger, things that aren't necessarily directly related to performance in the same way they used to be, say, 20 years ago, when you could guarantee: I'm going to get one more gigahertz in four years' time.
Right.
Right.
Yeah.
So I think, you know, I'm not saying the performance improvements are completely gone, but look at single-threaded performance; this is still very much important. Not only multi-threaded performance, but single-threaded performance is also very important for many client applications that run on your laptop. So that's the first thing. And then the second thing is that previously, in the so-called PC era, the companies that sell software, their software would usually run on the clients' machines. So software vendors did not pay the electricity bills for their customers, right? Right, right. But now the situation is changing. Everything moves to the cloud, and software vendors now pay their bills, their so-called cloud bills, themselves.
Right.
So suddenly it's their problem now.
And so suddenly they were motivated to fix it.
Whereas before they were sort of distributing it by, well, I have a million customers.
I've got a million CPUs.
Now they're like, no, I'm paying for the CPUs.
Right.
Right.
And I think that those are two fundamental reasons why we'll see more people start focusing more on performance.
And I think that those people
that program in Python and other languages,
they may be affected even more than us.
But I mean, well, we'll see.
You talked about tuning a bit, and I just wanted to go into the tooling a little bit more, the tooling that you focus on when you're talking about performance tuning and teaching performance tuning in the Performance Ninja class. Can you tell us more about your experience with tooling, what you like to use, and what you get out of it?
What I like to use? Let's see. Well, I mean, Intel VTune is my sort of go-to tool. But I actually think that more frequently I prefer just running Linux perf, because it is installed on every Linux distribution that I end up touching. So, for me, it's just simpler and faster to run Linux perf than to unleash the whole power of Intel VTune, let's say, if you will. But for the most part, Linux perf is also a great tool. It supports most of the things that Intel VTune can; it's just that it lacks a graphical interface, right? And sometimes that's actually hard to replace. In VTune, you have this timeline where you can zoom in and filter on a particular interval of when the program was executing. That's a really cool feature that you unfortunately cannot get with Linux perf. But for the most part, Linux perf is a great tool as well.
I was going to ask, actually, what kind of things do you do?
So maybe just for those who aren't as familiar with what Perf is,
could you explain the kind of things that both VTune and Perf are doing
behind the scenes and what kind of information you can glean from them?
Sure.
Yeah, so Linux perf is actually a basic performance profiler, let's say. What it does is really simple, and the best way to explain it is that everyone is familiar with the debugger, right? The debugger is essentially the simplest performance profiler, if you will; you can view it like that. If you run your program under a debugger, and you interrupt it once per, let's say, 10 seconds, and you record the place where you stopped, and if you repeat this process, say, a thousand or ten thousand times, then you will collect ten thousand samples. And if you build a histogram of those, you will be able to see where the program was interrupted the most. That will be your hottest place. And this is an oversimplified explanation of what profilers do under the hood: they are essentially capable of interrupting your program a high number of times during a small period of time, and that's how they tell you where the hottest places in your program are. That's, I think, the simplest explanation.
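As a toy illustration of that sampling idea, here is a sketch (Linux x86-64 only, deliberately simplified; real profilers like perf and VTune use hardware counters, and touching a std::map from a signal handler is acceptable only in a toy like this):

```cpp
// Toy sampling profiler: interrupt the program on a timer, record the
// instruction pointer where it stopped, and build a histogram.
#include <csignal>
#include <cstdio>
#include <map>
#include <sys/time.h>
#include <ucontext.h>

static std::map<void*, long> g_histogram;  // instruction pointer -> samples

static void on_sample(int, siginfo_t*, void* ctx) {
    auto* uc = static_cast<ucontext_t*>(ctx);
    void* ip = reinterpret_cast<void*>(uc->uc_mcontext.gregs[REG_RIP]);
    ++g_histogram[ip];  // not async-signal-safe; fine only for a toy
}

int main() {
    struct sigaction sa = {};
    sa.sa_sigaction = on_sample;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, nullptr);

    itimerval tv = {};
    tv.it_interval.tv_usec = tv.it_value.tv_usec = 1000;  // sample every 1 ms of CPU time
    setitimer(ITIMER_PROF, &tv, nullptr);

    volatile double x = 0;
    for (long i = 0; i < 500000000; ++i) x += i * 0.5;  // the "hot" work

    // The addresses with the most samples are the hottest places.
    for (const auto& [ip, n] : g_histogram)
        std::printf("%p: %ld samples\n", ip, n);
}
```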
Right, right. And I know from personal experience that it's not just time that you can interrupt on: rather than every so many milliseconds, you can say, every 10,000th cache miss, tell me what the heck was causing the cache miss. And again, you can sample and group those. So that's the kind of useful information you can glean that is very hard to get in any other way. Nothing else can tell you: where am I missing the cache the most? And you're like: oh yeah, I'm walking a linked list here. Well, of course. Okay, maybe I shouldn't do that. That kind of thing, right?
Yep, exactly, yeah.
So then VTune, on top of that, is a more graphical way of getting access to the same kind of information?
That's correct, yes. So, Matt, as you described, this is actually a little bit more advanced usage of the profiler: when you really want to know exactly which line of code, which assembly instruction, misses in the cache a lot, or, for example, which branch mispredicts. That's where you sample not on cycles, but on some other event, like cache misses, for example. So this is a more advanced usage, and actually, I wouldn't say I suggest this workflow, if you will. Let me first explain why.
It requires you to have knowledge about all the different performance counters that are available. But there are hundreds of them, and they change with every CPU generation, so you cannot, let's say, keep up. Well, I mean, of course, you can study the documentation and learn what they're doing, what they're measuring, but I wouldn't recommend that. There is, I think, a better approach here.
There is a methodology called top-down analysis. It allows you to characterize your workload from the CPU perspective: where the bottleneck in your application is, from the CPU microarchitecture perspective. For example, there are four major categories: retiring, front-end bound, back-end bound, and bad speculation. Okay, we will not dig into that, but the main point here is that it abstracts away the knowledge about performance counters. It collects all the needed counters automatically; you don't have to care about the specific meaning of any of them. You just run the tool, and it will show you where exactly your bottleneck is: for example, you miss in, let's say, the L2 cache, or you go all the way down to main memory. And only after you figure out what type of bottleneck your application has do you locate the exact line of code where this problem happens.
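As a rough, hypothetical illustration of those categories (not from the episode), here are two toy kernels that a top-down tool would typically classify very differently:

```cpp
// Two toy kernels that top-down analysis would typically classify differently:
// the pointer chase is usually back-end (memory) bound, while the arithmetic
// loop mostly retires useful work. Illustration only.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t n = 1 << 22;              // ~32 MB of indices, larger than L2
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});

    std::size_t i = 0;
    for (std::size_t step = 0; step < n; ++step)
        i = next[i];                            // dependent random loads: memory bound

    double sum = 0;
    for (std::size_t k = 0; k < n; ++k)
        sum += static_cast<double>(k) * 0.5;    // simple math: mostly "retiring"

    std::printf("%zu %f\n", i, sum);            // keep the results observable
}
```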
So this is a really powerful technique. I use this technique myself; it's pretty much the first thing that I do, actually.
Right, so it gives you an overview before you go down the rabbit hole of: which of the 150,000 different counters should I be measuring? It's like: no, the first thing you need to know is whether the problem is cache misses or the problem is bad speculation, and then you can dig and dig and dig down. And it gives you a very high-level view, relatively speaking, of the kind of performance problems you've got.
Right. Yeah, yeah.
Okay.
The sponsor of this episode of CppCast is the PVS-Studio team. They develop the PVS-Studio static code analyzer. The tool helps find typos, mistakes, and potential vulnerabilities in code. The earlier you find an error, the cheaper it is to fix, and the more reliable your product releases are. PVS-Studio is always evolving in two directions. The first is new diagnostics. The second is basic mechanics, for example, data flow analysis. The "PVS-Studio Learns What strlen is All About" article describes one of these enhancements.
You can find the link in the podcast description.
Such articles allow you to take a peek into the world of static code analysis.
You were talking earlier about what you think is going to be driving performance
in the near future.
Where do you see us going with
focusing on performance?
What can C++ developers do to
better handle the way programming is going to change as we get new hardware, new software, new compilers?
That's a complicated one. So, what can drive further improvements, right? Well, I would actually like to approach it from this perspective: there are opportunities and there are challenges. So maybe let's start with the opportunities first. For example, what can drive improvements; what we as software developers can do. So, first of all, better compilers, I mean, in the future. I actually gave a talk at the LLVM performance workshop in February this year, and I tried to answer the question: what will drive the future improvements in the compiler? I was caught by this question, so what I actually did was ask a whole bunch of compiler experts what their view on this question is, and I just summarized all that wisdom.
And so there is this Proebsting's Law, which says that advances in compiler technology make your software run faster, doubling the speed of your code every 18 years. So it means that the performance improvement of the code that is generated by the compiler doubles every 18 years. Of course, it doesn't hold exactly, for various reasons. So what it means is that if you take the Clang compiler now and, for example, the Clang compiler nine years ago, you can expect roughly 50% faster code, right? Just from the compiler, only from the compiler, if you run it on the same hardware, the same operating system, and so on and so forth, on the same platform, but with just a newer compiler.
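Taken literally, doubling every 18 years compounds geometrically, so nine years of compiler advances would buy roughly 41 percent, which rounds to the 50 percent figure mentioned here:

$$\mathrm{speedup}(t) = 2^{t/18\ \mathrm{years}}, \qquad \mathrm{speedup}(9\ \mathrm{years}) = 2^{1/2} \approx 1.41$$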
Some people actually think that it's now not 18 years, but something closer to 50 years. So, you know, that's a really pessimistic view, I would say. And then, going back to how we can make compilers generate faster code: there are actually multiple directions. The major directions are, first, machine learning. Compilers are full of heuristics and cost models, so we can replace those with machine learning models. So that's first. Then the second is that there is a lot of work going on with synthesizers and superoptimizers. Compilers sometimes miss some peephole optimizations and are not able to generate the optimal assembly sequence. For example, if we're talking about SIMD instructions, it's sometimes hard to find the best pattern of assembly instructions, with all those crazy shuffles and so on and so forth. And that's the work that John Regehr and Geoff Langdale and some other people are doing.
And oh, yeah, what I actually wanted to say is the conclusion. One of the questions that I was also asking all those experts is: what is our current headroom in the existing LLVM optimization passes? And I think you will be surprised, but most of the people think that we are chasing the tail now.
Oh, really?
Yeah, and the headroom for us is probably around 1%.
And just so I understand: you're saying that with the existing infrastructure, with the existing set of things that we know, there are only a few percentage points left without something else changing to make it go faster. Am I understanding you right?
I think that's... yeah. What I mean by that is that if you now go and try to polish all the existing LLVM optimization passes, you are not expected to gain more than, I would say, one percent on average.
Wow.
So, I mean, that's again an unfortunate and pessimistic view. But yeah.
So what do you think needs to change? Where are we going to find the next explosion of performance? Does the language need to change? Does the compiler infrastructure? Is there something we're missing somewhere in the line between the programmer writing code and optimal code being produced? Is there something that we haven't thought about? Where is it going to come from?
Right. Well, I mean, what I actually see, and some of the folks also think, is that since tooling can look at the whole program, at its syntax tree, and so on and so forth, we can build better tools that will help developers improve the performance of their code more effectively and efficiently.
When you say tools, you mean tools like VTune and Linux perf?
Right, yeah. And then, actually, one of the tools that I'm personally dreaming about is something like this.
So, imagine you are writing, let's say... for example, imagine that you are writing a matrix multiplication, right? I mean, that's a tricky one, but at the same time, it's not super easy to write super effective code, super optimal code, for every hardware platform. But okay, what I would like to see is, for example, some recommendation system that will detect that I'm writing matrix multiplication code and will say: hey, I think you're writing a matrix multiply.
You're saying that Clippy will pop up and tell me: it looks like you're writing a matrix multiply; have you considered using the Intel library to do this instead?
For example, yeah. Or, for example... okay, so let's say that there is no library function, no library implementation, for the thing that I'm writing. So this tool will tell me: hey, you're writing this algorithm; I have a version that is optimized for your hardware, with exactly the same semantic meaning, and here is the diff. Do you want to apply it?
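As a hypothetical sketch of the kind of same-semantics diff such a tool might propose for matrix multiply (loop interchange, which also happens to be one of the Performance Ninja lab topics mentioned later):

```cpp
// Sketch of a rewrite such a recommendation tool might propose. The naive
// i-j-k order strides down the columns of b, which is cache-unfriendly; the
// interchanged i-k-j order walks both b and c row by row. Same semantics,
// often several times faster. Assumes square matrices and a zero-initialized
// c; a hypothetical example, not any real tool's output.
#include <vector>

using Matrix = std::vector<std::vector<double>>;

void matmul_naive(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i][j] += a[i][k] * b[k][j];  // b accessed column-wise
}

void matmul_interchanged(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i][k];
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] += aik * b[k][j];      // b and c accessed row-wise
        }
}
```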
I mean, this would be great, but it requires, let's say, a database of golden code written by performance ninjas that you can effectively just drop in as a replacement in your code.
So this sounds sort of like GitHub Copilot. I was just going to ask: what were your thoughts on GitHub Copilot? I mean, I know that's not necessarily targeted at improving performance, more just improving programmer productivity, but it's kind of along the same lines as what you're talking about.
Right, yeah.
So this, I think, would be great. And in order to get there, we need to solve the code similarity problem, right? I mean, how can we detect whether two pieces of code have the same semantic meaning?
Right, and do they differ in ways that are intentional or accidental? Like: hey, you start at one and go to n plus one, versus naught to n, but they're near enough the same. Or is that difference important for some other reason? That's a tough problem to crack. And there's also not annoying the programmer: I know from my own personal experience that if a Clippy-like thing popped up every time, I don't know what I'd do, but it wouldn't be very pleasant for my computer. But it's interesting that you think the next wave of improvements will come from tooling, which will help programmers write things more optimally. Is that a fair characterization of what you said?
I think yes. I think yes. Because some of the transformations that can improve the performance of the code are just so hard to do from the compiler's perspective. For example, an AoS-to-SoA transformation: arrays of structures to structures of arrays. That's something, I think, that's not easy to do at the compiler level. You need to have the whole view of the program; you need to know whether someone has references into the middle of this data structure; so you need to do really complicated analysis to make that happen. This is something that can be done much more easily at the software level, by the developer himself or herself. And so if we give developers better tooling, and we are able to detect that such a transformation can be made, or should be made, then I think this will be a much more effective way to spend our time and resources as compiler developers, as tooling developers.
That makes a lot of sense to me, actually, now that you've put it in those terms.
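A minimal sketch of the transformation Denis is describing (a hypothetical example; if a hot loop only reads one field, the SoA layout streams exactly the data it needs):

```cpp
// Sketch of the AoS-to-SoA transformation. With AoS, each iteration drags a
// whole Particle through the cache to read one field; with SoA, the same loop
// reads a dense array of x values, and it vectorizes easily.
#include <vector>

// Array of Structures: each iteration loads x plus the unused y, z, mass.
struct Particle { double x, y, z, mass; };

double sum_x_aos(const std::vector<Particle>& ps) {
    double s = 0;
    for (const Particle& p : ps) s += p.x;  // 8 useful bytes out of every 32
    return s;
}

// Structure of Arrays: the same loop now reads a dense array of x values.
struct Particles { std::vector<double> x, y, z, mass; };

double sum_x_soa(const Particles& ps) {
    double s = 0;
    for (double v : ps.x) s += v;           // every fetched byte is used
    return s;
}
```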
I mean, the thing that I've always wondered is whether it would be possible for compilers to reorder the private members of class structures, to move things so that they pack better, or are more cache-effective, or whatever. But obviously there's definitely a point where you can't do that without there being changes to the behavior of the program which are visible but not important. The compiler can't make that distinction; the programmer can. And so those kinds of things are really... yeah, that's fascinating. It sounds like a sort of different thing: as you say, it's a tool, separate from the compiler. You kind of fire up your program in it, maybe get some PGO-style samples in, and say: okay, recommendation system, tell me what the heck I should do here. And it'd be like: yeah, have you considered moving these things closer together? That kind of thing, right? That's exciting. What an exciting world we live in.
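A small sketch of the layout point Matt raises: the compiler keeps data members in declaration order, but a programmer, or a tool, is free to reorder them (sizes assume a typical 64-bit ABI with 8-byte-aligned doubles):

```cpp
// Sketch of the member-reordering idea: reordering members by hand can
// eliminate padding that the compiler is not allowed to remove itself.
#include <cstdint>

struct Padded {       // laid out exactly as declared:
    std::uint8_t a;   // 1 byte + 7 bytes of padding (to align d)
    double       d;   // 8 bytes
    std::uint8_t b;   // 1 byte + 7 bytes of tail padding
};                    // sizeof(Padded) == 24 on most 64-bit platforms

struct Repacked {     // same members, reordered by hand:
    double       d;   // 8 bytes
    std::uint8_t a;   // 1 byte
    std::uint8_t b;   // 1 byte + 6 bytes of tail padding
};                    // sizeof(Repacked) == 16: one third less cache traffic

static_assert(sizeof(Repacked) < sizeof(Padded), "repacking should shrink it");
```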
i mean i guess short of having like tooling that can do that i guess the only thing we can do is
have better education right for the developers better Better education. Yeah, yeah. Well, I mean,
that's the mission that I'm on, but
yeah,
at the same time,
I don't know. I mean,
I'm not sure if we would want
to be in a world full of
performance ninjas.
I mean,
we don't want to
make this as a requirement, right?
So that's why I think we need to have better tools
and better compilers maybe.
Because, yeah, I mean, performance is hard.
Performance, yeah.
So it's a huge topic. I would say that I think it's easier, and more forward-looking, if we focus on better tools and developer productivity, and allow developers to have more insight into the performance of their code.
So, something that I always think about when I think about performance is that sometimes performance is genuinely a feature.
It's absolutely important that this critical thing is as fast as possible. I mean, I spent a decade working in the trading industry, where that kind of thing is kind of why they're interested in me. But there's always an antagonistic... sorry, I believe there to be an antagonism between performance and readable, testable, lovely, beautiful code, right? And I've spent the last five or six years trying to argue that the compiler mostly lets you do the left-hand thing of writing nice, readable code while still generating really optimal code on the right-hand side, so you get the best of all worlds. But with the kinds of things that you're talking about here, maybe I'm going to have to change my tune in the next few years, because we're going to reach the point where the compiler can't help you anymore. We've done everything we possibly can; we've inlined everything we could inline; we've pulled everything out. The last thing you're going to have to do now is take your lovely class, that looks lovely and is modeling a real-world object, and split it into three little classes that sit in three different arrays somewhere, which makes it a very different proposition. To what extent do you think that's true? Is that the case, do you think, or am I missing something here?
is that the case do you think or or am i missing something here is it um well no no i think i think
uh you know where you can have the best of those two worlds.
And I think that you only need to go for ugly, let's say,
performance optimization tricks in those places where it's only needed.
So I think there is no contradicts here.
So yeah, I mean, you, you just first, you know,
profile the code and you see where, where the,
where are your shortcomings and then you fix those.
And then, I mean, you know, sometimes even, you know,
even, even the simplest and cleanest code might be,
you know, not the best code.
Like for example, if you take a look at the,
at the std max and min functions, right?
I mean sometimes, you know, you know, I mean in and the compiler will generate will
will likely generate the branches for you
like if it will say like if if a is less than B then then then I choose B
Otherwise, I will choose a so so compiler will will you know will usually generate branches
But what if what if you know this branch mispredicts?
Then you better go for a C move instead of the branch.
Well, I mean, okay, anyway, so it's too much detail here.
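For illustration, a sketch of the two shapes being discussed; either function may be compiled to a branch or a cmov, and which one wins depends on how predictable the comparison is, so always measure:

```cpp
// Semantically, both are std::max for int. Compilers may emit a conditional
// branch or a cmov for either; the ternary form often nudges the optimizer
// toward cmov on x86-64. cmov tends to win when the comparison is
// unpredictable, and loses when the branch predictor would almost always
// guess right.
int max_branchy(int a, int b) {
    if (a < b) return b;   // typically compare + conditional jump
    return a;
}

int max_branchless(int a, int b) {
    return a < b ? b : a;  // often compiled to cmov
}
```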
I think these are important distinctions, right? These are the kinds of things where there are trade-offs that I think compilers don't necessarily know how to make. Even experts: I've seen Chandler Carruth do a live presentation at CppCon one time, and he was like: I don't know why this is happening. It was generating a cmov, and I kept going: don't generate the cmov, generate the branch, for God's sake. And he's fiddling around with the code, and eventually he profiles it and goes: no, the cmov was actually... sorry, the branch was the right thing to do in this case, because it was absolutely predictable, or whatever it was; I don't remember the specifics. But these things are hard, and they may be data-dependent. So, yeah, you were saying that std::max might be generating branches or not, and it might not be something that you want to do in all cases. And I'm sorry, I interrupted you, so please carry on.
Yeah, yeah, no, I mean, that's exactly what I meant. But again, as I said, I think that you only need to sacrifice the readability of your code in those places where it really gives you a benefit.
Right; you need to have the four-line comment above it to justify the one small weird little thing you did that wasn't obvious, because that's where you're paying for it. You have to write the apology comment that says: no, we really do need to do this funny thing here, because that's what the code needs, doesn't it?
And then, for me, it's like: and then write a benchmarking test, if you can, to try to make sure that nobody accidentally undoes that, or that a compiler revision doesn't come along and make it so that it's no longer true, or whatever it is you're relying on. You know, try to put a test around it.
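A sketch of that guard-rail idea, using Google Benchmark (the framework choice is an assumption; the episode doesn't name one):

```cpp
// Guard-rail benchmark: if a refactor or compiler upgrade regresses the
// hand-tuned hot path, this makes the regression visible in CI or locally.
#include <benchmark/benchmark.h>
#include <algorithm>
#include <random>
#include <vector>

static void BM_HotPath(benchmark::State& state) {
    std::vector<int> data(1 << 16);
    std::mt19937 rng(42);
    std::generate(data.begin(), data.end(),
                  [&] { return static_cast<int>(rng() >> 1); });
    for (auto _ : state) {
        int m = 0;
        for (int v : data) m = std::max(m, v);   // the code we hand-tuned
        benchmark::DoNotOptimize(m);             // keep the result observable
    }
}
BENCHMARK(BM_HotPath);
BENCHMARK_MAIN();
```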
Sure.
Yeah, this will be great.
Yeah.
So your idea about, you know, tooling that could suggest performance improvement techniques to developers writing code,
do you know if anyone's working on that sort of thing?
I think yes.
And I think actually it's an area of research called machine programming.
Okay.
Yeah, and I actually wrote a blog post about it, I think maybe last year; it's on easyperf.net, on my blog. So that's, let's say, a high-level vision: that someday, maybe, the machines will be able to program themselves. But hopefully after I've retired.
Yeah, don't put us all out of a job, please. Well, I was just interested: we've talked about, only in passing, some of the things that are in your online course. Is there something you can tell us, a little teaser you can give us? Is it a free course? How does the course work for performance training?
I would love to. Yeah, sure.
So this is a free online course. It is also self-paced, meaning that you can come and work on it whenever you have time. The idea is this: we actually built a set of lab assignments. At the moment, we only have seven, I guess. Those lab assignments each focus on a specific performance problem. They are small, minimalistic code samples that exhibit some performance problem that you are required to go and fix.
Right.
So like mini puzzles almost that you have to solve
using the tools and the techniques
that you're teaching people in the course.
Yeah, sort of.
So at the moment we have
lab assignments on
vectorization,
on function inlining,
on loop interchange,
on data packing
and I think compiler intrinsics.
And maybe I missed some.
Yeah, but the point here is this: so this is an online course, right? I recorded some videos. The idea is that you first go and watch the introductory video, where I give an introduction to the specific performance problem. Then you go and try to fix the code yourself. It can take from half an hour to, like, four hours per lab assignment, depending on your background and the level of complexity of the lab assignment itself. And then, after you fix it, you can actually submit it to GitHub. The course is on GitHub, it's free, and there is automated verification and benchmarking attached to it.
Sweet.
Right, yeah. So when you submit your code to GitHub, it will be automatically picked up, and it will be benchmarked. And I want to point out that this is actually good performance benchmarking. I mean, it's not benchmarking your program in some virtualized environment on ARM or any other low-end CPU. I actually offloaded all the benchmarking onto my own Linux box here at my home.
You're very brave.
I'm very brave, yeah. I created an image so that if anyone hacked me, I would just reformat everything, and in half an hour I would have a clean system running again. So I am prepared for that.
But the key here is that CI systems generally, be they GitHub's or anything else you can find out there, are multi-tenanted and virtualized. It's very hard to do performance analysis work when you're on a noisy machine, full stop, even if it's your own dedicated desktop under your desk: it's hard to make sure you're not actually also monitoring Slack or whatever other thing is causing your cache to go wrong. So you've got a system where you take the user's code and run it in a very, very controlled environment, away from virtualization, and then you can give them pretty good results about whether they've made it better or worse. That's very cool. That's very cool.
Yeah. And then, in the end, there is a summary video where I explain how it should be done, how it can be fixed, how you can measure, and so on and so forth.
Cool. Yeah, very cool. Sounds great; we'll have to check it out. Okay, well, Denis, it's been great having you on the show today. Obviously, we talked about the book, the blog, and the video course, and we'll put links to all those in the show notes. Is there anything else you want to tell our listeners about before we let you go?
Well, I think no; we covered a lot of fun topics. Yeah, it was great to be here. Thanks for inviting me.
Thank you so much for coming on today.
Sure.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
We'd also like to thank all our patrons who help support the show through Patreon.
If you'd like to support us on Patreon, you can do so at patreon.com slash cppcast.
And of course, you can find all that info and the show notes on
the podcast website at cppcast.com. Theme music for this episode was provided by podcastthemes.com.