CppCast - Performance Analysis and Optimization
Episode Date: December 6, 2018

Rob and Jason are joined by Denis Bakhvalov from Intel to discuss C++ performance analysis and optimization techniques. Denis is a C++ developer with almost 10 years of experience. He started his journey as a developer of desktop applications, then moved to embedded, and now works at Intel doing C++ compiler development. He enjoys writing the fastest-possible code and staring at the assembly. Denis is a father of two, and he likes to play soccer and chess.

News
- Meeting C++ / Meeting Embedded Conan trip report
- Introducing the C++ Lambda Runtime
- SIMD Visualiser
- Announcing Live Share for C++: Real-Time Sharing

Denis Bakhvalov
@dendibakh

Links
- emBO++ 2018: Denis Bakhvalov on dealing with performance analysis in C and C++
- Code alignment issues
- Basics of profiling with perf
- Performance analysis vocabulary

Sponsors
- Backtrace
- JetBrains

Hosts
@robwirving @lefticus
Transcript
Episode 178 of CppCast with guest Denis Bakhvalov, recorded December 5th, 2018.
This episode of CppCast is sponsored by Backtrace, the turnkey debugging platform that helps you spend less time debugging and more time building.
Get to the root cause quickly with detailed information at your fingertips.
Start your free trial at backtrace.io slash cppcast.
And by JetBrains, maker of intelligent development tools to simplify your challenging tasks and automate the routine ones.
JetBrains is offering a 25% discount for an individual license on the C++ tool of your choice.
CLion, ReSharper C++, or AppCode.
Use the coupon code JETBRAINS for CppCast during checkout at JetBrains.com.
In this episode, we discuss a new feature in Visual Studio 2019.
Then we talk to Denis Bakhvalov from Intel.
Denis talks to us about performance analysis and optimization. Welcome to episode 178 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today? I'm doing all right. Rob, how are you doing?
I'm doing okay. No real news on my side. How about you?
Not much, I guess. Well, next year is starting to get busy with training activities for me.
I've gotten a lot of contacts at the end of the year, but that's about it.
It's always exciting. Yeah, we're approaching the end of the year very quickly. November went by real fast. The next thing I have coming up is the first week of February for C++ on Sea, where I'll be teaching that one-day constexpr class, so I have all of January to make sure everything's fully prepared and ready to go.
That's good. Very good.
Okay, well, at the top of every episode I'd like to read a piece
of feedback. This week
we got a tweet from
Barney Deller and
he was replying to our episode
with Lenny last
week saying, it's nice to hear that I'm not the only
one mob programming in C++.
Thanks CppCast and Lenny.
And yeah, mob programming sounded really interesting.
I've heard of pair programming, but never mob programming.
Yeah.
Well, I've only heard of it from Lenny.
Yeah.
Well, we'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter, or email us at cppcast.com.
And don't forget to leave us a review on iTunes.
Joining us today is Denis
Bakhvalov. Denis is a C++
developer with almost 10 years of experience.
He started his journey as a developer of
desktop applications, then moved to embedded,
and now he works at Intel doing
C++ compiler development. He enjoys
writing the fastest possible code and
staring at the assembly. Denis is
a father of two.
He likes to play soccer and chess.
Denis, welcome to the show.
Hi, Rob.
Hi, Jason.
Nice to meet you.
I'm excited to be here.
You know, I should probably say that I've been a regular listener since episode number one.
Oh, wow.
Yeah, so I'm really happy to be here as a guest, not just a listener.
Yeah.
So you were actually listening from the beginning,
or did you go and catch up at some point?
No, I actually started right at... okay, maybe I'm lying a little bit. It was not episode number one. It was probably episode number three. It was the episode with Manu Sanchez. He was talking about biicode.
So it was kind of really in the beginning.
I'm kind of with you guys.
Yeah, I think that was episode three, right?
That's awesome. Thanks for being with us for so long.
Yeah.
Okay, well, Denis, we've got a couple of news articles
to discuss. Feel free to comment
on any of these, and then we'll start talking about
more of the work you're doing at Intel
with performance analysis, okay? Sure.
Okay, so this first one is a Meeting C++ and Meeting Embedded
trip report from the Conan team. And I think this is the
first year with Meeting Embedded, right? Yes.
Yeah, so they went to both conferences, and they said the Meeting Embedded conference was quite good. They presented a talk there, and I'm happy to see that they're going forward with this new conference. I hope it does well for them.
Yeah. I still want to listen to that "stop teaching C" talk; that's still on my list to do, but I haven't. Have you looked to see how well the videos are going up?
I haven't.
I meant to, because I was interested in watching Andrei's keynote from Meeting C++, which is also covered in this trip report.
Right.
Yeah.
So "what is the next big paradigm" was Andrei's keynote. And I'm a little curious as to what it was really about, because in this trip report, he says Andrei first explored how the next big things for C++ programming and the general world were perceived at the very beginning:
Threads, online voting, NLP, privacy, and ranges.
I'm not sure what they mean by online voting.
Like, do they mean political voting?
I don't have any idea.
Yeah, so I was curious about that.
I need to go and watch his talk.
Yeah.
Denis, did you go to Meeting C++ or Meeting Embedded?
Unfortunately, not yet.
Yeah, but actually, I also saw that there were a number of talks about performance and about data-driven development. So I think this is becoming the trend. Well, I'm not sure, that's just my gut feeling. I might be wrong, but probably everyone sees it the way they want it to be.
Right.
Well, I just checked their YouTube page and I don't see any conference videos from Meeting C++ 2018 yet.
I agree, I did not find any. Were you going to say something else, Jason?
No, that's what I was going to say.
Okay. Were there any other highlights either of you wanted to make about the Meeting C++ trip report?
No, I don't think anything else jumped out at me on that one.
Although it does look like there's a lot of interesting things that went down here.
Yeah, I'm looking forward to some of these talks, maybe online.
Yeah.
Okay, the next one is introducing the C++ Lambda runtime.
And this is from the AWS blog.
I have never heard of AWS Lambda before.
Have you, Jason?
I have.
I have friends who use AWS for all the things all the time because of work.
Right.
So I guess AWS Lambda just allows you to run, like, Ruby or C# or other languages, and the code kind of runs without you having to worry too much about the server configuration. That's my understanding of it.
Yeah, it tends to be like a simple snippet: you have this thing that you want to do, you just do it, and then it gives you the artifact or results back, or whatever comes from it.
Right. So now you can do that with C++. They have an SDK, and they have a pretty thorough example here, both doing a hello world and then doing a more complex AWS Lambda application with C++.
Yeah, and the example they give here is basically, I think, kicking off a transcoding of something. I think that's the final example, the C++ encoder.
Yeah.
Yeah.
So, yeah, seems pretty exciting
if you're interested in doing more in the server and web world.
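For a flavor of what that looks like, here is a minimal handler sketch using the open-source aws-lambda-cpp runtime library; the handler body and return values are illustrative, and the exact API should be checked against the AWS blog post and SDK documentation.

    #include <aws/lambda-runtime/runtime.h>

    using namespace aws::lambda_runtime;

    // Called once per invocation; the incoming event payload is in req.payload.
    static invocation_response my_handler(invocation_request const& req)
    {
        return invocation_response::success("Hello from C++ on Lambda!",
                                            "application/json");
    }

    int main()
    {
        // Hands control to the Lambda runtime loop, which polls for events
        // and dispatches each one to the handler above.
        run_handler(my_handler);
        return 0;
    }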
Mm-hmm.
Okay.
And then the next thing we have is the SIMD Visualiser project. We were talking with Jeff Amstutz two weeks ago about SIMD, so I thought this was a pretty interesting tool. You can run it just by going to this page in your browser. It basically has a bunch of SIMD code, and you can kind of see what it does line by line in a nice visualization.
Did either of you play with this on the website?
Yeah, I did.
Go ahead.
Yeah, sure.
So I played a little bit with it, so it looks really cool.
Yeah.
I think it's just another way we can leverage Clang and its tools, because I assume it is built on Clang. It's really cool. I myself still use pen and paper to visualize how the vector code works. So this tool is really handy, for beginners at least.
Right.
I was kind of hoping that the example would not rely on intrinsics, but have something that showed what the compiler would generate. But it doesn't. It's almost like I want a melding of the output from Compiler Explorer piped into this, so I can see what the compiler actually did. Actually, in a class I was teaching last week, we were looking at some SIMD stuff that was being generated by the compiler, and I was reasoning my way through about three-quarters of it and kind of wishing there was a way to easily visualize it.
Right.
Yeah, it'd be nice if we could kind of take the type of code
Jeff was talking about with his SIMD wrappers
and be able to visualize that.
Right.
Okay. And so...
Go ahead, Denis. Sorry.
So probably, like, with assembly instructions, it's also not obvious, right?
I mean, just even from their names, it's not obvious what they are doing sometimes.
Oh, yeah.
So just to have this kind of tool that will tell you,
okay, now I'm doing addition with two vectors.
Now I'm doing subtraction.
So it's pretty handy.
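For readers who have not used intrinsics, here is a tiny SSE example of the kind of code such a tool visualizes (a minimal sketch, independent of the visualiser's own samples):

    #include <immintrin.h>
    #include <cstdio>

    int main()
    {
        // _mm_set_ps takes its arguments from the highest lane to the lowest.
        __m128 a   = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);   // lanes: 1, 2, 3, 4
        __m128 b   = _mm_set_ps(40.f, 30.f, 20.f, 10.f);   // lanes: 10, 20, 30, 40
        __m128 sum = _mm_add_ps(a, b);                      // per-lane addition

        float out[4];
        _mm_storeu_ps(out, sum);                            // unaligned store to memory
        std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 11 22 33 44
    }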
Yeah.
Yeah.
Okay, and then the last article I have was just announced the other day: Visual Studio 2019, which there is now a preview out for, is going to have the Live Share feature.
And I think they first announced this feature.
I'm not sure if they mentioned C++ when they first announced it.
But this is going to allow you to, from one Visual Studio instance,
kind of send what you're working on
and let someone collaborate with you
who's, you know, maybe at some other remote location.
And as long as they're using Visual Studio
or if they're using Visual Studio Code,
the two of you can then work together.
You can see what you're debugging together,
see the call stack, see, you know, different variables.
So it seems like it's a pretty powerful feature and should really help remote developers.
Yeah. In the example they say Visual Studio or Visual Studio Code. The thing that's not obvious to me would be whether it could somehow be a mixed environment, which I would assume not, because that sounds like it'd be extremely difficult to get right. But no, actually, it does look that way, doesn't it?
Yeah, yeah. One person using Visual Studio could share to someone using Visual Studio Code.
That is my reading. That's what it says. Yes.
All right. Well, that's kind of awesome.
Yeah, yeah. And this is definitely something I see myself using, because I work with several developers who are in other locations.
And, you know, we talked so much about pair programming and mob programming last week,
but being able to do pair programming with someone who doesn't, you know, sit right next to you would be nice with this.
Well, and it says there is a host and guests, so you could do mob programming with it as well.
Okay.
Okay.
Multiple people watching you and then have like a Skype session or something to talk to them?
I guess.
Yeah.
Very cool.
The guest even gets IntelliSense from the host.
Yeah, that could be pretty impressive.
So, Denis, why don't you start off by telling us a little bit more about the work you do at Intel?
Yeah, sure. So my team and I, we mostly do development of ICC, which is also known as the Intel Compiler, but we're not limited to that. We're also contributing to the LLVM open-source project and to GCC. We basically make sure that we, I mean we, the compiler, generate high-quality optimized code for the x86 platform, for Intel architecture specifically. We're also enabling new instruction set architectures for new CPU generations. For example, when a new CPU is coming out, we need to have support for it in the compiler, so the compiler will generate new instructions for new types of hardware. That's basically what we're doing.
So when you say you contribute to LLVM and GCC, those are the kinds of things you contribute as well, to their optimizer and code generator?
Yeah.
Okay, that's cool. So, building on what you just said: when a new processor architecture is going to come out, you make sure that it's supported well. If I understand you correctly, you're going to be making sure that GCC and LLVM are ready when the CPU goes live?
Yeah, correct.
Okay, very cool. So what type of work do you do with optimizations, exactly?
Yeah. So basically we try to find new optimizations in the compiler, and we also try to tune existing ones. Just to give you an idea, let me maybe first explain the subtle difference between what we are doing and, let me call it, standard development. We have a fixed set of benchmarks, and we are not touching their source code. So the benchmarks are fixed, but each new day we will have a new compiler built from sources, say a new Clang built from the latest revision. Then we take this compiler and we build all these benchmarks, and then we run them. And it can happen, and it usually happens, that the new version of the compiler will generate different assembly, different code, for the same sources.
Yeah.
And that can cause a performance regression or gain. So that's how we are tracking the compiler's performance. And, of course, if there is a regression, we should go and fix that, well, if it's fixable, let's say. And then, of course, we are also trying to find new optimizations, like what we can do, for example, to improve this benchmark. I should say that those benchmarks we're working on are not contrived. They are real-world applications, just limited, or let's say cut down, to resemble a real benchmark. So, for example, the Perl interpreter, or, for example, the GCC compiler. We use our new version of the compiler to build GCC, and then this GCC will compile some sources, and we will benchmark that.
Oh, wow.
Yeah, so that's kind of what we are doing.
Speaking of optimizations, well, the most typical optimization... I probably can't speak about my real work, because it's kind of proprietary, but just to give you maybe a taste. For example, we tune the inlining: should we inline this particular function or not? Then we also look at, for example, loop unrolling and vectorization. Should we vectorize this loop? Should we unroll this loop? If we should unroll it, then by how much, and so on. But also, for example, most of the optimizations are trying to deduce something from the source code. For example, if you have a virtual function call, but then you can see the whole program, and in this whole program there is only one instantiation of the base class, so there is essentially only one client that implements this interface, then you can be safe in just converting this virtual function call into a direct call. So those are kind of the basic optimizations, what we're doing, what we're tuning.
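A minimal sketch of the devirtualization pattern Denis describes, with made-up class names:

    struct Decoder {
        virtual int decode(int x) const = 0;
        virtual ~Decoder() = default;
    };

    // Suppose whole-program analysis proves this is the only class that ever
    // implements Decoder...
    struct Mp3Decoder : Decoder {
        int decode(int x) const override { return x * 2; }
    };

    int run(const Decoder& d, int x)
    {
        // ...then the compiler can safely turn this indirect (virtual) call
        // into a direct call to Mp3Decoder::decode, and potentially inline it.
        return d.decode(x);
    }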
So you said...
Go ahead.
Please go ahead.
You said a new version of the compiler
will often change things.
And if I understood, you said you're doing
nightly builds of compilers.
Do you see on a
daily basis that your performance
characteristics
change from
nightly builds of LLVM? Are you
monitoring it that closely?
Yeah, frequently.
Okay.
Yeah, frequently.
Wow, that's...
I must say that, of course,
there is always some noise there.
So we kind of try to filter this noise.
So, for example, if the benchmark regressed by 0.3%, we probably will not even look into this.
But if the benchmark regressed, I don't know, 50%, then, I don't know, it's
kind of a red flag for us.
Right.
Well, if it also increased
by, like, got better by 50%,
is that also a red flag? Do you assume
something got broken in some weird way?
Well,
it's not always broken, yeah?
Because, for example,
let's imagine the benchmark which consists only of one hot loop.
And, for example, yesterday the compiler was not able to vectorize it.
But today, the compiler suddenly starts vectorizing this loop.
And, wow, we have 2x performance.
So, that's possible.
I mean, it's not probably common for, let's say, it's not happening every day.
And it's not happening for the, let's say, for mature benchmarks.
Okay.
Because, let's say, most of the benchmarks have multiple hotspots, not just a single loop, right?
Right.
So it's really rare that we'll see such jumps in performance on the nightly builds.
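The single-hot-loop scenario is easy to picture with something like this; whether it actually gets vectorized depends on the compiler version and flags:

    // A benchmark whose entire runtime is one hot loop. If the compiler learns
    // to vectorize it overnight, the whole benchmark can speed up by a large
    // factor; if it stops vectorizing it, the benchmark regresses just as hard.
    void scale(float* __restrict dst, const float* __restrict src, int n)
    {
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * 3.0f + 1.0f;
    }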
Okay.
How do you go about finding new optimizations?
Yeah, so this is the most complex part of our work.
Sure.
So basically we're doing performance analysis. What is performance analysis? Well, usually we start with profiling the benchmark. And what is profiling? We try to find the hot places in the benchmark.
Then, if you look into the profile, it will show you the hot source lines or maybe assembly instructions. If you go into this assembly view, then you can probably spot some inefficiencies there, or you will come up with some idea of how you can make it better. This of course also requires you to have some knowledge of reading assembly. And I know that not a lot of people these days are doing this, I mean, looking at and reading assembly. But still, this knowledge is really essential for doing good performance analysis. So what you can try to do next is, for example, you can try to just hack the assembly, I mean, if you can do this, to do quick experiments.
Okay.
So take the binary and say, well, what happens
if I replace this instruction?
So you cannot
modify the binary, right?
Okay.
It's not trivial to do this,
but you can
emit the assembly listing from your program
and kind of go from there.
Or actually, so there are maybe also the higher level tools that you can use.
For example, the compiler has something that is called optimization reports.
And it's also actually integrated into the Compiler Explorer.
So there is a separate
kind of window that you can
put on the side
along
with your source code and assembly.
And those optimization reports will show you what the compiler did for you with your loop, for example, or whether a function was inlined or not. So, even without looking into the assembly, you can know what to expect when you do look into the assembly, right?
So, for example, if you see that the compiler inlined your function, well, okay, you will probably not find it in the binary, right? Because it was inlined.
Right.
And the same goes for loop unrolling: it shows you the factor by which the loop was unrolled. It also shows you the vectorization report: was the loop vectorized, and if yes, what was the vectorization factor? And so on.
So this is kind of the higher-level view.
For example, if we see that the loop was not vectorized,
well, then we probably will think,
like, okay, is it possible at all to vectorize?
If yes, then probably it's a matter of cost model,
and we can probably tune it.
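If you want to see these reports yourself, the mainstream compilers expose them through flags; a sketch (the flag spellings below are typical for recent Clang, GCC, and classic ICC, so verify against your compiler's documentation):

    // sum.cpp
    float sum(const float* a, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)   // the report says whether this loop was
            s += a[i];                // vectorized or unrolled, and if not, why
        return s;
    }

    // Typical ways to request an optimization report:
    //   clang++ -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c sum.cpp
    //   g++     -O3 -fopt-info-vec-all -c sum.cpp
    //   icc     -O3 -qopt-report=5 -c sum.cpp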
Yeah.
Okay.
So when you make one of these determinations, you say, okay, this loop could probably be vectorized, you prove it out, and then you implement some changes in the compiler. How often do you end up causing a regression in some other bit of code?
Oh, yeah, that's actually the question that I was expecting.
Yeah.
So, and it happens all the time, really.
I, well, yeah, unfortunately it happens all the time.
I actually have a number of great examples that you will like.
Yeah, so.
Okay.
Yeah.
So usually, when you have a really big suite of benchmarks, it's not really possible that you will optimize one benchmark without regressing the others. It's just, let's say, a reality. We should all agree with this. But actually, you might be doing a really good thing, I mean, you can do a really good optimization without harming anything else, but still have a regression on some benchmarks. Like, for example, your optimization removes a couple of unnecessary assembly instructions.
That sounds like a really good thing, yeah?
Right.
And there is no way it can affect, let's say, other benchmarks in a negative way, right?
But for some reason you see that, oh, you have a regression on some other benchmarks.
So, and the reason for this is actually quite interesting.
And now let me maybe give you an example.
So, imagine you have a benchmark with only one hot function.
And it's just a simple function, let's say: it takes one array, just iterates over this array, and increments each element by one, for example. And life is good, you have some numbers for your benchmark, you are tracking it, and then one day, someone inserts another function that is completely cold. No one is calling this function, but it happens to be just before the function that you are measuring.
Yeah?
Okay.
So it kind of just stays in the binary.
It was not eliminated from the final binary by the compiler.
It's just there.
But what actually happens is that your hot function was moved down in the binary, and now it has a different offset.
And just by doing this, I saw the cases where performance goes
up or down by
25%.
25% is just huge.
It's just enormous
for us. Most of our optimizations are within 2%, 2-3%. If we implemented some optimization which gives, I don't know, 5%, we can celebrate right now, because it's just huge money for Intel.
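A toy sketch of the scenario Denis describes, with made-up function names:

    // benchmark.cpp

    // One day someone adds this function. Nobody ever calls it and it is never
    // removed from the binary, but because it is emitted ahead of the hot
    // function it pushes hot_loop() to a different offset, and possibly a
    // different alignment, in the final binary.
    void never_called()
    {
        volatile int sink = 0;
        for (int i = 0; i < 100; ++i)
            sink += i;
    }

    // The hot function being measured. Its source is untouched, yet its measured
    // performance can move noticeably just because its placement changed.
    void hot_loop(int* a, int n)
    {
        for (int i = 0; i < n; ++i)
            a[i] += 1;
    }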
Yeah.
So, yeah.
So I actually wrote a blog post about exactly this. This problem is called code placement, or code alignment, depending on who you ask.
Yeah.
And actually, if we look into this, the only thing that changed is the layout of this function in memory, right? Nothing else was touched. And it's not only limited to the question of whether my function now occupies multiple cache lines or not. It's usually...
Go ahead.
I was just thinking through what you're saying.
Maybe it was in one cache line, but code pushed it and now it's across two cache lines; or it was across two cache lines, and now it's in one cache line.
Yeah, so we should probably say that it's instruction cache line, right?
Yeah, but it's not limited only to iCache, to instruction cache.
There are also a number of structures in the CPU front end which can be the bottleneck in those kinds of cases.
Okay.
So what do you do about it?
Once this has happened
do you have any way of fixing this?
It seems like it would make
the binaries get very large if you tried to
always ensure that every function
started on a cache alignment or something.
Correct, yeah. So the first problem is that the binary size goes up. But the second problem is that if you try to insert NOPs into the binary, okay, you can probably insert them before the function. That will probably cost you nothing, because they will not be executed. But actually, function alignment is not the end of the story.
We can also align the loops. And if we misalign them, or choose a bad alignment for our loops, we can also cause damage to, let's say, our performance.
Right.
And if we try to align loops, we insert NOPs that will be executed. That's probably still cheap, but not, let's say, cost-free, because we still need to fetch and decode those NOPs, right? And then we'll probably just throw them away. But still, we need to fetch and decode them.
So, and actually...
Go ahead.
It seems like the NOPs are also taking up space
in the iCache.
Yeah, absolutely.
Okay, just making sure.
In the iCache, and also there is a dedicated uop cache.
It's another structure in the front-end, which kind of caches the instructions after they were decoded.
Okay, I didn't know that existed.
It's kind of: when we have already fetched the instructions and now we try to fetch them again, we will not re-fetch them, because we already have them decoded in our uop cache, the so-called uop cache, or decoded cache.
Yeah.
Okay, so I'm going to risk making the questions we're talking about here more complicated. So a function gets added, it changes the cache alignment, you do something to tweak things so that you get back the performance loss, whatever, maybe you come up with a NOP that is worth the cost, and then, I don't know, hypothetically, you're testing on an eighth-generation Core processor.
And then do you go and see what the impact was
on adding that NOP to a fourth-generation processor?
Or do you test regressions backward?
Or are you always on the latest hardware?
Well, yeah, I mean, we do this. We also track the previous generations, but let's say to some extent. Because, again, there is always noise, and we have to somehow defend against this noise. So what we can do, and we are probably doing this, is make the noise threshold for the older platforms a little bit bigger than for the newest generations. So we look closely at the CPUs from the new generation, and a little more loosely at the previous generations. It also, of course, depends on how big the regression or gain is.
So if the gain is, let's say, 10%, well, that's big enough for us to start an investigation. But I should say that the 25% I mentioned is probably somewhere on the extreme. Usually we see jumps of around 1% to 5%.
It's probable that some benchmark will go up or down by that amount
just from the code alignment issues.
Wow.
Yeah.
And now I'm curious, because I know that GCC and Clang at least have this option, and I have not spent a lot of time with ICC, but I believe it tries to at least be command-line compatible with GCC, if I have that correct. Where do these things come into play if I do, like, -mtune and say I want to tune it to this specific CPU or something? Do you do these kinds of tiny little details of tuning, taking into account these things between processor architectures?
Yeah, so I'm actually... So speaking of code alignment, I'm not sure if there is a... Well, okay.
Well, code alignment or other similar kinds of things.
Oh, yeah, sure, sure. Yeah, of course. So actually, in general, it's a good idea to pass those flags. If you know that your application will run on a specific type of hardware, let's say on Skylake CPUs, the sixth generation of Intel Core architecture, then of course you should just go for it and pass all the flags needed, the special flags for targeting a specific generation of the CPU.
Yeah, sure.
Of course, yeah. But if you want to be a little bit conservative about this, you can probably go for a minimal version of the CPU that you know for sure is the minimum, where no one will try to use older CPUs to run your application. You can just choose the more appropriate flags.
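In GCC and Clang terms, that advice corresponds to the target flags; an illustrative sketch (exact spellings and supported values vary by compiler version, so check your toolchain's documentation):

    // app.cpp -- hot numeric code that benefits from targeting the real hardware
    void axpy(float* y, const float* x, float a, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    // Typical invocations:
    //   g++ -O3 -march=skylake   app.cpp -c   // use everything Skylake offers;
    //                                         // the binary may not run on older CPUs
    //   g++ -O3 -march=x86-64-v2 app.cpp -c   // a conservative "minimum CPU" baseline
    //                                         // (supported by newer GCC/Clang)
    //   g++ -O3 -mtune=skylake   app.cpp -c   // tune scheduling for Skylake while
    //                                         // keeping the generic instruction set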
Mm-hmm.
Okay.
I want to interrupt this discussion for just a moment
to bring you a word from our sponsors.
Backtrace is a debugging platform
that improves software quality, reliability, and support
by bringing deep introspection and automation
throughout the software error lifecycle.
Spend less time debugging
and reduce your mean time to resolution
by using the first and only platform to combine symbolic debugging, error aggregation, and state analysis. At the time of error, Backtrace jumps into action, capturing detailed dumps of application and environmental state. Backtrace then performs automated analysis on process memory and executable code to classify errors and highlight important signals, such as heap corruption, malware, and much more. This data is aggregated and archived in a centralized object store,
providing your team a single system to investigate errors across your environments.
Join industry leaders like Fastly, Message Systems, and AppNexus
that use Backtrace to modernize their debugging infrastructure.
It's free to try, minutes to set up, fully featured with no commitment necessary.
Check them out at backtrace.io/cppcast.
So we've talked a lot about benchmarking.
Can you tell us a little bit about what types of tools you use for performance analysis?
Oh yeah, sure.
So I think the go-to tool for profiling is Linux perf. This tool is capable of doing most of the things an engineer needs to do performance analysis and optimize an application. Usually when Linux perf is not enough, I go for Intel VTune. It has a nice GUI and a little bit more capability than perf. But besides that, I also use the good old binutils: objdump, nm, the tools that we are all familiar with. The point here is that nothing stops you from doing performance analysis right now. Most of the tools are free and they are amazing, like perf, for example. It's just amazing what perf can do.
Yeah.
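As a starting point, a typical perf session on a small hot-loop program looks roughly like this; the command lines show the common record/report workflow, and options vary between perf versions:

    // hot.cpp -- something worth profiling
    #include <vector>

    int main()
    {
        std::vector<long> v(1 << 22, 1);
        long sum = 0;
        for (int pass = 0; pass < 200; ++pass)
            for (long x : v)
                sum += x;
        return static_cast<int>(sum & 1);
    }

    // Build with optimizations plus debug info so perf can map samples back to
    // source lines, then record and inspect:
    //   g++ -O2 -g hot.cpp -o hot
    //   perf record ./hot      # sample where the cycles go
    //   perf report            # interactive view of the hot functions
    //   perf annotate          # drill down to the hot assembly instructions
    //   perf stat ./hot        # counters: cycles, instructions, branches, ...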
You've mentioned Compiler Explorer. I have to ask, how often do you use Compiler Explorer to give yourself a quick snapshot comparing what your work has been across different compilers?
Right, correct. I actually was thinking about how I can make this work for me. But unfortunately, you know, we have a brand new compiler every day, so for me it's kind of impossible to integrate a new version into Compiler Explorer. I just haven't spent time on this. Maybe it's possible.
Yeah, if you're running it locally, it would actually be pretty easy to do. Well, I don't make a build every night, but I do just use whatever my nightly build is.
Right, but then you should integrate them all into Compiler Explorer. I mean, if you have a number of nightly compiler snapshots which were built, you should somehow integrate them and then keep the history. And then, what happens is, for example, most of our work is on remote machines, so I'm not developing on my laptop. We have a number of dedicated servers that we are just SSHing to, and then we go from there: build the compilers, build the benchmarks, run them. So I think it would be pretty hard to do.
Right. Yeah, it would be a lot of work. Probably possible. We'll see if Matt's listening and he decides you can add a wildcard for your compiler search or something.
Right. Yeah, so I'm curious: as you're optimizing this code and working all of these benchmarks, what kinds of things do you see as really hard to optimize? Like, what should C++ programmers stop doing so that they get better code out of our compilers?
Yeah, so...
Well, the most obvious thing is: do not try to do the compiler's work, because the compiler is probably better than you at this. So probably the general advice is: do not unroll loops yourself, the compiler will probably do it better. Do not try to inline things yourself, the compiler will probably still do it more aggressively than you can.
What about the inline keyword itself? Or always_inline? I assume you would say don't do always_inline, because the compiler knows better.
Yeah, yeah.
Okay.
Yeah, so speaking of the inline keyword, off the top of my head, I'm not sure if this keyword actually still makes sense. Well, obviously, the noinline attribute makes sense, because it prevents the compiler from inlining.
Yeah.
Okay. I remember there was a great post by Simon Brand who actually dived deep into specifically this problem: does the inline keyword still make sense? He has a great article on this topic.
Okay.
I'll look for that.
So, yeah.
So, don't try to do the compiler's work.
Yourself.
I like that
But then, it's almost a little depressing how you said something so simple, like adding a function, can push our code around and have implications that we never expected. What do we do about that? Do we worry about that at all?
Yeah, so actually we spent some effort trying to figure out how we can solve this problem, and we haven't, I mean, gotten to any good decision about this. Whenever we choose, for example, to align all the loops to this 32-byte boundary, let's say, some of our benchmarks go up, but some of the benchmarks still go down. So in the end, the performance is still kind of the same.
So this problem is still kind of unsolved.
And this problem, I should say, is probably the most harmful for us compiler developers, because it means you cannot rigorously test your optimization. Some of the benchmarks will go up, but some go down, and you want to know why they go down, but the reason is probably just the noise and the different placement, the different layout.
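The 32-byte alignment experiment corresponds to flags you can try yourself; a sketch using GCC-style spellings (recent Clang accepts similar flags, but verify against your toolchain):

    // loops.cpp -- the same source, different code placement depending on flags
    void touch(int* a, int n)
    {
        for (int i = 0; i < n; ++i)
            a[i] ^= 1;
    }

    // Force 32-byte alignment of loop headers and function entries, at the cost
    // of extra padding NOPs and a larger binary:
    //   g++ -O2 -falign-loops=32 -falign-functions=32 -c loops.cpp
    //
    // As Denis notes, across a large benchmark suite this tends to help some
    // benchmarks and hurt others, so it is not a universal fix.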
Yeah.
And there are actually more interesting problems, too. For example, let me now ask you a question: what do you think, does debug information affect performance? Like, okay, we pass the -g option to the compiler, and it will emit all the sections with debug information in the binary.
I feel like it must have some effect
because it's making the binary larger
just because of that reason.
Then we start thinking that the debug information just stays in the binary, and at runtime it's not even loaded into main memory, right? It just stays on the disk.
I mean, well,
probably, in an ideal world, it should
not affect performance, right?
That sounds like a trick question, honestly.
Right. So, actually, just saying upfront, I saw cases, and I actually worked on cases, where you pass the -g option and... okay, so you're building the same application. You pass the -g option and build binary A. Then you build the same application without debug information, so without passing -g, and you have binary B. Then you strip the debug information from binary A, dump the assembly from both, and compare them.
Uh-huh.
And it's different.
Yeah.
Okay.
It's not that different, let's say, but there is still some difference. So for me, it's still kind of magic. I mean, there is no way it should be different. But I definitely saw the cases, and it definitely affects performance.
So, I mean, in an ideal world, the answer to my question is no, the debug information should not affect performance. But we're not living in an ideal world, so the answer is maybe, or it depends.
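The experiment Denis describes is easy to repeat; a sketch of the comparison using typical GCC and binutils invocations (adjust for your own setup):

    // app.cpp -- any nontrivial program will do for the experiment
    int work(int x) { return x * x + 3; }
    int main() { return work(7) & 0xff; }

    // Build the same source with and without -g, then compare the generated
    // machine code. In an ideal world the disassembly diff is empty.
    //   g++ -O2 -g app.cpp -o app_debug
    //   g++ -O2    app.cpp -o app_plain
    //   objdump -d app_debug > debug.asm
    //   objdump -d app_plain > plain.asm
    //   diff debug.asm plain.asm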
Is that because the compiler has to emit extra things
for the debugger to be able to know,
okay, this does have a memory address or something like that?
Is that what's going on?
I'm not sure. I'm not an expert here.
It's also still an open question for me.
But I tend to think that probably it could be just a bug or maybe some heuristic.
For some particular optimization, there is some specific heuristic that sees that, for example,
if my function is that long, I will do this.
But if my function...
And actually, all the debug information is stored, for example in LLVM, with some specific function calls, like the LLVM debug intrinsics, and metadata. So maybe some optimization just takes this debug information into account when it should not. So I'm not sure what the real answer to this is. Maybe some real compiler experts who are listening to this episode can comment on this problem.
Yeah.
Yeah.
Yeah.
So, yeah, in general, performance analysis is quite a tricky thing to do. For example, once I was working on some regression, a 5% regression, okay? And what I immediately spotted was that the good case had 50% fewer instructions retired, or let me say, just for simplicity, executed. So the good case executed 50% fewer instructions, so it should definitely be better.
And when I started looking into this benchmark, I saw the pattern. In the good case, there was just one assembly instruction that was doing, for example, incrementing the value in some memory location. So in x86 assembly it would be like inc, increment, and then some memory location. In the bad case, this exact instruction was just split into three assembly instructions, because it's still a read-modify-write operation: we are first loading the value from memory, then incrementing it, and then storing it back, right? So it's still read-modify-write. In the bad case, it was the same instruction, but just unfused, let's say: an explicit load, explicitly incrementing some temporary register, and then storing the value back.
So, yeah, when I spotted this, it was like, well, it makes no sense: there are clearly 50% more instructions executed, this should be the reason for the performance regression. And then I actually went back and consulted with one of my colleagues, who told me that, well, it's still the same amount of work for the CPU to do. So then what I actually did, I just put these assembly instructions in a tight loop and benchmarked only this tiny loop, with just the one assembly instruction in the good case and the three assembly instructions in the bad case. And it showed exactly the same performance.
So the thing here, what I wanted to say, is that you can easily be fooled by, for example, just the number of instructions that were executed.
Right.
And it's not that obvious that it can cause the performance regressions or gains.
Yeah.
And in my case, the fact that the bad case executed 50% more instructions was not the reason for the performance regression.
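The two code shapes Denis compares look roughly like this for a simple increment of a memory location; the assembly is shown as comments and is an x86-64 sketch:

    // The same C++...
    void bump(long& counter) { ++counter; }

    // ...can legitimately compile to either of these x86-64 forms:
    //
    //   "good" case, one instruction:      "bad" case, three instructions:
    //     inc  qword ptr [rdi]               mov  rax, qword ptr [rdi]
    //                                        add  rax, 1
    //                                        mov  qword ptr [rdi], rax
    //
    // Both forms do one load, one add, and one store's worth of work for the
    // CPU back end, which is why the 3x difference in instruction count alone
    // did not explain the regression.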
Yeah.
So it's always tricky. You always need to be
prepared.
You always kind of need
to
dive deep. You always need to
know how the hardware works
and stuff.
Yeah. Yeah.
All right.
Okay.
Well, Denis, I'm definitely going to put a link to your blog in the show notes, but do you want to share it with listeners who might want to read up more about some of these performance tuning examples you have?
Oh, yeah, sure. So I think you will probably just share it in the show notes.
Yeah, I mean, the link.
Sure.
So actually, I also wrote a number of beginner-friendly articles, starting from basic things like: what is profiling? What is, for example, instructions retired? What is the reference clock? How can you collect other performance counters, what can you do with those counters, how do you read them, how do you collect them? And a little bit more advanced topics, like what other capabilities the performance monitoring unit has and how you can leverage them. So, what you can do with perf, for example.
Yeah.
Well, where can people find you online, aside from your blog?
Yeah, so I think the best place to find me is on Twitter. My Twitter handle is @dendibakh, that's D-E-N-D-I-B-A-K-H.
Yeah.
Okay, cool. Well, it's been great having you on the show today, Denis. Thanks.
It was a pleasure. Thank you very much, guys.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook
and follow CppCast on Twitter.
You can also follow me @robwirving and Jason @lefticus on Twitter.
We'd also like to thank all our patrons who helped support the show through
Patreon.
If you'd like to support us on Patreon,
you can do so at patreon.com slash cppcast.
And of course you can find all that info and the show notes on the podcast
website at cppcast.com.
Theme music for this episode was provided by podcastthemes.com.