CppCast - Improving Performance

Episode Date: July 15, 2021

Rob and Jason are joined by Ivica Bogosavljevic from Johnny's Software Lab. They first talk about an open-sourced 3D game engine and C++ documentation tools. Then they talk to Ivica Bogosavljevic from Johnny's Software Lab, where he writes about methods to improve performance in C++ applications.

News
- Open 3D Engine
- Doc Tools for C++ Libraries
- CppCon 2021 call for submissions
- Second set of Meeting C++ AMAs

Links
- Johnny's Software Lab
- easyperf

Sponsors
- PVS-Studio Team: Switching to Clang Improved PVS-Studio C++ Analyzer's Performance
- Beta-Testing of PVS-Studio Plugin for JetBrains CLion
- PVS-Studio podcast transcripts

Transcript
Starting point is 00:00:00 Episode 308 of CppCast with guest Ivica Bogosavljevic, recorded July 14th, 2021. Sponsor of this episode of CppCast is the PVS-Studio team. The team promotes regular usage of static code analysis and the PVS-Studio static analysis tool. In this episode, we talk about Open 3D Engine and documentation tools. Then we talk to Ivica from Johnny's Software Lab. Ivica talks about ways to improve performance for C++ developers. I'm your host, Rob Irving, joined by my co-host, Jason Turner. Jason, how are you doing today? I'm all right, Rob. How are you doing?
Starting point is 00:01:21 Doing good. You know, summer, doing well. I don't know. You got anything going on? Uh, well, at the moment we have, um, more smoke than I would like to see from various forest fires. It certainly was worse last year, but waking up in the mornings, like, it's hard to open your eyes because of all the allergens and stuff, but, uh, otherwise it's doing well. That's good. That's good. Hopefully that gets cleaned up soon and taken care of. Yeah, yeah, we keep getting forecasts for rain, but it never actually happens, so that's unfortunate. Yeah. Okay, well, at the top of every episode I'd like to read a piece of feedback.
Starting point is 00:01:59 we got this message on discord from jack saying i I'd be interested in hearing from Robert Nystrom. He wrote two books that are freely available, and he's currently working on the Dart language at Google, I believe. It would be interesting to hear about his take on language design and evolution. He's also worked in the games industry, so has undoubtedly used C++ a bunch. And that also has a link to Robert's blog, which is journal.stuffwithstuff.com. You know, I find that interesting. He's worked a lot in the games industry,
Starting point is 00:02:30 so undoubtedly has used C++ a bunch. And I don't think I've ever shared this on CBPCast. But from time to time, I will attend the local C++, excuse me, the local game developer group in Denver. And the last time I attended it, it was virtual, and I mentioned, oh, I'm a C++, excuse me, the local game developer group in Denver. And the last time I attended it, it was virtual. And I mentioned I'm a C++ developer, whatever. And all of the other attendees were like,
Starting point is 00:02:53 does anyone still use C++ for game development? And I'm like, what alternate universe did I just step into? I'm guessing if it's like hobbyist game programmers maybe they're all indie games they're doing um you know they're not going to be using the triple a you know tools they're using unity they're not using like unreal engine which is c++ i believe it was so bizarre though to have them all respond that way and i'm like from the c++ perspective everyone thinks that that's the only thing c++ is used for like yeah it's funny though do you think they would at least be aware that even if they're not using c++ that is what's being used not at all games yeah that's fascinating although to be fair some other people from the
Starting point is 00:03:43 local indie games community are involved in my C++ meetup. So it was like, I had just gone there on the exact right day or something. Right. Right. Okay. Well,
Starting point is 00:03:53 we'd love to hear your thoughts about the show. You can always reach out to us on Facebook, Twitter, or emails at feedback at cpcast.com. And don't forget to leave us a review on iTunes or subscribe on YouTube. Joining us today is Evica Bogasovic. Ivica is a senior software engineer with 10 years of experience active in the domain of Linux and bare metal based embedded systems. His professional focus is application performance improvement, techniques used to make your C++ program run faster by using better algorithms,
Starting point is 00:04:21 better exploiting the underlying hardware and better usage of the standard library programming language and the operating system. He is founder of Johnny Software Lab, a consulting company that helps developers and development teams increase the performance of their software. He's also a writer for a performance-related tech blog, Johnny Software Lab. Ivica, welcome to the show. Hello, thank you. I'm so curious how you ended up with the name Johnny Software Lab as the founder of this company. So my name is Ivica.
Starting point is 00:04:49 It's just translation into English, Johnny. So there's a John and Ivan, and Ivica is just the diminutive of that name. And so I started because I was playing a lot with software. So it was like an experiment, and it all started like an experiment. So it's like the experiment and it all started like experiment. So it's like the English, English, English of your name. There are many people who,
Starting point is 00:05:11 uh, uh, who when moved to the U S they, they get their name translated. I think there is one, one, uh, you know,
Starting point is 00:05:19 Nikola Tesla. So he didn't get his name translated, but there is the other guy who is called Mikhailo Pupin. And I think his name in English was Michael. And he was a Serbian. He was really, I mean, Nikola Tesla is a special guy. I mean, he was like, if you meet him, you would be really shocked maybe. He was all about himself and his work.
Starting point is 00:05:40 And the other guy, Mikhailo Pupin, he was a really open, cordial man and he wrote a book in English about how he made, not a fortune, but how he made his fame starting from a small village in Serbia and then he moved to Austria-Hungary and then he moved to the United States. It's a great book, It's a good book because it shines with positivity. You know, you read it and you feel better about it. You feel better about life and about yourself. That's a good book. Yeah.
Starting point is 00:06:14 I don't like reading things where you feel worse about the world when you're done. Yeah, definitely. All right. Well, Avica, we've got a couple news articles to discuss. Feel free to comment on any of these, and we'll start talking more about performance and the types of topics you talk about on your blog, okay? Okay.
Starting point is 00:06:34 All right. So this first one, we have a GitHub repo, and this is Open3D Engine, which is an open-source, real-time, multi-platform 3D engine. And I believe this is a fairly new project. I think it was Amazon-driven. Jason actually did a video about this recently, right? It was Amazon's Lumberyard, which we've mentioned previously on the show,
Starting point is 00:06:57 but renamed and open-sourced. Okay. So, I mean, game engines are, I think, number one, number one important thing for performance developers because, you know, if you don't have those 30 frames per second, you cannot chip your game. And they have like a completely, their methodology is completely something different that you will normally meet in C++,
Starting point is 00:07:24 standard C++ development and the standard object-oriented paradigm so we're looking forward to i'm looking forward to see what will happen with that engine and i hope it will become as successful as it is as unreal engine is right now so i okay i was just gonna say it'd be a little bit of a game changer if there was a fully open source triple a game engine out there was a fully open source AAA game engine out there. Okay, you're looking at it from another perspective. I guess that's important, too. Does this mention any actual games that are out there
Starting point is 00:07:58 that have been made using this project? On the main website, it shows examples. The 03de.org the one link that we have in the show notes is to the github project um yeah like i see some screenshots from a game i'm not sure what games have actually been uh yeah good question i'll have to look into that well and what was you mentioned lumberyard which i guess we have talked before on the show when did that um how long has lumberyard been around i don't recall so i will say you know i've done a few of these where i review some project that's been recently open sourced on my videos just because it's a fun thing to do and you know honestly it's a little bit of clickbait because if it's something that's in the public eye at the moment then i'm going to get more people coming to the channel uh for the first time ever though i had
Starting point is 00:08:48 four developers on the project respond on twitter including the lead developer for the games technology at amazon they're like thanks for your feedback we hope in a few months we can uh you can come back and see all the progress we've made on the game engine. Oh, very cool. Maybe we can have them on sometime, too. Yeah. Okay, next thing we have is a blog post on 60fps.io, and this is DocTools for C++ libraries. This is a GUI toolkit 60 FPS, and they have done some analysis on
Starting point is 00:09:29 what would be the best documentation library for their usage. And I'm not... I think we probably have talked about some of these before. We definitely talked about Doxygen before, but the other two are Hide and Standard Ease that they kind of review in this post. And then they talk about Doxygen combined with Breathe, Exhale, and Sphinx,
Starting point is 00:09:50 which is what they ultimately wound up using as the best tool for generating documentation for C++. Do you have any thoughts on this, Evita? I'm a standard user, so normally CPP. So about the documentation, I mean document API should all be simple enough so you can understand it and that's the goal. Documentation and really find it especially that kind of generated documentation really gets I mean we all share that. It really gives you the information you need. You always, if you go to the C++, that website, C++, what's the name of it? CPP reference.
Starting point is 00:10:32 Yeah, CPP reference. You always scroll down to see the code example because you actually understand what the thing is doing. So all the description there is just for reference, let's say, and two or three diagrams which explain how the API should be used are most of the time enough. I did find the hide tool sounded interesting because that is written in C++ and just generates everything from your header files, and I think it then allows you to edit it
Starting point is 00:11:04 if you want to add like examples or just additional information that it can't generate from the header files itself that sounded like interesting and i thought the standard ease one sounded familiar so i went and checked it out and it's uh started by jonathan muller with contributions from manu sanchez and tristan brindle and like wait a minute these are all people we've had on the podcast not all of them several of them John McFerralline
Starting point is 00:11:31 Victor Zivervich yeah all right next thing we have is the CPPCon 2021 call for submissions and this went out a couple weeks ago but I guess we haven't actually mentioned it before on the podcast even though we have I don't think we have. Some recent
Starting point is 00:11:47 CPPCon news like we mentioned the the trip. Yeah and I wanted to get this out here real quick because the end for the submissions deadline is the 19th so that will be like two days after this episode goes live
Starting point is 00:12:04 basically so if you're listening to this and you haven't made your CPPCon submission So that will be like two days after this episode goes live, basically. So if you're listening to this and you haven't made your CPPCon submission, go ahead and do that. You can do both virtual or in-person submissions, and they will pay for your travel if you're allowed to travel to America in October. Right. You went to CPP or gave a talk at CPP now. Is that right? You have any plans for this? Yeah.
Starting point is 00:12:32 Yeah. And I have submitted for CPP con. I'm looking forward to it. First, it's a really famous conference. There are huge amount of really expert who know this stuff. I never went there. So it would be a great opportunity.
Starting point is 00:12:46 See other people, see how other people think, meet other people. So I'm looking forward if my talk gets accepted. So you're trying to do an in-person talk? Yep. Okay, and then last thing we have is Meeting C++ announcing a second set of Ask Me Anythings for Meeting C++ 2021, and that'll include Sean Parent, Daniela Engert,
Starting point is 00:13:07 and Kevlin Henley, who we've all had on the show before. So that's pretty cool. These AMAs I'm sure are a lot of fun to be able to ask, you know, anything of these speakers, prominent speakers. Okay. So you mentioned a little bit about how, you know, you came to name Johnny's Software Lab. Do you want to tell us a little bit about how you came to name Johnny Software Lab. Do you want to tell us a little bit more about the blog and how you got it started and everything?
Starting point is 00:13:30 So I'm in software development for 10 years and mostly doing embedded stuff. And from time to time, there is some performance-related problem. I mean, everybody has those problems from time to time. And normally, I really enjoy that problems and I'm good at this low-level stuff and understand a lot. And I had an idea to start writing a blog and doing some experiments measuring performance under various loads, various circumstances.
Starting point is 00:14:02 And after some time, I got approached by this company called Apentra for creating their own software that's called Parallelware Analyzer, and it's software that people who are interested in performance can use to speed up their code automatically or through recommendations. So they approached me and asked me if they want me to start working can use to speed up their code automatically or from the recommendations so they will approach me and ask me if they want if they want me they want me to start working for them as a performance specialist so i joined them and then that's the where the story started so that's when i started the company in the meantime there was a few times i helped other people with uh with not the appentra
Starting point is 00:14:42 other people who had also performance problems i helped them solve them and it's moving in the right direction. So it's enjoyable. It's really nice when you know your expertise and when you can solve problems that maybe a typical engineering would take like two months to solve and you can solve them in a few days or maybe even a few hours. So it's a, it's an enjoyable endeavor and I'm enjoying it. I'm enjoying it, I'm learning
Starting point is 00:15:05 all the times, I'm learning new stuff I'm helping other people and I'm also helping the community so I'm writing posts on this blog so these are not like short posts they have a lot of content, it takes me time to write them but
Starting point is 00:15:21 yeah it's moving in a nice direction, I like it I don't believe I've ever heard the job title performance specialist before. That's, I think, a new one to us here. It's, I mean, you need to give you, so how would you describe what I'm doing? Yeah, I mean, it makes sense. It sounds like an accurate title, yeah.
Starting point is 00:15:41 Yeah. Yeah, so normally this type of work will be ideally i would like consultant to do consultant work so not working on a steady project for a for a single company i want to work i want to work with several companies and help them in a limited time span maybe five ten days 20 day stops to help them for example either with debugging or maybe writing some performance critical software giving advice trainings anything like that that is performance related and I can that I know well that I can help with so that's that's classical that's something called consulting you work as a consultant the problem nowadays is that people use the word consulting for
Starting point is 00:16:23 meaning something else it's just a contractor. So if you go to a firm and you hire 20 engineers, they're called a consultancy, but they don't have the expertise. They're just general-purpose developers. So it's not the same, but the term gets used differently. I like that because you're you're setting uh right out the gate you're setting expectations you're saying this is what i do and don't expect me to do anything else basically in a sense in a good way because you're limiting it to the things
Starting point is 00:16:58 that you're interested in doing yes yes and the customer the, on the other hand, what he gets is he gets, I mean, he gets a good bang for his buck. So I can typically debug problems that most of the engineers don't know they exist. I know a lot of, for example, fancy data structure, how to customize them so you get two or three times better performance and so on. So if you have a problem with your software you cannot ship it so it's a good it may be a good idea to talk to me
Starting point is 00:17:30 and maybe i can help it's a interestingly um a company that i was at a couple of weeks ago was i think the first time ever that they didn't have performance in their list of like top three or four uh concerns and writing C++. They were much more concerned about best practices and correctness, which I'm cool, I'm on board with that. But it did just make me wonder now, as you were talking, what industries do you find come to you for performance help? So performance is a complex topic,
Starting point is 00:18:03 and I explicitly said that application performance. So it's part of the performance where you set up your operating system so it works under high loads. For example, a good example is Brandon Gregg, who wrote that book. It's called System Performance. So that's one part of the performance. The other one is design for performance. So when you're designing your software, for example, you don't want to have components that talk a lot without another.
Starting point is 00:18:32 So that's related to performance because every time there is a talk, maybe it goes through the network and then you lose some time. But there is also on the level of one component how it works out. For example, my typical clients and the people I'm talking to are people who actually do some kind of it's they're not like for example let's make an example image processing video processing audio processing game developers machine learning computer machine learning communities for example speech recognition computer vision Internet of Things so where they have these small
Starting point is 00:19:09 devices which need to be fast so these are my target audience sometimes sometimes the talk about performance starts at the design but I cannot as a consultant cannot fix those problems because they require that several teams cooperate. So different components require different teams to cooperate. So I don't fix those stuff. I just work on a level of one component. So does it ever happen? We're probably getting off track from our plan here, but whatever uh does it ever happen that um a company brings you in and you say you
Starting point is 00:19:47 look at their stuff and you say i'm sorry but it's going to take a re-architecture of your system to get the performance that you need uh so uh normally no because uh because the type of people that contact me um they so when you start working on performance you have a you have to have a repro so and normally people are the people are when when it comes to performance the writers of the algorithms themselves they know when they're doing redundant work so they're doing something that they need to do they know how to fix this stuff but on the lower level there are some problems that they don't know for example they. They know how to fix this stuff. But on the lower level, there are some problems that they don't know. For example, they don't know how to use computer hardware in an efficient manner. And we'll talk about that later. So that's one of the things
Starting point is 00:20:34 where I can help. Okay. Performance is about when you get... So normally you want to write code that is portable, that is easy to read. And that's, I mean, all the code should be written like that. But sometimes that's not enough. If you have a low embedded chip and you want to ship your product, but if that means that you need to pay instead of $2 for a chip, $3, then for 100,000 chips, you lose $100,000. So in that case, you're willing to do those low level optimization that are maybe for one chip only or for one family of chips, and you're willing to make some sacrifices.
Starting point is 00:21:12 Right. So if I understand, you're saying your clients tend to come with to you saying we have this specific problem, we know how to reproduce it, we need this thing faster smaller whatever and you say i got you yes yes okay cool uh could we maybe go into a little bit more detail about like the types of performance improvements usually wind up making that could be you know useful to some of our listeners maybe something they're not aware of that they should look out for uh i mean the topic of c++ performance is a huge topic and the way that the i mean the writers of the++ performance is a huge topic. I mean the writers of the STL library, when they were writing their library, they were writing a library that should be used.
Starting point is 00:21:52 STD map should work correctly for map which has 10 entries and which has 1 million entries. And STD tree should work under this and this. So they're covering a huge array of possibilities. Now, C++ code most of the time works correctly, but sometimes you need to, let's say, fine-tune it. Sometimes they're just typical bugs. For example, somebody forgot to set up the initial bucket count for a hashlet and it initializes for a long time.
Starting point is 00:22:26 But sometimes it's like you need a data structure, an allocator, and those two needs to work in a really specific way in order to get the peak performance. I mean, for example, let's say like this, one decision that was making C++ is that the std map is a red-black tree, which is the binary tree. But binary trees are not that good for performance because of the data cache misses. If you have like an energy tree with five, four, for example, instead of two, if you have four values in one node,
Starting point is 00:23:03 that means that you will have less cache misses because the tree is shallower. On the benefit side, also additional is that the good thing also about this is that the data is fetched from the memory in blocks of 64 bytes, typically 64 bytes. And you're bringing memory data to the CPU, but you're not wasting memory bandwidth. So you're not wasting data on just bringing stuff that you will never use. If you focus on this, if you understand how this works, it can have a really good impact on performance.
Starting point is 00:23:39 One of the things that seems to be coming up, maybe I'm misunderstanding what's going on sometimes. But, you know, I had a computer science curriculum in 1996 is when I started, graduated in 2000. And I feel like in so many cases, you mentioned this red black tree, like from a computer science, like big O notation, kind of whatever. We're like, oh, well, clearly a linked list linked list you know is going to be better performance if you need to do x and y because that's less moving memory around and i feel like a lot of the things that i was taught just don't hold true on modern hardware yes yes yes yes uh so uh i mean there depending on what you studied so there's computer science and computer engineering and software engineering, but computer science, they use the linked list a lot. And then you have
Starting point is 00:24:28 graduates that say, but why don't we use linked list everywhere? I mean, linked list as a data structure, if it's huge. Again, there is a lot of, let's say an example of linked list. It's a nice example because we can talk about it and we understand what's going on. So each element of the linked list is one call to the allocator, right? So if you have 10,000 elements you'll have 10,000 calls to the allocator. So the allocator is actually your memory gets fragmented. This doesn't happen with vector. So that's one thing. So just a problem of memory allocator. The next thing is that
Starting point is 00:25:00 when you're traversing linked lists you're actually referencing pointers. Now if you make a setup, you make a test environment where you have a compare vector with linked lists, and you say, they're almost the same. Maybe linked lists are a bit slower, but they're mostly the same. But what happens is that in your little environment,
Starting point is 00:25:20 when every call to malloc will return the next chunk in memory. So basically you have continuous parts of memory. But in a real-world system, environment when every call to malloc will return the next chunk in memory so basically have continuous parts of memory but in a real world system when a lot of things happen there is no guarantee that will happen and worse worse if you need to traverse a vector or need to traverse the linked list of the same time worst case scenario for linked list can be 10 times slower than the vector because of the data cache i wonder like how often, yeah, I know you said sometimes it's relatively simple things that you have to solve.
Starting point is 00:25:48 How often do you come in and you're like, why are you using only linked lists here? Let's just go ahead and fix that real quick. I mean, programmers is always makers and so on, but that doesn't happen often, but it's a good example to illustrate the problem on modern hardware that most people don't even think about it. So you say, okay, I need a data structure
Starting point is 00:26:11 where I can remove elements randomly. BlinkList will do, okay? And then what happens? I mean, why is this slow? The people in game development, they have a different data structure. It's called Colony. I don't know, probably heard about it. So it's a data structure that
Starting point is 00:26:31 it's vector-based, but you can add, remove elements. The data structure is unsorted, but it will guarantee you a good performance for the data caches. And it will also guarantee that you can quickly insert or remove elements without any problems. And they use this data structure, and this data structure, as far as I know,
Starting point is 00:26:51 is currently being standardized. Yeah, I think it was probably, what, four years ago or something, we had Matt Bentley on, I believe, to talk about his Colony proposal, and it's been a long time coming. C++ is not in a hurry it exists for 25 years it will exist for some time more yeah so you gave a whole talk at cpv now which i think i mentioned um that you were there recently uh on why we should be avoiding dynamic memory allocation
Starting point is 00:27:22 for performance improvements it's something we kind of talk about a little bit on the show fairly often, but maybe you want to go in a little bit more detail and kind of remind our listeners why dynamic memory should be not allocated as much as possible. So the problem with dynamic memory is that there are several problems with dynamic memory. First, you have the problem of memory fragmentation. And as your system is running, it gets slower and slower.
Starting point is 00:27:49 If you're just allocating, releasing, allocating, releasing memory, it will get slower, slower, slower. And you can even measure that. If you have a long-running embedded system, they have even this catch to fix memory fragmentation slowness by restarting the system. For example, your television box at 3 o'clock in the morning will restart. Maybe you're not aware of that. Mine restarts at 3.18.
Starting point is 00:28:17 So every time you turn it on in the morning, you see 3.18. Now, that's one of the ways. I mean, that's a really creative way to solve it. It's a nice way to solve it, and creativity is all about software. I heard, for example, that in the... So I'm not... This is what I heard, so this is... Don't take it like a truth, take it with a grain of salt, but in the airplane industry,
Starting point is 00:28:41 they have components like small... So you have embedded chips, you have many of them, and they can restart within like 15 milliseconds, but they also can restart. And these things are safe to use in this life-critical situation. So this is the way how to fix issues. Now, the problem, again, with the allocation is that you have memory fragmentation. That's one. So you're calling a lot of, calling allocate a lot of times. That creates problem. Then when you do it like this, if you're allocating smaller chunks of memory, that will probably mean that you will jump around memory, which results in data cache misses. Binary trees are binary trees and so trees and std map and std set and std linked list are notorious because they allocate for each for each data they will for each node they will
Starting point is 00:29:31 allocate one chunk and if you have huge data structure they can allocate a huge amount of data chunks. Now I mean maps std map exists for reason You want to quickly fetch the data, but sometimes you don't want that stress. So if you use a custom allocator, for example, you're limiting, you're allocating from a dedicated pool. And that means that when the data structure is destroyed, you can get rid of the block of memory that you allocated your data from, and you don't have any memory fragmentation and all your data is local. so it increases data cache hit rate. So it works better. So custom allocation strategies to give one big block of memory
Starting point is 00:30:12 that you then know you can allocate from a contiguous chunk. Yeah, so there are two types of allocators, the STL standard allocator. In that case, for example, if you have several instances of STD map, they will all allocate it from one large block. But on the other hand, there are per instance allocators. So each instance gets its own allocator. That makes sense. For example, if you have a huge binary tree, a huge tree in a database, like in memory database.
Starting point is 00:30:41 In that case, it makes sense to have a dedicated block for each instance. That will keep data locality at bay. It decreases data cache miss rate, TLB miss rate, so it works faster. Out of curiosity, do you ever use the PMR allocators from C++17? Those are... No, I know what they are.
Starting point is 00:31:02 I haven't used them. So I found out about them quite late and I already had most, I already covered most of the things that I was interested in. So I didn't have an opportunity to use them. Sure. It is like PMR allocators are like, they're not well known. I mean, you need to dig them. I mean, you don't hear about them. You
Starting point is 00:31:25 don't have an idea what they are capable of, what are their limitations and so on. So they're just like something you put on the side and it waits there. I don't know why it's like that. Part of the problem is Clang's libc++ still doesn't have
Starting point is 00:31:42 PMR implemented from C++17. I have a little bit of a sore spot with a few C++17 features that haven't been implemented yet. That's one of them. You mentioned different ways of some techniques used by game developers. I think we've done an interview a long time ago with some game developers talking about data-oriented design. Is that something you advocate for
Starting point is 00:32:11 as a potential performance improvement? So I personally don't have a lot of experience with data-oriented design because I never worked in a game development. So it's not even game development, it's a game engine development. As you said, Jason, most people nowadays don't use C++ for game development so it's not even game development the game engine development most as as you said jason most people nowadays don't use c plus plus for game development
Starting point is 00:32:38 which was also surprised to me um what happens is that um again in game development if you're not running at 30 frames per second uh your game cannot chip. And then you need to think about performance from day one. So they, through trial and error, they devised a new way of thinking about performance. Now, the game developer, game development is a specific kind of development. They have these worlds with huge amount of objects, and they need to render them 30 times per second. So they need to change their states 30 times per second now what they're doing is they they got rid of polymorphism and they're doing type based processing so they first there there is no like a base class that's uh like object in the game then you have a player then you have a gun and so on they have their first processing uh
Starting point is 00:33:21 first processing uh guns then they're processing people, game character, they're processing cars and so on. So this has an advantage because it uses CPU hardware more efficiently. There is a Lex instruction caches, you can keep all these components in vectors, you don't have to, it saves memory, because for example, STD variant doesn't save memory. So it saves memory and the system is generally more responsive. So according to them, it's like four or five times faster than if you take a standard object oriented approach. But again, they're working in a specific environment. I'm not sure if it's possible to implement a compiler using data-driven design. So I'm just curious. The people in game community, they're like a small sect
Starting point is 00:34:13 because not a lot of it leaks out. So yeah, type-based processing is a good idea. If you have a performance problem, you should definitely look at that side. So you don't want to, you want to sort all your objects per type and then you process them. That keeps the instruction caches warm and generally the system is more responsive. This reminds me of your C++ Now talk, which I watched part of, but you're talking about avoiding vectors of pointers in there.
Starting point is 00:34:45 And it sounds like a similar topic. Is that something you want to discuss? Yeah, sure. So again, pointers, vector, so C++, when it was invented, the polymorphism was done through pointers. So that means if you have a vector of one million objects, you need to have one million calls to malloc. That creates memory fragmentations problem. That creates memory fragmentation problem,
Starting point is 00:35:05 that creates slows down, it doesn't use memory efficiently as we already explained. Now, what happens is that you also get data cache misses. In the 90s, that wasn't that such a big of a problem, but now it is. Now, using alternatives, for example, again, type-based processing, if you can use unsorted containers, then you should have std vector for each type. That's the perfect way
Starting point is 00:35:32 to do it. In that case, you don't need virtual functions, and the compiler is better optimizing because it can inline stuff. What else? You're using memory more efficiently. So, and the program will be happy about it. So the performance will be good. Now, if you cannot do that because you need sorted containers, then you should try STD variant. Now the problem with this STD variant is you have one large object and all the other smaller, then you're using a lot of memory right again performance is uh when you're down to that level so there are some like general guidelines about performance and there are specific things when you are inspecting and seeing what's going on so sometimes you can change things other times you cannot so vector shared pointers is a vector shared pointers vectors shared pointers are generally bad for performance
Starting point is 00:36:27 because they suffer potentially from many data cache misses. And if you compare them with vectors of objects and vector of pointers, you can sometimes see, again, improvement in speed of up to 10 times. But you also mentioned virtual function calls just now. Do you feel like object-oriented design in C++ is kind of counter to performance? Object-oriented design as a paradigm...
Starting point is 00:36:52 So C as a programming language was like a high-level assembly. That's how it was conceived. C++ introduced object-oriented paradigm, but the paradigm itself doesn't have anything to do with hardware. So it doesn't have anything to do with hardware. There are no concept of class in hardware. In hardware you have
Starting point is 00:37:10 integer, you have character, you have double. So there is no class. Next thing is you don't have... virtual functions are just like function pointers where each of this can point to a different function. So this is the essence to a different function. So this is the essence is a virtual function. Now, in 2021 hardware, you have the problem. The problem, if you're changing the function you're calling, you're iterating through a container, then you're calling this function, then the other function, the other function, then the other function.
Starting point is 00:37:45 They're all different functions. The instruction cache in your CPU has a limited capacity. And the function you're currently executing evicts the code from the previous functions. You're always running code that is never in the instruction cache. That's one thing. The branch predictors are always cold because you're always running a new code. So that's a second thing that's bad. Now, the problem of pointers, that's the third thing. So the virtual functions are not bad by themselves,
Starting point is 00:38:13 but the way they are used to typically can create a lot of performance problems. And that's why the people in game development and also high-frequency trading communities development, they are avoiding them as possible. Interesting. One design idea that I've kind of thought sounded interesting, and I don't know where I may have heard or first thought of this idea,
Starting point is 00:38:37 but I've also never seen it put in use, would be like kind of a hybrid approach where you do use the object-oriented hierarchy, but you prefer having vectors of specific types. So then if you need to, you can still pass this, you know, generic thing with a virtual function call interface somewhere and use it that way. But otherwise, you prefer processing your vectors of known types
Starting point is 00:38:57 and avoiding those costs. I'm just curious if that's an idea that you've seen and if that works or not. Yes, yes. So that's an idea that you've seen and if that works or not. Yes, yes. So that's an idea that can work. So for people who are suffering from performance issues, that's an idea that can actually help. Oh, okay.
Starting point is 00:39:14 That's cool. I want to wrap up the discussion for just a moment to bring you a word from our sponsor of this episode, the PVS Studio team, who develops the same name static code analyzer. The tool detects errors and potential vulnerabilities at the earliest development stages and thus reduces the cost to fix bugs. PVS Studio analyzes code written in C, C++, C Sharp, and Java. The team has two exciting new announcements about the analyzer development to share with you. First, PVS Studio developers increased C++ Analyzer for Windows performance by more than 10%. They simply switched from Visual C++ to Clang. Follow the link to the article in the description to get some more details.
Starting point is 00:39:55 And another important news, the team has released the beta version of the PVS Studio plugin for CLion. Beta testing is already underway. You can also try the new plugin. Click on the link in the description to find out how to do this. The PVS Studio team would be grateful for your feedback and bug reports. This will definitely help make the plugin better. There's a lot to say about performance. So let's say like this, modern CPU is designed, for example example to effectively do video processing or audio processing because those that data is laid out in vectors that data is laid out in vectors and each element of the vector is a simple type and computers are really good at these things when you do it like that you can count on vectorization so this is a vector vector. The CPU can process more than one data
Starting point is 00:40:46 in a single instruction, like four doubles or eight doubles and so on. And the CPU is the computer's actually designed to do this kind of software. And actually, if you think about it, most of the things that your computer is running now is doing some kind of video processing, audio processing, displaying
Starting point is 00:41:01 your browser. So it is actually doing that kind of stuff. But, for example, imagine a software, general-proposal C++ software with classes. Now, class itself doesn't translate well to what CPU does. So what kind of CPU, the task that CPU actually is good at it. So in that case, you can expect some slowness. Now, for example, does the class size,
Starting point is 00:41:32 your processing objects, and does the class size influence performance? So that's the question. Now, the answer is yes, and it's quite dramatically. Normally, if your class is 128 bytes and you increase it by two, you can expect a loss of performance. So it will run like 0.7, 0.6. So it will run 50 or 60% times slower.
Starting point is 00:41:55 So what happens is that, so as I already told you, the data is moved from the CPU to memory in the blocks of 64 bytes. Now on modern CPUs, this is called memory wall, and the CPU has to wait for the data. Data cache misses are extremely expensive in the modern CPUs, so they take like 300 cycles to complete. That's 300 instructions that you could use, and that's the time that you could use better. Now, what you need to know is that inside the class,
Starting point is 00:42:25 if you're processing a class and it's on a hot path, if you remove everything from the class that is not related to that processing, all the data members, if you remove them out of the class and put them somewhere else, you will increase the speed. Why? Because the CPU is not transferring data from memory to the CPU that actually is not used. And you get better speed.
Starting point is 00:42:47 Keeping everything in vectors of objects, not vector of pointers, is also one way to make it faster. Okay? So that's also one way to make it faster. For a typical C++ developer, type-based processing, when you are just processing all objects by types, is also one way to make it faster.
Starting point is 00:43:05 By the way, I was surprised that C++ doesn't have like a multi-vector container, which is really easy to write. So you, it's a vector, which can hold several types, but internally each type gets its own vector. And then you can use templates to do all kind of,
Starting point is 00:43:20 all kind of, all kind of data manipulation on this. Yeah. So they don't have it. It's not a big deal to write it, but it's nice when you have it as part of the library. And it enforces this nice development, at least from the performance,
Starting point is 00:43:38 but nice development ways of thinking. So these are the things that actually do matter with performance. And C++ is not the best about that. So these are the things that actually do matter with performance. And C++ is not the best about that. Even though it claims to be a really language for performance. I'm curious, you mentioned vectorization. How often can you actually rely on the compiler to vectorize your simple data processing stuff? So vectorization, that's, okay, that's again another story.
Starting point is 00:44:09 It depends on the compiler. GCC and Clang follow each other in that regard. So I did research for a Pinterest just today, this morning, and GCC and Clang are mostly close to another. Intel has a compiler that is better at vectorizing and it is often using high performance environments, but that compiler is not really
Starting point is 00:44:29 fully C++ compliant. People are complaining about bugs and so on, so it's not typically used a lot. Now, if you have a simple processing, you can rely on vectorization for the compiler to vectorize your code automatically as certain pre-assumptions have to be
Starting point is 00:44:46 made. First, you need to have vectors of simple types. Integer, char, doubles, and so on. No classes. If it's with classes, it doesn't work. Because of the memory inefficiencies, the vectorization doesn't pay off and the compilers don't do it. Now, if you have a class and you want it vectorized, then you can move from that array of structures to structure of arrays. So that's one way to do it. Now when you have simple classes and you're iterating over vectors, you need to iterate over your vector linearly. 0, 1, 2, 3, 4, 5, or the other way around. You cannot skip elements like 0, 2, 4. The compilers don't vectorize typically well that code. So that's the second presumption. Next, you have a for loop.
Starting point is 00:45:25 While loops, they don't vectorize that well. Inside for loop has to have, before you start your for loop, you need to know exactly how many iterations will it have. So you cannot change the trip count of the loop depending on the data. So that's one thing for the compiler to be able to vectorize your loop properly.
Starting point is 00:45:45 Next, inside the loop you cannot leave the loop like with break, with goat or with break. If you do it, it doesn't vectorize or it vectorizes poorly. Next, inside the loop you don't have to have, you should not have conditional processing. If you have conditional processing, that means this element has to be processed with these instructions, and the other element has to be processed with the other instructions, so you cannot do this bulk processing 4x4 or 8x8. So these are all the conditions. If you fulfill all of these, then typically the compiler will be able to vectorize the fish.
Starting point is 00:46:18 Okay. All right. So I'm going to ask a more specific question, which I think maybe falls a little bit into the category of trying to get free consulting advice here. But the project that I'm working on, there's loops that they know are being vectorized because they verify this with, you know, compiler warnings and stuff and looking at the assembly and some people in the organization are saying well let's change everything that we're doing from double to float because then instead of getting four way vectorization we'll get eight way vectorization okay and uh and they're expecting you know a measurable performance impact and i'm just curious if you would expect a measurable performance impact in that case yes yes um yes. There are several reasons. So
Starting point is 00:47:06 there is not only one reason. So the first thing is that you get double memory bandwidth. So your vector is two times smaller. Right. It's faster. So if you're processing single, double, or float, doesn't matter. But if you're processing a vector of floats, it should be faster. So the other thing is if you have a fast math enabled, so this is a compiler flag called fast math, and then it enables some kind of approximate instructions that are faster. If your code is using square roots or divisions,
Starting point is 00:47:37 it will work faster, but it will be less precise. Right. So, yeah. So you should expect performance improvement when you switch from doubles to floats. Interesting. When you first start looking at, you know, a piece of code that needs some performance improvement, are you looking at the code itself,
Starting point is 00:47:57 or do you use any, you know, tools to try to find places where performance can be improved? So, again, often it happens that the performance improvement doesn't come from any hardware tricks and so on. If you do less work, you will get performance improvement. And sometimes the people who wrote the algorithms are actually the ones that should know that. But for some reason, they just forget.
Starting point is 00:48:28 When I come to that kind of problem, for me, it takes time to ramp up and to understand what the code is doing. So, I mean, everybody has that problem. For example, one of the ways to improve performance is to have duplicate memory.
Starting point is 00:48:43 If you have a class which is shared with pointers to two classes, and it is essential in both cases, you can get a copy of that class in two places. The normalization in database, you avoid the pointer referencing, you're avoiding data cache misses. Okay?
Starting point is 00:48:59 So that is the way how to do it. Now, there are tools that you can use to check for the hardware efficiency. If the data structure is how data structure algorithm hardware efficient, how is it hardware efficient? Now, the best tool it is, it's Intel VTune and Intel Advisor. So they're the best tools on the market. Unfortunately, they work only on Intel CPUs. They can tell you if a loop is vectorized.
Starting point is 00:49:28 They can detect memory inefficiencies and so on. Next thing is that I also like. It's called Liquid. Liquid is like a library. It's using high-performance world, and you can mark your code base. They have this marker API, and then you can see data cache misses and so on. This thing works on Intel, but it works on AMD and ARM also. AMD and ARM, AMD especially, it's not really well supported,
Starting point is 00:49:54 so you need to dig into that and see what's going on. Now, the thing that I have now currently is that by just looking at the code, I can guess if it suffers mostly from data cache misses, from poor vectorization, or maybe it's branch prediction misses, that stuff. Of course, you cannot like false share. You cannot detect that you need tools. But yes, these types of problems that come up often i can just by looking at the code i can see if there is a problem i know what to expect i know how data structures behave i know
Starting point is 00:50:31 how the compiler does its auto vectorization okay well evita it was great having you on the show today is there anything else you want to tell our listeners about before we let you go? I don't know. Performance is fun. So too bad not many people do it exclusively. There is this guy, Denis Bakvalov, who is writer for easyperf.net. And we are starting up performance challenge really soon in a few days. So people who are eager to try to do performance optimizations with other people so it's like a contest but not really because you need to cooperate with other
Starting point is 00:51:11 people and exchange it is it's it's a fun as experience really fun and people are should really should should join and take part i, there are no prizes except for participation and the knowledge you get home. So this knowledge is hard paid knowledge because it's paid in sweat. How do they find this? Sorry. Did you say easyperf.net?
Starting point is 00:51:37 Is that where they should go? Yeah, you can follow me on my blog. So there will be, I guess there will be links in the, I will also post it, but I'm not the host. Dennis is the host. I will post the notification or you can,
Starting point is 00:51:50 you can go to his easyperf.net and you can, there you can, you can subscribe to his mailing list. So he will, he's, he will be giving out a call. He will, he will be sending a call to,
Starting point is 00:52:02 for, for, for, for the contents in, in a matter of days. So probably when this, when this episode gets published, it will be sending a call for the contents in a matter of days. So probably when this episode gets published, it will be out. Okay. Okay.
Starting point is 00:52:11 Very cool. It was great having you on the show today, Wica. Thank you very much. It was really nice. I didn't talk too much, and I hope it was interesting for your audience. That's great. Thank you. Yeah, great.
Starting point is 00:52:21 Thank you very much. Take care. Bye. Thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a topic, we'd love to hear about that too. You can email all your thoughts to feedback at cppcast.com.
Starting point is 00:52:38 We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter. You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter. We'd also like to thank all our patrons who help support the show through Patreon. If you'd like to support us on Patreon, you can do so at patreon.com slash CppCast. And of course, you can find all that info and the show notes on the podcast website at cppcast.com. Theme music for this episode was provided by podcastthemes.com.
