CppCast - ThinLTO
Episode Date: October 30, 2020

Rob and Jason are joined by Teresa Johnson from Google. They first discuss the Qt6 beta release and a blog post proposing range_ref, a lightweight view for ranges. Then they talk to Teresa about ThinLTO, the scalable and incremental Link Time Optimization built into LLVM.

News
Qt 6.0 Beta Released
Range_ref
Rob and Jason AMA

Links
ThinLTO
CppCon 2017: Teresa Johnson "ThinLTO: Scalable and Incremental Link-Time Optimization"
Meeting C++ 2020 - ThinLTO Whole Program Optimization: Past, Present and Future

Sponsors
PVS-Studio. Write #cppcast in the message field on the download page and get one month license
PVS-Studio: analyzing pull requests in Azure DevOps using self-hosted agents
Why it is important to apply static analysis for open libraries that you add to your project
Use code JetBrainsForCppCast during checkout at JetBrains.com for a 25% discount
Transcript
Episode 271 of CppCast with guest Teresa Johnson, recorded October 29th, 2020.
Sponsor of this episode of CppCast is the PVS Studio team.
The team promotes regular usage of static code analysis and the PVS Studio static analysis tool.
And by JetBrains, the maker of smart IDEs and tools like IntelliJ, PyCharm, and ReSharper.
To help you become a C++ guru, they've got CLion, an Intelligent IDE,
and ReSharper C++, a smart
extension for Visual Studio.
Exclusively for CppCast, JetBrains is offering
a 25% discount on yearly
individual licenses on both of these
C++ tools, which applies to
new purchases and renewals alike.
Use the coupon code JETBRAINS
for CppCast during checkout at
JetBrains.com to take advantage
of this deal.
In this episode we talk about Qt 6 Beta and a view for ranges.
Then we talk to Teresa Johnson from Google.
Teresa talks to us all about LLVM's thin LTO. Welcome to episode 271 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
Doing okay. I just saw on Twitter that you were on another podcast getting interviewed. Is that right?
Yeah, the LeanPub FrontMatter podcast.
So I was being interviewed about the book that I published a couple months ago now.
Oh, cool. So they just interview lots of different types of authors. Yeah. I mean, I don't know who all
they interview because I didn't actually go back and look at their archives, but, um, uh, it's,
it seems to be mostly people who are, who have published books on lean pub. So it's the lean
pub podcast. Makes sense. Had a good time there. Yeah, it was a fun interview. It's weird being on the other side of the table, so to speak.
Yeah, definitely. Okay, well, at the top of every episode I like to read a piece of feedback. We got this tweet
from Corentin, who we've had on the show before. He said, listening to Ben
Dean on CppCast, I don't think Titus is a big fan of
UDLs, user-defined literals. I wonder why
we need more Tituses. And Titus actually responded to this saying, it's a very simple trick. Imagine
how a feature will work if everyone piles on to use it, especially in the face of collisions.
And then I guess he puts UDLs in the same bucket as global namespaces and macros.
Yeah. I know some people who are huge fans of UDLs
because they like the way that it simplifies some things,
but I've heard too much from compiler developers
and I can't bring myself to use them now.
Okay.
Well, we'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter,
or email us at feedback at cppcast.com.
And don't forget to leave us a review on iTunes or subscribe on YouTube. Joining us today is Teresa Johnson. Teresa develops
compiler optimization technologies at Google. She's an active contributor to the LLVM open
source project and designed the thin LTO scalable link time optimization framework. Prior to joining
Google in 2011, she developed compiler optimizations for the Itanium compiler at HP.
She received a PhD in electrical and computer engineering from the University of Illinois at Urbana-Champaign in 1998.
Teresa, welcome to the show.
Thank you. Glad to be here.
Itanium is a relatively short-lived technology, huh?
Yeah, it was great as a compiler developer. There's a lot of really interesting... It really relied on compiler optimization,
which was a neat architecture to be working on.
But then for a variety of reasons,
many of them not technical,
it ultimately didn't succeed.
So...
Am I correct to think, for some reason,
I feel like the Itanium calling conventions,
that's what actually directly influenced
our 64-bit
Intel calling conventions today? I think so. Okay. Yeah, because I often see references to the Itanium
calling convention even in things that I work on now. So yeah. Interesting legacy there. Very cool.
Okay, well, Teresa, we got a couple news articles to discuss, and then we'll start talking more about your work on LTO and other things you're doing at Google.
Okay.
All right.
This first article we have is about Qt 6.0 beta being released.
And I actually have a separate link here with a list of new features that are available in Qt 6.
One of the big ones I noticed was that they are now building Qt with CMake.
I thought that was a big change for them, right?
Oh, I missed that.
I did read recently that they had officially dropped support for their...
It's some proprietary build system, right?
Qt build or something.
Hmm.
Yeah.
Sorry, what were you going to say?
No, I was just going to say there's lots of random just classes changed in places.
So I was browsing through it, but it was hard to get an overall feel for what someone should expect for Qt 6.
Although it does seem like a few things were removed.
There's a few comments on the first link with people saying, but you removed the three features that I require.
So, but they said those will be coming in a future update, I guess, or via package management.
I think they also mentioned
that there is a Qt 5 compatibility library.
So I guess if you're dependent
on some older legacy APIs that were removed,
maybe you can still access them that way.
They've done that a lot historically.
I think every version has had a backward compatibility library.
All right, and then the next article we have is about range_ref, and I thought this article, first of all, was just very good at explaining the motivation and use cases for spans. And then they go into, you know, what they see as some of the shortcomings and propose this new range_ref type.
Yeah, this one was particularly interesting to me because I just gave
a talk at my meetup talking about various techniques for making other kinds of view
like things, like stream_view and function_ref. And, uh, it's kind of funny to me because this is like the,
I solved a similar problem in a much more complicated way, basically. My solution,
I guess, is slightly more generic. Because this one, the main entry point is for_each; that's all you can do is a for_each over the range. Mine would let you get back
begin and end iterators. But yeah, anyhow. I took a look through that article. I thought it was an
interesting article, and it addresses a couple of issues we frequently see with compiler
optimization with people misusing, well, temporary lifetime issues, which are always fun to debug,
and also reducing the number of required template instantiations,
which they don't mention it there,
but a frequent problem that we see when we turn on higher levels of optimization,
particularly whole program optimization,
is people who have inadvertently defined template functions in their source files
and not in the header,
which often you can get by with until suddenly the compiler does something interesting
and then you get a link time error.
So, okay. And this, uh, last thing we have is an announcement on the Meeting C++ blog. And this is about, uh, you and I, Jason, we're actually going to be doing, uh, an AMA on the first day of Meeting C++.
First time I've done one of these,
I don't know about you.
It will be the first time I've done one as well.
And I don't think the actual time is announced.
Just that we will be doing it on the first day of the conference.
Yeah, I don't see a time listed here either.
And what is the first day of the conference? It's like what? First or second week in November?
Seems like something we should know if we're going to be doing it.
I guess, Teresa, you should know as well since you're one of the keynoters, right?
Yeah, and the Friday two weeks from this Friday is my talk, so it's like a day, a couple days before and after that.
Okay, so second week of November, I guess.
Yeah.
So that's the 13th then, I guess, is when you'll be giving your keynote, and Rob and I will be doing the AMA on the 12th.
What is an AMA?
I guess, ask me anything. Ah, thank you. Okay. I'm only familiar with it from Reddit. Oh, okay. Yeah. I don't know. It started on Reddit. Okay. So,
Teresa, as we said, you're going to be doing one of the keynotes at Meeting C++ in a few weeks.
And I think the main thing you'll be talking about is thin LTO. Could you start off by giving us an
overview of what thin LTO is? Sure. So well, let me start with just what link time optimization is.
So basically, link time optimization is a way of implementing whole program optimization.
So you know, typically, when you build your source files,
your C++ source files,
the optimizer works on a translation unit basis
and then you get native code
and you link it all together at the end.
So the optimization is limited
to what it can see within a translation unit.
So the source files and anything inlined from the headers.
So with whole program optimization, obviously you need to be able to see the entire program.
The advantage of that is you can do optimization across those translation unit boundaries. And I
might say module just because in LLVM parlance, a translation unit becomes a module. So if I say
module, that's what I mean. So the advantage of doing that at link time is, well, for one,
it doesn't change the user's model of compilation. So you still would do like a bunch of compiles
onto an object file, which when you turn on link time optimization will then not be actually native
code. It'll be the compiler's internal representation sort of hidden under the covers. And then when you link all those together, there's some mechanism that needs to be built into the linker. It could be a plugin, which is what happens with BFD ld and gold. Or it can be like with LLD, which is LLVM's linker; it just links in the compiler as part of that linker. And it basically hands those files off to the compiler.
And then the compiler can do something to basically see the entire program. So there's
no change to the user model. And then the other advantage of doing it in link time is that the
linker then can give information to the compiler about symbols, symbol resolutions. So which copy of a, say you have like template file
or template functions defined in headers,
you get like n copies of those.
It can tell you which one is prevailing.
It can also tell you if there are symbols
that will be exported potentially.
So it can tell you, for example,
if you're linking into a shared library,
it will tell you something different about then,
you know, if you are linking
into an actual statically linked binary, where it could actually tell you
for sure that, you know, which symbols are exported and which ones won't. So you can use
that link time information to basically do more aggressive optimization. So that's why whole
program optimization is frequently done as a link time optimization. So the way that that has been done traditionally
is you get all these objects,
which are the compiler's internal representation,
the linker hands it off the compiler.
And the easiest way to do the whole program optimization
is basically take all those IR files
and basically merge them all together
into one big internal representation
of the entire program.
And now I can see everything, right?
Well, that's great, except you have a bit of a blow up, both memory and time.
And for small programs, that might be okay.
But, you know, one of the issues we had in Google is that our applications are,
you know, I was sort of amazed when I moved over to Google,
and the biggest problem we had was like on the iCache side and the ITLB side. And basically,
the applications are so huge, they blow up, you know, they are hitting up against every limit.
And so, you know, when you get to the compiler, and you want to talk about whole program
optimization, trying to merge all those together is just infeasible. And I think one of our applications we tried to build with whole
program, like traditional LTO on a really beefy machine, and it just, you know, you hit like 60
gigs, and it's just still going. So it's totally infeasible. So thin LTO is a way that we
is a mechanism that we designed to basically do link time optimization in a scalable way so that we can actually build our applications with whole program optimization.
So the way that that works is that instead of merging all of your IR into one big monolithic blob, when the compiler builds those IR data files, it also computes a small summary of the module,
things that you might need for a whole program optimization.
So call edges, which symbols are defined,
what their linkage type is, which ones are referenced,
and the edges between those things.
And then when the linker,
instead of passing off all of the IR
to the compiler to merge, it just merges all those summaries. And so they're much smaller.
And so they don't blow up in memory. And you do all of the whole program analysis on the summaries.
And it's much, much faster. And then at the end of that analysis, all of the translation units
can continue to be optimized independently with
information that that whole program phase passes off to it. So you basically still
maintain the parallelism of compiling each translation unit independently, but with
certain information from this very thin whole program step. And one of the things that's really key to performance
is inlining. And so we want to be able to inline across translation unit boundaries.
So to do that, this whole program phase will decide, you know, based on like the full call
graph that it can build by merging the summaries, I can decide, you know, which functions from other
translation units I might want to inline.
And it passes that information off.
And so each translation unit, as it's compiled through the backend phase of this thin LTO,
will basically go and just grab the functions from other objects that it needs,
rather than pulling in the whole IR.
It just pulls in these smaller pieces.
And so you don't get the memory blow up.
So you get the parallelism, which reduces the time, and you get the smaller memory.
And then another benefit of that is that you can actually do better incremental compiles.
You don't need to, if you're doing like a traditional LTO where you've merged all the IR,
if you touch anything, you have to redo that whole big thing.
With thin LTO, you have to redo the thin link, which is what we call that thin sort of summary-based whole program phase, but it's pretty fast.
And then you only need to do sort of the backend pieces of the, you know, compiles that had any new information. So you
can cache basically your final native code and decide based on new summary information coming
in, new analysis information coming in, which actually have to be touched. So if you touch
one source file, you know, that might affect that particular source file's backend, but it might
also affect a couple of others that were importing functions from it. But it's much smaller than obviously doing the whole thing together. So in a nutshell, that's the idea behind thin LTO.
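For a picture of what that looks like from the command line, here is a minimal sketch with Clang and LLD (the file names and the jobs flag are illustrative, not from the episode):

```cpp
// lib.cpp -- one of many translation units in the program (illustrative).
int expensive_helper(int x) { return x * x + 1; }

// Build each translation unit to bitcode plus a ThinLTO summary instead of
// native code, then link as usual; the thin link and the parallel backend
// compiles run inside the linker (LLD here), with an optional jobs knob:
//
//   clang++ -O2 -flto=thin -c lib.cpp main.cpp
//   clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-jobs=8 lib.o main.o -o app
```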
What kind of performance benefits can you see
in general from using whole program optimization or LTO versus traditional linking?
So, you know, one of the biggest levers,
I mentioned this just a minute ago, is inlining.
That's one of the biggest performance levers
that we have in the compiler.
You get huge benefits from having inlining.
And when you can do it across these module translation unit boundaries,
that can be a really big win.
That's really, if you look at the sort of regular way
of merging all of the IR and doing LTO,
the biggest performance win is from that cross module,
cross translation unit boundary inlining that you get.
And so, you know, really the first optimization
that we implemented for thin LTO
was that sort of function importing that I talked about.
So decide which functions I need and just very quickly parse out, you know, from the other IR files, the pieces that I need and pull those in. And that actually gets you
the vast majority of your regular LTO whole program benefit and can give you really big wins
over just regular O2 compilation. So we see on average 5% performance improvement
from turning on thin LTO, but it can vary.
I mean, we've seen 10 plus percent
and it just depends on the application.
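To make the cross-module inlining concrete, here is a hedged two-file sketch (names invented): with plain per-TU compilation the call is opaque, while the ThinLTO thin link can choose to import and inline the callee's body.

```cpp
// math.cpp (illustrative)
int scale(int x) { return 3 * x; }

// main.cpp (illustrative)
int scale(int x);  // defined in the other translation unit

int main() {
    // Compiled per-TU at -O2, the compiler only sees the declaration and must
    // emit a call. Under -flto=thin, the summary-based thin link can decide to
    // import scale's body into this module so the regular inliner can inline it.
    return scale(14);
}
```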
So one of the first questions I had was,
this sounds like this became necessary
just because of the size and scale of programs at Google.
But a lot of us work on programs that are nowhere near Google scale.
Is thin LTO still a benefit versus traditional LTO
because of what you were saying with the incremental compilation?
Okay.
Yeah.
And in fact, it was kind of interesting.
When we proposed thin LTO, at EuroLLVM 2015, I think, is when David Li, who I work with at Google, and I proposed this model, we were thinking of
it in terms of, you know, addressing the scalability.
But we ended up collaborating with Mehdi Amini, who was at Apple at the time.
And he was really interested in it
from the incremental perspective.
And that actually became a really big benefit
for us as well.
But it was sort of scalability was our sort of first,
the biggest concern we were trying to address initially.
And so for a lot of Apple's code, thin LTO became a way to get that sort of very fast incremental build ability, more so than needing it from a scalability perspective.
Okay.
And then what sort of runtime differences are there between a program using traditional
link time optimization and thin link time optimization?
Like, I understand, you know, the incremental builds, that's a huge win for programmer productivity. But if you're making a release build where you're just compiling this once on a build server or something, is it worth it to still use thin LTO there if you can link with regular link time optimization?
So that's a good question.
And certainly it's going to be faster, but, you know, the smaller the program you have, obviously those benefits are going to be less overwhelming.
And regular LTO, I mean, the performance delta between regular and thin LTO was actually quite small.
I mean, we weren't able to measure it for our larger applications, but we were able to measure it across SPEC CPU.
And it's on average, it's
like very, very close. But there's certainly some things that we can't do without the full IR and
being able to, you know, see everything, you know, do this sort of iterative, you know, anything
that's like, really requires a lot of like iterative analysis and transformation on IR.
You know, that gets a little bit harder with thin LTO. We've done, rolled out a bunch of new optimizations with thin LTO, but it takes a bit more
work to get, there's pieces that are just harder to do with, you know, the mechanism where we have
like a thin whole program optimization and then the IR transformation. So if you have a small application that can build fine with full, you know, regular LTO, and you don't care about the incremental benefits, and you really care about eking out every last bit of performance, then, you know, maybe regular LTO still has some benefit. But it's, you know, not going to give you a lot of benefit over thin LTO.
Okay.
Thin LTO, that name, that's a clang or LLVM-specific technology, right?
Yes.
So it's only implemented in LLVM right now. I know reading through blogs of someone who works on LTO for GCC that they just talked about, you know,
potentially adding thin LTO there, but I don't think anyone has seriously started looking at
that. So this kind of super fast, thin link time optimization that you've done, there's nothing comparable to it in MSVC or GCC that you know of?
No. So I'm less familiar with the Microsoft compiler.
And GCC has a much more scalable regular LTO than LLVM's regular LTO.
Okay.
But it still has.
And it's been improved over the last couple of years.
But ultimately, it's working on IR in a sort of serial phase.
And so it does have a memory cost.
That makes it infeasible for us.
Okay.
We had an episode recently with some developers on Bazel.
Does Bazel have any specific support for thin LTO?
Or maybe the better question is, do build systems need to have any special support in order to use thin LTO?
So for in-process thin LTO, which is sort of the default mechanism, it's completely transparent to the user.
So it would be transparent to the build system. So for example, if you do
in LLVM, if you turn on LTO, regular LTO or thin LTO, you hand off all the object files to the linker, and under the covers it will do this LTO. So for thin LTO, what that does is
it spins off threads for all those backend compilations. So you don't necessarily need to have build time support.
However, one of the other big advantages of thin LTO, and the reason why we designed it the way
we did is that we have a distributed build system in Google. And with thin LTO, you can make that
a distributed compilation. So all those, so you do the thin link, and then you have all of these
backend, what we call backend compilations that take the IR and apply the whole program results,
and then, you know, build your final native objects. If you're doing that in process,
that happens in threads and then passes those native objects back to the linker, and you just get an a.out out, you know, like, you don't actually see that happening. But what you can do with thin LTO is break out those parallel backends and make those actually separate processes. There's a way through Clang to pass in the analysis results and do that thin LTO backend.
And so what we did was, and we worked with the Bazel team, to add support to add those thin LTO backends as separate actions.
So there's a, what we call in Bazel parlance, an LTO indexing action, which is the thin
link.
That's one process.
And then we have like N thin LTO backend actions and then a final native link.
And so those, all of those parallel backends get distributed just as as you would distribute your normal C++ compilations.
And so we get the benefit of that really high parallelism
in those backend compiles.
So Bazel does have support in there.
It is aware of the thin LTO compilation when you've turned that on.
And it has support for actually distributing and breaking out those backend compiles.
So yeah, so I worked with them quite a bit over the years.
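The rough shape of that distributed flow, as a sketch along the lines of the upstream Clang documentation (file names illustrative; the exact spelling of the linker options varies by linker):

```cpp
// a.cpp -- one translation unit in the distributed build (illustrative).
int helper();
int answer() { return helper() + 1; }

// Phase 1, on any build worker: compile each TU to bitcode plus summary.
//   clang++ -O2 -flto=thin -c a.cpp b.cpp
//
// Phase 2, the "LTO indexing" action: a single thin link that emits only the
// per-TU index and import files, no native code.
//   clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only a.o b.o
//
// Phase 3, N distributable backend actions that apply the whole-program results.
//   clang++ -O2 -x ir a.o -fthinlto-index=a.o.thinlto.bc -c -o a.native.o
//   clang++ -O2 -x ir b.o -fthinlto-index=b.o.thinlto.bc -c -o b.native.o
//
// Final native link of the already-optimized objects.
//   clang++ a.native.o b.native.o -o app
```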
Okay, so Bazel's got like some extra support to parallelize it really well.
But I guess any build system that uses Clang can make use of thin LTO?
Yes.
Okay.
Yep.
Very cool. And so there are other build systems
that have looked at adding support
for the distributed build.
So the, I'm blanking on the name,
but the build system used for Chrome, for example,
they are adding in, they have,
so thin LTO was on for Chrome,
but they're adding in support for the distributed.
I think they've added it in.
I don't know whether it's turned on by default.
And they use a different build system.
And I know that there are other companies out there who have looked at enabling distributed thin LTO,
but it kind of depends.
Some companies have their own custom linkers,
and so they have to plumb that support in.
Sorry, I don't recall if we've already addressed this, but how does the link time speed for thin LTO versus traditional linking compare?
So if you're doing that in-process,
thin LTO, obviously, more of your compilation
time moves into the link.
So, you know, you don't get the explosive growth you get with regular LTO, but you definitely
are going to, you know, you're doing more, you're doing the whole program analysis, and
then you're actually compiling down to native code, a whole bunch of files, and then doing
a final native link.
So obviously, like more of your compilation time moves to that part of your compile. If you break it out as
distributed build, you know, the final native link is actually faster with thin LTO. Because we've
simplified the code more aggressively. So one of the examples is, you know, when you have a template
code, that's the template definitions in your header files.
Normally you get N copies of that.
Every time you've used that header,
you get a, if that template function is needed,
you instantiate a copy of it in every translation unit,
essentially that references it.
And then you compile it onto object files.
You have N copies of that function.
The linker will duplicate that.
And so it decides which one's prevailing and it throws out the rest.
With thin LTO, because we have linker information coming into the thin link,
telling us which is the prevailing copy.
We actually, one of the things we do is,
all of the copies that are not prevailing, we actually drop them after inlining.
So we keep them around long enough so that they can be inlined,
but then they go away,
and so they're not even in your native object that gets generated.
So the linker doesn't have to do any more of this deduplication.
So you get not only, I guess, faster link time,
but one of the, actually,
this ended up being a nice side benefit
for us internally is again,
with these really large applications,
we start hitting into all sorts of limits.
And some of them are sort of artificially imposed
by the build system with like the aggregate,
you know, it tends to complain
if you start sending it to like your link action,
an aggregate size of object files
that is too large,
like above some limit.
And you can easily drop below that when you turn on thin LTO
because your object files get significantly smaller.
So that sort of ended up being a side benefit
that we hadn't actually initially anticipated.
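A hedged picture of the duplication being described (header and names invented): each TU that uses the header carries its own weak copy of the instantiation, and under ThinLTO the non-prevailing copies only live long enough to be inlined.

```cpp
// clamp.h (illustrative) -- a template defined in a header.
template <typename T>
T clamp_to(T v, T lo, T hi) { return v < lo ? lo : (v > hi ? hi : v); }

// a.cpp and b.cpp both include clamp.h and call clamp_to<int>(...), so each
// object normally carries its own weak (linkonce_odr) copy of clamp_to<int>.
// A plain link picks one prevailing copy and throws out the rest. With
// ThinLTO, the thin link already knows which copy prevails, the other copies
// are kept only long enough to be inlined, and they never reach the native
// objects -- so the final link has smaller inputs and less to deduplicate.
```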
I'm currently working on a project that gets random build failures on the CI, which we think is because of memory running out
during the linking phases.
Oh, really?
Interesting.
Sounds fun.
Maybe this would be good for you, Jason.
Yeah.
I think GCC is our primary compiler
on that particular platform,
but something to think about.
I think for Clang, we found... I'm sorry, yeah, the Clang binary itself, we found
that the, I think the objects were reduced, overall it was like a 25% reduction in the data size going into the final link because of that early deduplication.
So are there any reasons we shouldn't be using thin LTO or should we just
enable it on all of our projects now?
So we've actually turned it on by default internally for our builds targeting production.
There are some caveats, though. One of the things that we hit internally, and I think, you know, sort of independent
of whether you do distributed thin LTO or sort of in-process default thin LTO, is that normally, if you have a source file that feeds into a number of different binaries in your build system, you can compile it down to an object file, a native object file, once, and that could feed all of those links. With thin LTO, you can compile it into an IR file once, but those back-end compiles down to native code are now target-dependent. So because you're using whole program information to optimize down to your native object, you sort of lose the cross-binary parallelism you get in your build for those pieces of the compile.
So again, like as we talked about, more of the compilation moves into the link.
So you have to do that more for every single target binary that you're building.
So there is a little bit of a scalability issue
that you can hit if you're building tons and tons
of statically linked binaries
and you normally got to share native objects
across all of those static links.
Now parts of that compile have to be target-dependent. So we've had to
be a little bit careful with, you know, builds that say, launch, if you invoke Bazel, and you
are building, you know, a ton of different targets, you know, test targets at once. For us,
normally, actually, by default, that's not a huge
issue, because our tests are default linked shared. And you can do like little mini thin LTOs for your
shared libraries and get native shared libraries that are shared. But some number of tests,
maybe their integration tests have statically linked targets, and you might have a ton of them
that you've spun up in one Bazel
invocation. And now you get sort of an explosion of these backend actions. So we've had to be a
little bit strategic about how we handle those cases. You know, maybe not doing full thin LTO
for like lots of different unit tests where you're not really going to be testing the same
whole program optimization anyway. So, you know, do something a little bit simpler. So there's, you know, things that we've had to think
about in the build system. And in terms of the model of how we use it, that we've had to be a
little bit careful. I'm wondering, I don't know if you can speak to this at all. But I'm thinking,
from what you're telling me, that if I'm using CLion, or VS code, or one of these IDEs that does
compilation continuously in the background, you know, to see
if the code I'm currently editing is correct, then using thin LTO flags could perhaps speed up my
IDEs responsiveness in that regard, because it's only having to do the compilation of that
individual file. That sound like crazy talk? Well, if you were previously doing like regular
LTO, it would definitely speed you up because of that incremental ability.
Now, compared to a regular just dash O2 non-LTO, you still have to like redo that thin link.
But again, it's pretty fast and pretty small in terms of memory.
And you only have to redo the backend parts
of the compile that are actually touched.
So it would actually make that feasible, I would say,
to do whole program optimization.
Now, the question is,
do you really need to do whole program optimization
in the development cycle?
I don't know.
Yeah, I'm just thinking about, yeah,
just the individual.cpp files in the IDE.
I mean, you'll have to try it with a bigger project
and see what happens.
Okay.
PVS-Studio is a static analyzer that detects bugs in code written in C, C++, C#, and Java. The tool is a paid B2B solution, but there are various options for its free licensing for developers of open projects, Microsoft MVPs, students, and others. The analyzer is actively developed. New diagnostics appear regularly, along with expanding integration
opportunities. As an example, PVS Studio has recently posted an article on their site
covering the analysis of pull requests in Azure DevOps using self-hosted agents.
Check out the link to the article in the podcast description.
One question I had was, I think you said ThinLTO, you first proposed it, I think, in 2015.
Is it still in active development?
Are you still trying to improve ThinLTO?
Yeah, so we proposed the very initial stuff in 2015, and then
ended up implementing upstream throughout most of 2016. It got first released with LLVM in
late 2016. We actually internally didn't start using till 2017 because we were previously using GCC.
And so turning on thin LTO, and there's a whole other backstory there,
but we used to use a different custom technology we built into GCC to do a sort of pseudo whole-program optimization.
And so implementing thin LTO as part of LLVM was just one of the requirements for actually moving onto LLVM.
So that move onto LLVM actually happened in
2017. And since then, what I've been working on is rolling it out. As I mentioned, all of our
code targeting production by default has this turned on in the builds. There are some cases
that are opted out for like various build system reasons. And we're sort of working through those,
you know, like remaining issues, but most of our code for production, almost all of it has thin LTO on. And that was
just a lot of work. But at the same time, we've been now that we have thin LTO on, it's more of
a framework than a particular optimization. And so we're now looking at how do we leverage this
having a whole program optimization framework turned on for everything.
How do we leverage that with additional optimizations?
One of the things I've been looking at is rolling out whole program devirtualization using thin LTO.
Then once you do that, you can actually do a sort of broader class of whole program class hierarchy type of optimizations that you can start thinking about doing. So there's, and there's a variety of other things too. We've started using thin LTO to do some other optimizations that, you know, basically need to know like the working set size of the application. There's a variety of different things that you can do
with whole program optimization, as you can imagine.
So it is actively being developed.
And then there are actually quite a few external to Google contributors
to thin LTO upstream.
So I know Facebook has proposed,
and I think they're using internally,
a thin LTO-based similar code merging functionality because they care a lot about code size.
So, yes, long answer, but the short answer is yes, definitely under active development.
You mentioned devirtualization for like, so actually removing virtual functions?
Well, removing virtual function calls.
Okay.
So, and turning them into direct function calls.
So if you can, you know,
analyze your entire class hierarchy
across the whole program,
you know exactly which virtual functions are overridden.
And then you can then look at your particular call sites
and the statically declared type of that call site and know whether it's essentially the final
implementation. Sometimes you can do that without whole program optimization, but in a lot of cases,
you need that whole program optimization to actually see your entire class hierarchy to
guarantee that, yes, I know that I can make a direct call
to that particular virtual function implementation at this call site.
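A hedged sketch of the kind of call this enables (class names invented; the flags follow the upstream Clang options for requesting whole program devirtualization on top of LTO):

```cpp
// shape.h (illustrative)
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

struct Triangle final : Shape {
    int sides() const override { return 3; }
};

// Elsewhere in the program:
int count_sides(const Shape& s) {
    // If whole-program class hierarchy analysis proves Triangle is the only
    // implementation reachable here, this indirect call can become a direct
    // (and inlinable) call to Triangle::sides.
    return s.sides();
}

// Build sketch:
//   clang++ -O2 -flto=thin -fwhole-program-vtables -fvisibility=hidden \
//           a.cpp b.cpp -fuse-ld=lld -o app
```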
So something I've done in the past, probably not a good idea, but something I've done
is loading classes from shared libraries. And so I've got some new type that the compiler didn't know about at all
before.
And I override a virtual function in there.
And I use that as my kind of,
you know,
callback mechanism or whatever from a shared library being loaded.
If I turn on whole program optimization and it de-virtualizes things because
well,
that's all you told it about.
Could that break functionality like that now?
Yes.
Oh, okay.
Yeah, so, right.
So obviously, I mean, you really have to be able to guarantee
that you can see the entire class hierarchy to do this.
So you have to be, you know, you have to apply it where it is,
you know, you have those guarantees.
And actually, so to that end, originally,
so there was a whole program optimization implemented for regular LTO. This was actually
done for Chrome, which uses a security mechanism in the compiler called control flow integrity.
So Peter Collingbourne, who implemented that CFI, also implemented LTO-based whole program devirtualization, which helps reduce the cost.
And he also implemented a sort of a hybrid, partially thin LTO,
partially regular LTO mechanism for doing that.
But internally for Google, we really needed to be fully thin LTO.
So the work that I did more recently was to actually port that over
to make it work for just fully thin LTO
builds. And traditionally, the way that they used this for Chrome, for example,
was they turned on a compiler flag that basically says like your visibility, visibility for
everything marked, it was like visibility hidden, which, I mean,
can be used for other things besides whole program devirtualization, but it also applied to sort of your vtables and basically told LTO which vtables it could assume were basically hidden. And, you know, it saw the entire class hierarchy for those vtables in your sort of LTO unit. And they actually had to go then mark with an attribute
cases that sort of violated that where, for example, shared library might override or add
a child class or an overrider. For our internal use, actually,
so we don't want to have to go and manually mark up code.
It's not scalable.
And also we want to be able to build a source file once into IR
and use it for shared library builds
where it's not legal to do the sort of whole program class hierarchy analysis
and also use the same IR object for statically linked targets where we guarantee that we see
everything. And so some of the work that I did to make this usable inside Google was to basically
defer that decision about whether you have whole program visibility into the link. And then we basically apply an option
when we are building a binary statically
that says, yes, okay, go ahead.
You have your guarantee to have whole program visibility.
And actually that was something I think,
they talked about in the Bazel interview that you had
where some of the advantage of our build system
is we know whether we're building a library or a binary.
That's sort of part of what we,
when we write our build files,
you can give that information about the target.
And so we can leverage that to basically
do different types of link time optimization,
depending on whether you're building a library
or building a statically linked binary,
for example. Okay. So there is a hope that I could still use thin LTO in a project like that,
but not flip the flag that says, oh, and assume you have full visibility of all virtual.
Right. Right. Okay. You could even turn on the option that says do whole program devirtualization, and it would only apply to things, for example, classes defined in anonymous namespaces, where it knows just from the way that the code is written that I'm guaranteed to basically see the entire class hierarchy. Because in those cases you don't need linker information, you don't need a whole program; you know that it's essentially hidden LTO visibility.
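A hedged illustration of that split (names invented): an anonymous-namespace hierarchy needs no linker guarantee at all, while for externally visible hierarchies the guarantee can be asserted at link time, for example with LLD's --lto-whole-program-visibility option when producing a statically linked binary.

```cpp
namespace {
// Visible only inside this translation unit, so the compiler already sees
// every possible override; devirtualizing calls through Logger needs no
// linker information at all.
struct Logger {
    virtual ~Logger() = default;
    virtual void log(const char* msg) { (void)msg; }
};

void note(Logger& l) {
    l.log("hello");  // "hidden LTO visibility" comes for free here
}
}  // namespace

void run_notes() {
    Logger l;
    note(l);
}

// For public hierarchies, the whole-program guarantee can instead be asserted
// when linking a static binary (sketch):
//   clang++ -flto=thin -fwhole-program-vtables -fuse-ld=lld \
//           -Wl,--lto-whole-program-visibility *.o -o app
```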
Right. One of the things you mentioned to us when we were talking via email before this interview was that you're working on sanitizer-like heap profiling. At least that's what it sounded like to me. Is that right?
Okay, so yeah, I probably didn't describe that very well. So we're working on heap profiling built into LLVM that basically uses a sanitizer approach to profiling the memory.
So for example, in LLVM, we have support for various sanitizers like ASAN, address sanitizer.
And it uses something called shadow memory to make that analysis and tracking of memory that
it needs to do faster than other approaches. And so basically, the idea behind shadow memory is
every piece of memory that you allocate has like a, you know, you can do this mapping. Like in ASan it's, I think, eight bytes down to like one byte. I forget their mapping. It's eight to one, I think, so it must be eight bytes down to one byte of shadow memory, flagging whether basically you've done something violating, whether you've accessed memory appropriately or not.
And that ASAN ends up having, I think, a 2x runtime cost,
which is actually pretty low for doing that kind of tracking of memory.
And so basically for heat profiling,
we're leveraging that sort of idea of doing shadow memory.
And so for each piece of memory that you allocate,
the mapping is again eight to one,
but the granularity is a little bit larger.
And so we can track in the shadow memory, the hotness.
So every time you load or store a piece of memory,
it's a very simple instrumentation in the compiler
to say like shift and mask
and like update the count in my shadow memory.
And then when you're
done with the run or when you deallocate memory, you basically grab the shadow memory count. And
now you get like a listing of for every piece of memory I accessed with full context because
the sanitizer also tracks the allocation context. How hot was that data? And then we also have some
tracking in the header too: when you allocate
something, you allocate an additional little piece of header. And again, that's sort of similar to
what ASAN does. And you can track other things in there like the lifetime. So again, you can get at
the end of your run, you can say for this allocation context, what is my hotness? And also
what is my lifetime? You can track that like as an average,
or like a min max, or you know, you can do different types of tracking because
so the idea is that at the end of your run, you have full stack context of allocation sites,
along with average hotness, or, you know, some indication of hotness, some indication of lifetime.
And then eventually, we would like to feed that back into the compiler and either do some transformations in the compiler and or pass information off to the memory allocation runtime
to allocate and handle the memory differently depending on whether it's hot or cold or long lifetime or short lifetime.
So for the hotness, you can get better locality.
For tracking the lifetime, you can get hopefully better reduced memory fragmentation
by basically allocating long-lived memory, say, in one huge page and short-lived memory in a different one.
So there's various things that you can look at doing.
So we proposed this heap profiler upstream in LLVM.
And at this point, the instrumentation side and the runtime has been upstreamed.
And what we haven't done yet is the actual feedback into the compiler.
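A purely conceptual sketch of that shadow-memory idea, not the actual LLVM heap profiler (granularity, offset, and counter width here are illustrative):

```cpp
#include <cstdint>

// Conceptual sketch only: each 8-byte granule of application memory gets an
// 8-byte access counter living in a parallel "shadow" region at a fixed
// offset (all numbers illustrative).
constexpr std::uintptr_t kShadowOffset = 0x0000'4000'0000'0000ULL;

inline std::uint64_t* shadow_counter_for(const void* addr) {
    auto a = reinterpret_cast<std::uintptr_t>(addr);
    // Round down to the 8-byte granule and rebase into the shadow region.
    return reinterpret_cast<std::uint64_t*>((a & ~std::uintptr_t{7}) + kShadowOffset);
}

// What the compiler-inserted instrumentation conceptually does at each access:
inline int counted_load(const int* p) {
    ++*shadow_counter_for(p);  // bump the hotness counter for this granule
    return *p;                 // the original load
}
```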
That sounds hard.
It's, I mean, it's like a matter of engineering.
Profile-guided optimization, but taken kind of to another level, I feel like.
Right.
It's very interesting. Because internally, we use profile guided optimization quite a lot. And we use for our peak optimization customers, like the binaries that really care the most about performance, they use instrumentation based profiling.
For sort of the long tail, we use data collected from hardware counters, which is called auto FDO. And that gives a lot, you know, a good amount of
performance, but not quite the same performance you get with instrumentation based profiling.
So we would like to actually do a single profiling run, and essentially collect both the
your sort of traditional profile guided optimization profile information alongside the heap profile and do a single
feedback and basically use them independently and also sort of together. It's the ultimate vision.
So is the, sorry, the other technologies that you mentioned, like instrumentation
based profile guided optimization, is that stuff that's built into Clang currently? Or LLVM? Mm-hmm.
Oh, okay.
So yeah, so Clang and LLVM have support for both that instrumentation-based profiling. It also has support for feeding back what I was talking about, the auto FDO profiles, which we collect
in production basically using hardware counters. And there's, I think there's, uh, yeah, it depends on your architecture and what hardware you're collecting on, but there's mechanisms for basically converting hardware counters into these sort of AutoFDO profiles that you can then feed back.
Gonna have to look that up later.
Yeah, it's called, um, so of course naming is always interesting. So,
auto FDO is,
because we initially
did this on GCC,
GCC calls it
feedback directed optimization,
FDO,
and just for fun,
Clang calls it
PGO,
profile guided optimization.
So,
inside LLVM,
it's actually called
sample PGO,
which is the same thing as auto FDO.
Just depends on which compiler you're using.
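For reference, the instrumentation-based flavor in Clang looks roughly like this (a sketch; the profile file names are illustrative), and the resulting profile can also be fed into a ThinLTO build:

```cpp
// hot_path.cpp (illustrative workload)
int main() { /* representative workload */ return 0; }

// Step 1: build instrumented and run a representative workload.
//   clang++ -O2 -fprofile-instr-generate hot_path.cpp -o app
//   LLVM_PROFILE_FILE=app.profraw ./app
//
// Step 2: merge the raw profiles into an indexed profile.
//   llvm-profdata merge -o app.profdata app.profraw
//
// Step 3: rebuild using the profile, optionally together with ThinLTO.
//   clang++ -O2 -fprofile-instr-use=app.profdata -flto=thin hot_path.cpp -o app
```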
Okay, I think I have enough search links up here to come back to that later.
Now, what's interesting to me, though, is I mean, you're talking about all these technologies coming together.
We're running a little short on time.
But my experience has been that, you know, PGO will gain me something, but LTO tends to gain me more.
And using them together seems to often be a waste of time on the average application.
But perhaps I'm missing something here.
So actually, we find that the performance we get from the two of them combined is actually better than just adding
together the individual benefits. They're doing slightly different things. I mean, FDO,
PGO is telling you maybe, you know, like which calls are hot, which functions are hot, you know,
which blocks are hot. And so you can do like, you know, you can do smarter inlining with that
information. You can do, you know, code layout with that information. You can do code layout with that information. You can
drive a bunch of optimizations with that. LTO is giving you the ability to do, for example,
inlining across module boundaries. If you combine them, so I have this ability to do inlining across
module boundaries, but if I don't know what's hot, you could make the wrong decisions. So when
you combine it with PGO information, you can do much smarter inlining. So actually your benefit gets magnified. Okay. I think the last time I evaluated this was
even before thin LTO was probably about five years ago. Okay. And also I work on simulation
software. So trying to come up with what is a typical workload is almost impossible because
we don't know how the user,
what they're going to try to simulate.
That's always hard, yeah.
Interesting.
Well, thank you for that diversion there.
Just to go back to thin LTO one more time,
if a listener is already using Clang
and using traditional LTO currently,
is it as easy as swapping out a compiler flag to switch over to thin LTO?
So traditional LTO is -flto.
If you just make that -flto=thin, boom,
it's thin LTO.
Awesome.
So,
and all the linkers that support LLVM Clang's regular LTO support thin LTO,
I mean,
at least all of the publicly available linkers.
I don't know about all the niche ones in there, you know, that aren't public.
But yeah, like Gold and LLD.
And there's also like you can actually do it with GNU LD because it uses a plugin mechanism.
So it just uses the Gold plugin on the LLVM side.
I know there's support there.
I know people do it.
I will say that we don't test that with any build bots on the LLVM side. I know there's support there. I know people do it. I will say that we don't test that with any build bots on the LLVM side,
that combination,
but I'm told it works.
And LD64,
if you're developing for macOS, also supports it.
Okay.
Well,
it's been great having you on the show today,
Teresa.
Thanks for having me.
Where can listeners find you online?
So I'm not very, I don't really do much social media at all.
So really.
That's better for your mental health.
You know, my email, my work email that I also use for LLVM development is tejohnson at google.com.
And that's probably the best way to reach me about this stuff.
Okay.
It's been great having you on the show today.
Thanks.
I'm happy to be here.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter. We'd also like to thank
all our patrons who help support the show through Patreon. If you'd like to support us on Patreon,
you can do so at patreon.com slash cppcast. And of course, you can find all that info and
the show notes on the podcast website