CppCast - ThinLTO
Episode Date: October 30, 2020

Rob and Jason are joined by Teresa Johnson from Google. They first discuss the Qt6 beta release and a blog post proposing range_ref, a lightweight view for ranges. Then they talk to Teresa about ThinLTO, the scalable and incremental Link Time Optimization built into LLVM.

News
Qt 6.0 Beta Released
Range_ref
Rob and Jason AMA

Links
ThinLTO
CppCon 2017: Teresa Johnson "ThinLTO: Scalable and Incremental Link-Time Optimization"
Meeting C++ 2020 - ThinLTO Whole Program Optimization: Past, Present and Future

Sponsors
PVS-Studio. Write #cppcast in the message field on the download page and get one month license
PVS-Studio: analyzing pull requests in Azure DevOps using self-hosted agents
Why it is important to apply static analysis for open libraries that you add to your project
Use code JetBrainsForCppCast during checkout at JetBrains.com for a 25% discount
Transcript
Episode 271 of CppCast with guest Teresa Johnson, recorded October 29th, 2020.
Sponsor of this episode of CppCast is the PVS Studio team.
The team promotes regular usage of static code analysis and the PVS Studio static analysis tool.
And by JetBrains, the maker of smart IDEs and tools like IntelliJ, PyCharm, and ReSharper.
To help you become a C++ guru, they've got CLion, an Intelligent IDE,
and ReSharper C++, a smart
extension for Visual Studio.
Exclusively for CppCast, JetBrains is offering
a 25% discount on yearly
individual licenses on both of these
C++ tools, which applies to
new purchases and renewals alike.
Use the coupon code JETBRAINS
for CppCast during checkout at
JetBrains.com to take advantage
of this deal.
In this episode we talk about Qt 6 Beta and a view for ranges.
Then we talk to Teresa Johnson from Google.
Teresa talks to us all about LLVM's thin LTO. Welcome to episode 271 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
Doing okay. I just saw on Twitter that you were on another podcast getting interviewed. Is that right?
Yeah, the LeanPub FrontMatter podcast.
So I was being interviewed about the book that I published a couple months ago now.
Oh, cool. So they just interview lots of different types of authors. Yeah. I mean, I don't know who all
they interview because I didn't actually go back and look at their archives, but, um, uh, it's,
it seems to be mostly people who are, who have published books on lean pub. So it's the lean
pub podcast. Makes sense. Had a good time there. Yeah, it was a fun interview. It's weird being on the other side of the table, so to speak.
Yeah, definitely. Okay, well, at the top of every episode I like to read a piece of feedback. We got this tweet
from Corentin, who we've had on the show before. He said, listening to Ben
Dean on CppCast, I don't think Titus is a big fan of
UDLs, user-defined literals. I wonder why
we need more Tituses. And Titus actually responded to this saying, it's a very simple trick. Imagine
how a feature will work if everyone piles on to use it, especially in the face of collisions.
And then I guess he puts UDLs in the same bucket as global namespaces and macros.
Yeah. I know some people who are huge fans of UDLs
because they like the way that it simplifies some things,
but I've heard too much from compiler developers
and I can't bring myself to use them now.
Okay.
Well, we'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter,
or email us at feedback at cppcast.com.
And don't forget to leave us a review on iTunes or subscribe on YouTube. Joining us today is Teresa Johnson. Teresa develops
compiler optimization technologies at Google. She's an active contributor to the LLVM open
source project and designed the thin LTO scalable link time optimization framework. Prior to joining
Google in 2011, she developed compiler optimizations for the Itanium compiler at HP.
She received a PhD in electrical and computer engineering from the University of Illinois at Urbana-Champaign in 1998.
Teresa, welcome to the show.
Thank you. Glad to be here.
Itanium is a relatively short-lived technology, huh?
Yeah, it was great as a compiler developer. There's a lot of really interesting... It really relied on compiler optimization,
which was a neat architecture to be working on.
But then for a variety of reasons,
many of them not technical,
it ultimately didn't succeed.
So...
Am I correct to think, for some reason,
I feel like the Itanium calling conventions,
that's what actually directly influenced
our 64-bit
Intel calling conventions today? I think so. Okay. Yeah, because I often see references to the Itanium
calling convention even in things that I work on now. So yeah. Interesting legacy there. Very cool.
Okay, well, Teresa, we got a couple news articles to discuss, and then we'll start talking more about your work on LTO and other things you're doing at Google.
Okay.
All right.
This first article we have is about Qt 6.0 beta being released.
And I actually have a separate link here with a list of new features that are available in Qt 6.
One of the big ones I noticed was that they are now building Qt with CMake.
I thought that was a big change for them, right?
Oh, I missed that.
I did read recently that they had officially dropped support for their...
It's some proprietary build system, right?
Qt build or something.
Hmm.
Yeah.
Sorry, what were you going to say?
No, I was just going to say there's lots of random just classes changed in places.
So I was browsing through it, but it was hard to get an overall feel for what someone should expect for Qt 6.
Although it does seem like a few things were removed.
There's a few comments on the first link with people saying, but you removed the three features that I require.
So, but they said those will be coming in a future update, I guess, or via package management.
I think they also mentioned
that there is a Qt 5 compatibility library.
So I guess if you're dependent
on some older legacy APIs that were removed,
maybe you can still access them that way.
They've done that a lot historically.
I think every version has had a backward compatibility library.
All right, and then the next article we have is about range_ref, and I thought this article, first of all, was just very good at explaining the motivation and use cases for spans. And then they go into, you know, what they see as some of the shortcomings and propose this new range_ref type.
Yeah, this one was particularly interesting to me because I just gave
a talk at my meetup talking about various techniques for making other kinds of view
like things, like stream_view and function_ref. And, uh, it's kind of funny to me because this is like the,
I solved a similar problem in a much more complicated way, basically. My solution,
I guess, is slightly more generic. Because this one, the main entry point is for_each; that's all you can do is a for_each over the range. Mine would let you get back
begin and end iterators. But yeah, anyhow. I took a look through that article. I thought it was an
interesting article, and it addresses a couple of issues we frequently see with compiler
optimization with people misusing, well, temporary lifetime issues, which are always fun to debug,
and also reducing the number of required template instantiations,
which they don't mention it there,
but a frequent problem that we see when we turn on higher levels of optimization,
particularly whole program optimization,
is people who have inadvertently defined template functions in their source files
and not in the header,
which often you can get by with until suddenly the compiler does something interesting
and then you get a link time error.
So, okay. And this, uh, last thing we have is an announcement on the Meeting C++ blog. And this is about, uh, you and I, Jason, we're actually going to be doing, uh, an AMA on the first day of Meeting C++.
First time I've done one of these,
I don't know about you.
It will be the first time I've done one as well.
And I don't think the actual time is announced.
Just that we will be doing it on the first day of the conference.
Yeah, I don't see a time listed here either.
And what is the first day of the conference? It's like what? First or second week in November?
Seems like something we should know if we're going to be doing it.
I guess, Teresa, you should know as well since you're one of the keynoters, right?
Yeah, and the Friday two weeks from this Friday is my talk, so it's like a day, a couple days before and after that.
Okay, so second week of November, I guess.
Yeah.
So that's the 13th then, I guess, is when you'll be giving your keynote, and Rob and I will be doing the AMA on the 12th.
What is an AMA?
I guess, ask me anything. Ah, thank you. Okay. I'm only familiar with it from Reddit. Oh, okay. Yeah. I don't know. It started on Reddit. Okay. So,
Teresa, as we said, you're going to be doing one of the keynotes at Meeting C++ in a few weeks.
And I think the main thing you'll be talking about is thin LTO. Could you start off by giving us an
overview of what thin LTO is? Sure. So well, let me start with just what link time optimization is.
So basically, link time optimization is a way of implementing whole program optimization.
So you know, typically, when you build your source files,
your C++ source files,
the optimizer works on a translation unit basis
and then you get native code
and you link it all together at the end.
So the optimization is limited
to what it can see within a translation unit.
So the source files and anything inlined from the headers.
So with whole program optimization, obviously you need to be able to see the entire program.
The advantage of that is you can do optimization across those translation unit boundaries. And I
might say module just because in LLVM parlance, a translation unit becomes a module. So if I say
module, that's what I mean. So the advantage of doing that at link time is, well, for one,
it doesn't change the user's model of compilation. So you still would do like a bunch of compiles
onto an object file, which when you turn on link time optimization will then not be actually native
code. It'll be the compiler's internal representation sort of hidden under the covers. And then when you link all those together, there's some mechanism that needs to be built into the linker. It could be a plugin, which is what happens with BFD ld and gold. Or it can be like with LLD, which is LLVM's linker; it just links in the compiler as part of that linker. And it basically hands those files off to the compiler.
And then the compiler can do something to basically see the entire program. So there's
no change to the user model. And then the other advantage of doing it in link time is that the
linker then can give information to the compiler about symbols, symbol resolutions. So which copy of a, say you have like template file
or template functions defined in headers,
you get like n copies of those.
It can tell you which one is prevailing.
It can also tell you if there are symbols
that will be exported potentially.
So it can tell you, for example,
if you're linking into a shared library,
it will tell you something different about then,
you know, if you are linking
into an actual statically linked binary, where it could actually tell you
for sure that, you know, which symbols are exported and which ones won't. So you can use
that link time information to basically do more aggressive optimization. So that's why whole
program optimization is frequently done as a link time optimization. So the way that that has been done traditionally
is you get all these objects,
which are the compiler's internal representation,
the linker hands it off the compiler.
And the easiest way to do the whole program optimization
is basically take all those IR files
and basically merge them all together
into one big internal representation
of the entire program.
And now I can see everything, right?
Well, that's great, except you have a bit of a blow up, both memory and time.
And for small programs, that might be okay.
But, you know, one of the issues we had in Google is that our applications are,
you know, I was sort of amazed when I moved over to Google,
and the biggest problem we had was like on the iCache side and the ITLB side. And basically,
the applications are so huge, they blow up, you know, they are hitting up against every limit.
And so, you know, when you get to the compiler, and you want to talk about whole program
optimization, trying to merge all those together is just infeasible. And I think one of our applications we tried to build with whole
program, like traditional LTO on a really beefy machine, and it just, you know, you hit like 60
gigs, and it's just still going. So it's totally infeasible. So thin LTO is a way that we
is a mechanism that we designed to basically do link time optimization in a scalable way so that we can actually build our applications with whole program optimization.
So the way that that works is that instead of merging all of your IR into one big monolithic blob, when the compiler builds those IR data files, it also computes a small summary of the module,
things that you might need for a whole program optimization.
So call edges, which symbols are defined,
what their linkage type is, which ones are referenced,
and the edges between those things.
And then when the linker,
instead of passing off all of the IR
to the compiler to merge, it just merges all those summaries. And so they're much smaller.
And so they don't blow up in memory. And you do all of the whole program analysis on the summaries.
And it's much, much faster. And then at the end of that analysis, all of the translation units
can continue to be optimized independently with
information that that whole program phase passes off to it. So you basically still
maintain the parallelism of compiling each translation unit independently, but with
certain information from this very thin whole program step. And one of the things that's really key to performance
is inlining. And so we want to be able to inline across translation unit boundaries.
So to do that, this whole program phase will decide, you know, based on like the full call
graph that it can build by merging the summaries, I can decide, you know, which functions from other
translation units I might want to inline.
And it passes that information off.
And so each translation unit, as it's compiled through the backend phase of this thin LTO,
will basically go and just grab the functions from other objects that it needs,
rather than pulling in the whole IR.
It just pulls in these smaller pieces.
And so you don't get the memory blow up.
So you get the parallelism, which reduces the time, and you get the smaller memory.
And then another benefit of that is that you can actually do better incremental compiles.
You don't need to, if you're doing like a traditional LTO where you've merged all the IR,
if you touch anything, you have to redo that whole big thing.
With thin LTO, you have to redo the thin link, which is what we call that thin sort of summary-based whole program phase, but it's pretty fast.
And then you only need to do sort of the backend pieces of the, you know, compiles that had any new information. So you
can cache basically your final native code and decide based on new summary information coming
in, new analysis information coming in, which actually have to be touched. So if you touch
one source file, you know, that might affect that particular source file's backend, but it might
also affect a couple of others that were importing functions from it. But it's much smaller than obviously doing the whole thing together. So in a nutshell, that's the idea behind thin LTO.
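For a picture of what that looks like from the command line, here is a minimal sketch with Clang and LLD (the file names and the jobs flag are illustrative, not from the episode):

```cpp
// lib.cpp -- one of many translation units in the program (illustrative).
int expensive_helper(int x) { return x * x + 1; }

// Build each translation unit to bitcode plus a ThinLTO summary instead of
// native code, then link as usual; the thin link and the parallel backend
// compiles run inside the linker (LLD here), with an optional jobs knob:
//
//   clang++ -O2 -flto=thin -c lib.cpp main.cpp
//   clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-jobs=8 lib.o main.o -o app
```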
What kind of performance benefits can you see
in general from using whole program optimization or LTO versus traditional linking?
So, you know, one of the biggest levers,
I mentioned this just a minute ago, is inlining.
That's one of the biggest performance levers
that we have in the compiler.
You get huge benefits from having inlining.
And when you can do it across these module translation unit boundaries,
that can be a really big win.
That's really, if you look at the sort of regular way
of merging all of the IR and doing LTO,
the biggest performance win is from that cross module,
cross translation unit boundary inlining that you get.
And so, you know, really the first optimization
that we implemented for thin LTO
was that sort of function importing that I talked about.
So decide which functions I need and just very quickly parse out, you know, from the other IR files, the pieces that I need and pull those in. And that actually gets you
the vast majority of your regular LTO whole program benefit and can give you really big wins
over just regular O2 compilation. So we see on average 5% performance improvement
from turning on thin LTO, but it can vary.
I mean, we've seen 10 plus percent
and it just depends on the application.
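To make the cross-module inlining concrete, here is a hedged two-file sketch (names invented): with plain per-TU compilation the call is opaque, while the ThinLTO thin link can choose to import and inline the callee's body.

```cpp
// math.cpp (illustrative)
int scale(int x) { return 3 * x; }

// main.cpp (illustrative)
int scale(int x);  // defined in the other translation unit

int main() {
    // Compiled per-TU at -O2, the compiler only sees the declaration and must
    // emit a call. Under -flto=thin, the summary-based thin link can decide to
    // import scale's body into this module so the regular inliner can inline it.
    return scale(14);
}
```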
So one of the first questions I had was,
this sounds like this became necessary
just because of the size and scale of programs at Google.
But a lot of us work on programs that are nowhere near Google scale.
Is thin LTO still a benefit versus traditional LTO
because of what you were saying with the incremental compilation?
Okay.
Yeah.
And in fact, it was kind of interesting.
When we proposed thin LTO, at EuroLLVM 2015, I think, is when David Li, who I work with at Google, and I proposed this model, we were thinking of
it in terms of, you know, addressing the scalability.
But we ended up collaborating with Mehdi Amini, who was at Apple at the time.
And he was really interested in it
from the incremental perspective.
And that actually became a really big benefit
for us as well.
But it was sort of scalability was our sort of first,
the biggest concern we were trying to address initially.
And so for a lot of Apple's code, thin LTO became a way to get that sort of very fast incremental build ability, more so than needing it from a scalability perspective.
Okay.
And then what sort of runtime differences are there between a program using traditional
link time optimization and thin link time optimization?
Like, I understand, you know, the incremental builds, that's a huge win for programmer productivity. But if you're making a release build where you're just compiling this once on a build server or something, is it worth it to still use thin LTO there if you can link with regular link time optimization?
So that's a good question.
And certainly it's going to be faster, but, you know, the smaller the program you have, obviously those benefits are going to be less overwhelming.
And regular LTO, I mean, the performance delta between regular and thin LTO was actually quite small.
I mean, we weren't able to measure it for our larger applications, but we were able to measure it across SPEC CPU.
And it's on average, it's
like very, very close. But there's certainly some things that we can't do without the full IR and
being able to, you know, see everything, you know, do this sort of iterative, you know, anything
that's like, really requires a lot of like iterative analysis and transformation on IR.
You know, that gets a little bit harder with thin LTO. We've done, rolled out a bunch of new optimizations with thin LTO, but it takes a bit more
work to get, there's pieces that are just harder to do with, you know, the mechanism where we have
like a thin whole program optimization and then the IR transformation. So if you have a small application that can build fine with full, you know, regular LTO, and you don't care about the incremental benefits, and you really care about eking out every last bit of performance, then, you know, maybe regular LTO still has some benefit. But it's, you know, not going to give you a lot of benefit over thin LTO.
Okay.
Thin LTO, that name, that's a clang or LLVM-specific technology, right?
Yes.
So it's only implemented in LLVM right now. I know reading through blogs of someone who works on LTO for GCC that they just talked about, you know,
potentially adding thin LTO there, but I don't think anyone has seriously started looking at
that. So this kind of super fast, thin link time optimization that you've done, there's nothing comparable to it in MSVC or GCC that you know of?
No. So I'm less familiar with the Microsoft compiler.
And GCC has a much more scalable regular LTO than LLVM's regular LTO.
Okay.
But it still has.
And it's been improved over the last couple of years.
But ultimately, it's working on IR in a sort of serial phase.
And so it does have a memory cost.
That makes it infeasible for us.
Okay.
We had an episode recently with some developers on Bazel.
Does Bazel have any specific support for thin LTO?
Or maybe the better question is, do build systems need to have any special support in order to use thin LTO?
So for in-process thin LTO, which is sort of the default mechanism, it's completely transparent to the user.
So it would be transparent to the build system. So for example, if you do
in LLVM, if you turn on LTO, regular LTO or thin LTO, you hand off all the object files to the linker, and under the covers it will do this LTO. So for thin LTO, what that does is
it spins off threads for all those backend compilations. So you don't necessarily need to have build time support.
However, one of the other big advantages of thin LTO, and the reason why we designed it the way
we did is that we have a distributed build system in Google. And with thin LTO, you can make that
a distributed compilation. So all those, so you do the thin link, and then you have all of these
backend, what we call backend compilations that take the IR and apply the whole program results,
and then, you know, build your final native objects. If you're doing that in process,
that happens in threads and then passes those native objects back to the linker, and you just get an a.out out, you know, like, you don't actually see that happening. But what you can do with thin LTO is break out those parallel backends and make those actually separate processes. There's a way through Clang to pass in the analysis results and do that thin LTO backend.
And so what we did was, and we worked with the Bazel team, to add support to add those thin LTO backends as separate actions.
So there's a, what we call in Bazel parlance, an LTO indexing action, which is the thin
link.
That's one process.
And then we have like N thin LTO backend actions and then a final native link.
And so those, all of those parallel backends get distributed just as as you would distribute your normal C++ compilations.
And so we get the benefit of that really high parallelism
in those backend compiles.
So Bazel does have support in there.
It is aware of the thin LTO compilation when you've turned that on.
And it has support for actually distributing and breaking out those backend compiles.
So yeah, so I worked with them quite a bit over the years.
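The rough shape of that distributed flow, as a sketch along the lines of the upstream Clang documentation (file names illustrative; the exact spelling of the linker options varies by linker):

```cpp
// a.cpp -- one translation unit in the distributed build (illustrative).
int helper();
int answer() { return helper() + 1; }

// Phase 1, on any build worker: compile each TU to bitcode plus summary.
//   clang++ -O2 -flto=thin -c a.cpp b.cpp
//
// Phase 2, the "LTO indexing" action: a single thin link that emits only the
// per-TU index and import files, no native code.
//   clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only a.o b.o
//
// Phase 3, N distributable backend actions that apply the whole-program results.
//   clang++ -O2 -x ir a.o -fthinlto-index=a.o.thinlto.bc -c -o a.native.o
//   clang++ -O2 -x ir b.o -fthinlto-index=b.o.thinlto.bc -c -o b.native.o
//
// Final native link of the already-optimized objects.
//   clang++ a.native.o b.native.o -o app
```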
Okay, so Bazel's got like some extra support to parallelize it really well.
But I guess any build system that uses Clang can make use of thin LTO?
Yes.
Okay.
Yep.
Very cool. And so there are other build systems
that have looked at adding support
for the distributed build.
So the, I'm blanking on the name,
but the build system used for Chrome, for example,
they are adding in, they have,
so thin LTO was on for Chrome,
but they're adding in support for the distributed.
I think they've added it in.
I don't know whether it's turned on by default.
And they use a different build system.
And I know that there are other companies out there who have looked at enabling distributed thin LTO,
but it kind of depends.
Some companies have their own custom linkers,
and so they have to plumb that support in.
Sorry, I don't recall if we've already addressed this, but how does the link time speed for thin LTO versus traditional linking compare?
So if you're doing that in-process,
thin LTO, obviously, more of your compilation
time moves into the link.
So, you know, you don't get the explosive growth you get with regular LTO, but you definitely
are going to, you know, you're doing more, you're doing the whole program analysis, and
then you're actually compiling down to native code, a whole bunch of files, and then doing
a final native link.
So obviously, like more of your compilation time moves to that part of your compile. If you break it out as
distributed build, you know, the final native link is actually faster with thin LTO. Because we've
simplified the code more aggressively. So one of the examples is, you know, when you have a template
code, that's the template definitions in your header files.
Normally you get N copies of that.
Every time you've used that header,
you get a, if that template function is needed,
you instantiate a copy of it in every translation unit,
essentially that references it.
And then you compile it onto object files.
You have N copies of that function.
The linker will duplicate that.
And so it decides which one's prevailing and it throws out the rest.
With thin LTO, because we have linker information coming into the thin link,
telling us which is the prevailing copy.
We actually, one of the things we do is,
all of the copies that are not prevailing, we actually drop them after inlining.
So we keep them around long enough so that they can be inlined,
but then they go away,
and so they're not even in your native object that gets generated.
So the linker doesn't have to do any more of this deduplication.
So you get not only, I guess, faster link time,
but one of the, actually,
this ended up being a nice side benefit
for us internally is again,
with these really large applications,
we start hitting into all sorts of limits.
And some of them are sort of artificially imposed
by the build system with like the aggregate,
you know, it tends to complain
if you start sending it to like your link action,
an aggregate size of object files
that is too large,
like above some limit.
And you can easily drop below that when you turn on thin LTO
because your object files get significantly smaller.
So that sort of ended up being a side benefit
that we hadn't actually initially anticipated.
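A hedged picture of the duplication being described (header and names invented): each TU that uses the header carries its own weak copy of the instantiation, and under ThinLTO the non-prevailing copies only live long enough to be inlined.

```cpp
// clamp.h (illustrative) -- a template defined in a header.
template <typename T>
T clamp_to(T v, T lo, T hi) { return v < lo ? lo : (v > hi ? hi : v); }

// a.cpp and b.cpp both include clamp.h and call clamp_to<int>(...), so each
// object normally carries its own weak (linkonce_odr) copy of clamp_to<int>.
// A plain link picks one prevailing copy and throws out the rest. With
// ThinLTO, the thin link already knows which copy prevails, the other copies
// are kept only long enough to be inlined, and they never reach the native
// objects -- so the final link has smaller inputs and less to deduplicate.
```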
I'm currently working on a project that gets random build failures on the CI, which we think is because of memory running out
during the linking phases.
Oh, really?
Interesting.
Sounds fun.
Maybe this would be good for you, Jason.
Yeah.
I think GCC is our primary compiler
on that particular platform,
but something to think about.
I think for Clang, we found... I'm sorry, yeah, the Clang binary itself, we found
that the, I think the objects were reduced, overall it was like a 25% reduction in the data size going into the final link because of that early deduplication.
So are there any reasons we shouldn't be using thin LTO or should we just
enable it on all of our projects now?
So we've actually turned it on by default internally for our builds targeting production.
There are some caveats, though. One of the things that we hit internally, and I think, you know, sort of independent
of whether you do distributed thin LTO or sort of in-process default thin LTO, is that normally, if you have a source file that feeds into a number of different binaries in your build system, you can compile it down to an object file, a native object file, once, and that could feed all of those links. With thin LTO, you can compile it into an IR file once, but those back-end compiles down to native code are now target-dependent. So because you're using whole program information to optimize down to your native object, you sort of lose the cross-binary parallelism you get in your build for those pieces of the compile.
So again, like as we talked about, more of the compilation moves into the link.
So you have to do that more for every single target binary that you're building.
So there is a little bit of a scalability issue
that you can hit if you're building tons and tons
of statically linked binaries
and you normally got to share native objects
across all of those static links.
Now parts of that compile have to be target-dependent. So we've had to
be a little bit careful with, you know, builds that say, launch, if you invoke Bazel, and you
are building, you know, a ton of different targets, you know, test targets at once. For us,
normally, actually, by default, that's not a huge
issue, because our tests are default linked shared. And you can do like little mini thin LTOs for your
shared libraries and get native shared libraries that are shared. But some number of tests,
maybe their integration tests have statically linked targets, and you might have a ton of them
that you've spun up in one Bazel
invocation. And now you get sort of an explosion of these backend actions. So we've had to be a
little bit strategic about how we handle those cases. You know, maybe not doing full thin LTO
for like lots of different unit tests where you're not really going to be testing the same
whole program optimization anyway. So, you know, do something a little bit simpler. So there's, you know, things that we've had to think
about in the build system. And in terms of the model of how we use it, that we've had to be a
little bit careful. I'm wondering, I don't know if you can speak to this at all. But I'm thinking,
from what you're telling me, that if I'm using CLion, or VS code, or one of these IDEs that does
compilation continuously in the background, you know, to see
if the code I'm currently editing is correct, then using thin LTO flags could perhaps speed up my
IDEs responsiveness in that regard, because it's only having to do the compilation of that
individual file. That sound like crazy talk? Well, if you were previously doing like regular
LTO, it would definitely speed you up because of that incremental ability.
Now, compared to a regular just dash O2 non-LTO, you still have to like redo that thin link.
But again, it's pretty fast and pretty small in terms of memory.
And you only have to redo the backend parts
of the compile that are actually touched.
So it would actually make that feasible, I would say,
to do whole program optimization.
Now, the question is,
do you really need to do whole program optimization
in the development cycle?
I don't know.
Yeah, I'm just thinking about, yeah,
just the individual.cpp files in the IDE.
I mean, you'll have to try it with a bigger project
and see what happens.
Okay.
PVS-Studio is a static analyzer that detects bugs in code written in C, C++, C#, and Java. The tool is a paid B2B solution, but there are various options for its free licensing for developers of open projects, Microsoft MVPs, students, and others. The analyzer is actively developed. New diagnostics appear regularly, along with expanding integration
opportunities. As an example, PVS Studio has recently posted an article on their site
covering the analysis of pull requests in Azure DevOps using self-hosted agents.
Check out the link to the article in the podcast description.
One question I had was, I think you said ThinLTO, you first proposed it, I think, in 2015.
Is it still in active development?
Are you still trying to improve ThinLTO?
Yeah, so we proposed the very initial stuff in 2015, and then
ended up implementing upstream throughout most of 2016. It got first released with LLVM in
late 2016. We actually internally didn't start using till 2017 because we were previously using GCC.
And so turning on thin LTO, and there's a whole other backstory there,
but we used to use a different custom technology we built into GCC to do a sort of pseudo whole-program optimization.
And so implementing thin LTO as part of LLVM was just one of the requirements for actually moving onto LLVM.
So that move onto LLVM actually happened in
2017. And since then, what I've been working on is rolling it out. As I mentioned, all of our
code targeting production by default has this turned on in the builds. There are some cases
that are opted out for like various build system reasons. And we're sort of working through those,
you know, like remaining issues, but most of our code for production, almost all of it has thin LTO on. And that was
just a lot of work. But at the same time, we've been now that we have thin LTO on, it's more of
a framework than a particular optimization. And so we're now looking at how do we leverage this
having a whole program optimization framework turned on for everything.
How do we leverage that with additional optimizations?
One of the things I've been looking at is rolling out whole program devirtualization using thin LTO.
Then once you do that, you can actually do a sort of broader class of whole program class hierarchy type of optimizations that you can start thinking about doing. So there's, and there's a variety of other things too. We've started using thin LTO to do some other optimizations that, you know, basically need to know like the working set size of the application. There's a variety of different things that you can do
with whole program optimization, as you can imagine.
So it is actively being developed.
And then there are actually quite a few external to Google contributors
to thin LTO upstream.
So I know Facebook has proposed,
and I think they're using internally,
a thin LTO-based similar code merging functionality because they care a lot about code size.
So, yes, long answer, but the short answer is yes, definitely under active development.
You mentioned devirtualization for like, so actually removing virtual functions?
Well, removing virtual function calls.
Okay.
So, and turning them into direct function calls.
So if you can, you know,
analyze your entire class hierarchy
across the whole program,
you know exactly which virtual functions are overridden.
And then you can then look at your particular call sites
and the statically declared type of that call site and know whether it's essentially the final
implementation. Sometimes you can do that without whole program optimization, but in a lot of cases,
you need that whole program optimization to actually see your entire class hierarchy to
guarantee that, yes, I know that I can make a direct call
to that particular virtual function implementation at this call site.
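A hedged sketch of the kind of call this enables (class names invented; the flags follow the upstream Clang options for requesting whole program devirtualization on top of LTO):

```cpp
// shape.h (illustrative)
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

struct Triangle final : Shape {
    int sides() const override { return 3; }
};

// Elsewhere in the program:
int count_sides(const Shape& s) {
    // If whole-program class hierarchy analysis proves Triangle is the only
    // implementation reachable here, this indirect call can become a direct
    // (and inlinable) call to Triangle::sides.
    return s.sides();
}

// Build sketch:
//   clang++ -O2 -flto=thin -fwhole-program-vtables -fvisibility=hidden \
//           a.cpp b.cpp -fuse-ld=lld -o app
```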
So something I've done in the past, probably not a good idea, but something I've done
is loading classes from shared libraries. And so I've got some new type that the compiler didn't know about at all
before.
And I override a virtual function in there.
And I use that as my kind of,
you know,
callback mechanism or whatever from a shared library being loaded.
If I turn on whole program optimization and it de-virtualizes things because
well,
that's all you told it about.
Could that break functionality like that now?
Yes.
Oh, okay.
Yeah, so, right.
So obviously, I mean, you really have to be able to guarantee
that you can see the entire class hierarchy to do this.
So you have to be, you know, you have to apply it where it is,
you know, you have those guarantees.
And actually, so to that end, originally,
so there was a whole program optimization implemented for regular LTO. This was actually
done for Chrome, which uses a security mechanism in the compiler called control flow integrity.
So Peter Collingbourne, who implemented that CFI, also implemented LTO-based whole program devirtualization, which helps reduce the cost.
And he also implemented a sort of a hybrid, partially thin LTO,
partially regular LTO mechanism for doing that.
But internally for Google, we really needed to be fully thin LTO.
So the work that I did more recently was to actually port that over
to make it work for just fully thin LTO
builds. And traditionally, the way that they used this for Chrome, for example,
was they turned on a compiler flag that basically says like your visibility, visibility for
everything marked, it was like visibility hidden, which, I mean,
can be used for other things besides whole program devirtualization, but it also applied to sort of your vtables and basically told LTO which vtables it could assume were basically hidden. And, you know, it saw the entire class hierarchy for those vtables in your sort of LTO unit. And they actually had to go then mark with an attribute
cases that sort of violated that where, for example, shared library might override or add
a child class or an overrider. For our internal use, actually,
so we don't want to have to go and manually mark up code.
It's not scalable.
And also we want to be able to build a source file once into IR
and use it for shared library builds
where it's not legal to do the sort of whole program class hierarchy analysis
and also use the same IR object for statically linked targets where we guarantee that we see
everything. And so some of the work that I did to make this usable inside Google was to basically
defer that decision about whether you have whole program visibility into the link. And then we basically apply an option
when we are building a binary statically
that says, yes, okay, go ahead.
You have your guarantee to have whole program visibility.
And actually that was something I think,
they talked about in the Bazel interview that you had
where some of the advantage of our build system
is we know whether we're building a library or a binary.
That's sort of part of what we,
when we write our build files,
you can give that information about the target.
And so we can leverage that to basically
do different types of link time optimization,
depending on whether you're building a library
or building a statically linked binary,
for example. Okay. So there is a hope that I could still use thin LTO in a project like that,
but not flip the flag that says, oh, and assume you have full visibility of all virtual.
Right. Right. Okay. You could even turn on the option that says do whole program devirtualization, and it would only apply to things, for example, classes defined in anonymous namespaces, where it knows just from the way that the code is written that I'm guaranteed to basically see the entire class hierarchy. Because in those cases you don't need linker information, you don't need a whole program; you know that it's essentially hidden LTO visibility.
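A hedged illustration of that split (names invented): an anonymous-namespace hierarchy needs no linker guarantee at all, while for externally visible hierarchies the guarantee can be asserted at link time, for example with LLD's --lto-whole-program-visibility option when producing a statically linked binary.

```cpp
namespace {
// Visible only inside this translation unit, so the compiler already sees
// every possible override; devirtualizing calls through Logger needs no
// linker information at all.
struct Logger {
    virtual ~Logger() = default;
    virtual void log(const char* msg) { (void)msg; }
};

void note(Logger& l) {
    l.log("hello");  // "hidden LTO visibility" comes for free here
}
}  // namespace

void run_notes() {
    Logger l;
    note(l);
}

// For public hierarchies, the whole-program guarantee can instead be asserted
// when linking a static binary (sketch):
//   clang++ -flto=thin -fwhole-program-vtables -fuse-ld=lld \
//           -Wl,--lto-whole-program-visibility *.o -o app
```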
Right. One of the things you mentioned to us when we were talking via email before this interview was that you're working on sanitizer-like heap profiling. At least that's what it sounded like to me. Is that right?
Okay, so yeah, I probably didn't describe that very well. So we're working on heap profiling built into LLVM that basically uses a sanitizer approach to profiling the memory.
So for example, in LLVM, we have support for various sanitizers like ASAN, address sanitizer.
And it uses something called shadow memory to make that analysis and tracking of memory that
it needs to do faster than other approaches. And so basically, the idea behind shadow memory is
every piece of memory that you allocate has like a, you know, you can do this mapping. Like in ASan it's, I think, eight bytes down to like one byte. I forget their mapping. It's eight to one, I think, so it must be eight bytes down to one byte of shadow memory, flagging whether basically you've done something violating, whether you've accessed memory appropriately or not.
And that ASAN ends up having, I think, a 2x runtime cost,
which is actually pretty low for doing that kind of tracking of memory.
And so basically for heat profiling,
we're leveraging that sort of idea of doing shadow memory.
And so for each piece of memory that you allocate,
the mapping is again eight to one,
but the granularity is a little bit larger.
And so we can track in the shadow memory, the hotness.
So every time you load or store a piece of memory,
it's a very simple instrumentation in the compiler
to say like shift and mask
and like update the count in my shadow memory.
And then when you're
done with the run or when you deallocate memory, you basically grab the shadow memory count. And
now you get like a listing of for every piece of memory I accessed with full context because
the sanitizer also tracks the allocation context. How hot was that data? And then we also have some
tracking in the header too: when you allocate
something, you allocate an additional little piece of header. And again, that's sort of similar to
what ASAN does. And you can track other things in there like the lifetime. So again, you can get at
the end of your run, you can say for this allocation context, what is my hotness? And also
what is my lifetime? You can track that like as an average,
or like a min max, or you know, you can do different types of tracking because
so the idea is that at the end of your run, you have full stack context of allocation sites,
along with average hotness, or, you know, some indication of hotness, some indication of lifetime.
And then eventually, we would like to feed that back into the compiler and either do some transformations in the compiler and or pass information off to the memory allocation runtime
to allocate and handle the memory differently depending on whether it's hot or cold or long lifetime or short lifetime.
So for the hotness, you can get better locality.
For tracking the lifetime, you can get hopefully better reduced memory fragmentation
by basically allocating long-lived memory, say, in one huge page and short-lived memory in a different one.
So there's various things that you can look at doing.
So we proposed this heap profiler upstream in LLVM.
And at this point, the instrumentation side and the runtime has been upstreamed.
And what we haven't done yet is the actual feedback into the compiler.
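A purely conceptual sketch of that shadow-memory idea, not the actual LLVM heap profiler (granularity, offset, and counter width here are illustrative):

```cpp
#include <cstdint>

// Conceptual sketch only: each 8-byte granule of application memory gets an
// 8-byte access counter living in a parallel "shadow" region at a fixed
// offset (all numbers illustrative).
constexpr std::uintptr_t kShadowOffset = 0x0000'4000'0000'0000ULL;

inline std::uint64_t* shadow_counter_for(const void* addr) {
    auto a = reinterpret_cast<std::uintptr_t>(addr);
    // Round down to the 8-byte granule and rebase into the shadow region.
    return reinterpret_cast<std::uint64_t*>((a & ~std::uintptr_t{7}) + kShadowOffset);
}

// What the compiler-inserted instrumentation conceptually does at each access:
inline int counted_load(const int* p) {
    ++*shadow_counter_for(p);  // bump the hotness counter for this granule
    return *p;                 // the original load
}
```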
That sounds hard.
It's, I mean, it's like a matter of engineering.
Profile-guided optimization, but taken kind of to another level, I feel like.
Right.
It's very interesting. Because internally, we use profile guided optimization quite a lot. And we use for our peak optimization customers, like the binaries that really care the most about performance, they use instrumentation based profiling.
For sort of the long tail, we use data collected from hardware counters, which is called auto FDO. And that gives a lot, you know, a good amount of
performance, but not quite the same performance you get with instrumentation based profiling.
So we would like to actually do a single profiling run, and essentially collect both the
your sort of traditional profile guided optimization profile information alongside the heap profile and do a single
feedback and basically use them independently and also sort of together. It's the ultimate vision.
So is the, sorry, the other technologies that you mentioned, like instrumentation
based profile guided optimization, is that stuff that's built into Clang currently? Or LLVM? Mm-hmm.
Oh, okay.
So yeah, so Clang and LLVM have support for both that instrumentation-based profiling. It also has support for feeding back what I was talking about, the auto FDO profiles, which we collect
in production basically using hardware counters. And there's, I think there's, uh, yeah, it depends on your architecture and what hardware you're collecting on, but there's mechanisms for basically converting hardware counters into these sort of AutoFDO profiles that you can then feed back.
Gonna have to look that up later.
Yeah, it's called, um, so of course naming is always interesting. So,
auto FDO is,
because we initially
did this on GCC,
GCC calls it
feedback directed optimization,
FDO,
and just for fun,
Clang calls it
PGO,
profile guided optimization.
So,
inside LLVM,
it's actually called
sample PGO,
which is the same thing as auto FDO.
Just depends on which compiler you're using.
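For reference, the instrumentation-based flavor in Clang looks roughly like this (a sketch; the profile file names are illustrative), and the resulting profile can also be fed into a ThinLTO build:

```cpp
// hot_path.cpp (illustrative workload)
int main() { /* representative workload */ return 0; }

// Step 1: build instrumented and run a representative workload.
//   clang++ -O2 -fprofile-instr-generate hot_path.cpp -o app
//   LLVM_PROFILE_FILE=app.profraw ./app
//
// Step 2: merge the raw profiles into an indexed profile.
//   llvm-profdata merge -o app.profdata app.profraw
//
// Step 3: rebuild using the profile, optionally together with ThinLTO.
//   clang++ -O2 -fprofile-instr-use=app.profdata -flto=thin hot_path.cpp -o app
```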
Okay, I think I have enough search links up here to come back to that later.
Now, what's interesting to me, though, is I mean, you're talking about all these technologies coming together.
We're running a little short on time.
But my experience has been that, you know, PGO will gain me something, but LTO tends to gain me more.
And using them together seems to often be a waste of time on the average application.
But perhaps I'm missing something here.
So actually, we find that the performance we get from the two of them combined is actually better than just adding
together the individual benefits. They're doing slightly different things. I mean, FDO,
PGO is telling you maybe, you know, like which calls are hot, which functions are hot, you know,
which blocks are hot. And so you can do like, you know, you can do smarter inlining with that
information. You can do, you know, code layout with that information. You can do code layout with that information. You can
drive a bunch of optimizations with that. LTO is giving you the ability to do, for example,
inlining across module boundaries. If you combine them, so I have this ability to do inlining across
module boundaries, but if I don't know what's hot, you could make the wrong decisions. So when
you combine it with PGO information, you can do much smarter inlining. So actually your benefit gets magnified. Okay. I think the last time I evaluated this was
even before thin LTO was probably about five years ago. Okay. And also I work on simulation
software. So trying to come up with what is a typical workload is almost impossible because
we don't know how the user,
what they're going to try to simulate.
That's always hard, yeah.
Interesting.
Well, thank you for that diversion there.
Just to go back to thin LTO one more time,
if a listener is already using Clang
and using traditional LTO currently,
is it as easy as swapping out a compiler flag to switch over to thin LTO?
So traditional LTO is -flto.
If you just make that -flto=thin, boom,
it's thin LTO.
Awesome.
So,
and all the linkers that support LLVM Clang's regular LTO support thin LTO,
I mean,
at least all of the publicly available linkers.
I don't know about all the niche ones in there, you know, that aren't public.
But yeah, like Gold and LLD.
And there's also like you can actually do it with GNU LD because it uses a plugin mechanism.
So it just uses the Gold plugin on the LLVM side.
I know there's support there.
I know people do it.
I will say that we don't test that with any build bots on the LLVM side. I know there's support there. I know people do it. I will say that we don't test that with any build bots on the LLVM side,
that combination,
but I'm told it works.
And LD64,
if you're developing for macOS, also supports it.
Okay.
Well,
it's been great having you on the show today,
Teresa.
Thanks for having me.
Where can listeners find you online?
So I'm not very, I don't really do much social media at all.
So really.
That's better for your mental health.
You know, my email, my work email that I also use for LLVM development is tejohnson at google.com.
And that's probably the best way to reach me about this stuff.
Okay.
It's been great having you on the show today.
Thanks.
I'm happy to be here.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter. We'd also like to thank
all our patrons who help support the show through Patreon. If you'd like to support us on Patreon,
you can do so at patreon.com slash cppcast. And of course, you can find all that info and
the show notes on the podcast website