tag:blogger.com,1999:blog-53762172051980040952024-03-13T02:58:02.453+01:00llunakllunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.comBlogger11125tag:blogger.com,1999:blog-5376217205198004095.post-71530124724683843632021-04-18T20:18:00.002+02:002021-04-18T20:18:32.877+02:00The effect of CPU, link-time (LTO) and profile-guided (PGO) optimizations on the compiler itself<p> In other words, how much faster will a compiler be after it's been built with various optimizations?</p><p>Given the recent Clang12 release, I've decided to update my local build of Clang11 that I've been using for building LibreOffice. I switched to using my own Clang build instead of openSUSE packages at some point in the past because it was faster. I've meanwhile forgotten how much faster :), and openSUSE packages now build with LTO, so I've built Clang12 in several different ways to test the effect, and this is it:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKO52ER_qmFsz6KYqAwV-sstJtLN6mqNVo9oGCkFwufSOFEXISMs3Zr639m7nMCPQKRLvQCH7TDG8o3g3EVBH1Hf3DPHQJnHvxALyLRUOWqBUFyNTBV_jb1t7VY7CeJxmkbGYCSrCOeXJ1/s1920/opts.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1080" data-original-width="1920" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKO52ER_qmFsz6KYqAwV-sstJtLN6mqNVo9oGCkFwufSOFEXISMs3Zr639m7nMCPQKRLvQCH7TDG8o3g3EVBH1Hf3DPHQJnHvxALyLRUOWqBUFyNTBV_jb1t7VY7CeJxmkbGYCSrCOeXJ1/w640-h360/opts.png" width="640" /></a></div><br /><p>The file compiled is LO Calc's document.cxx, a fairly large source file, in a debug LO build. The compilation of the file is always the same; the only thing that differs is the compiler used and whether LO's PCH support is enabled. 
And the items are:</p><ol style="text-align: left;"><li>Base - A release build of Clang12, with (more or less) the default options.</li><li>CPU - As above, with <i>-march=native -mtune=native</i> added.</li><li>LTO - As above, with link-time optimization used. Building Clang this way takes longer.<br /></li><li>LTO+PGO - As above, also with profile-guided optimization used. Building Clang this way takes even longer, as it needs two extra Clang builds to collect the PGO data.</li><li>Base PCH - As Base, and the file is <a href="http://llunak.blogspot.com/2021/04/clang-precompiled-headers-and-improving.html">built with PCH used</a>.</li><li>LTO+PGO PCH - As LTO+PGO, again with PCH used.</li></ol><p>Or, if you want this as numbers, then with Base being 100%, CPU is 85%, LTO is 78%, LTO+PGO is 59%, Base PCH is 37% and LTO+PGO PCH is 25%. Not bad.</p><p>Mind you, this is just for one randomly selected file. YMMV. For the build from <a href="https://www.youtube.com/watch?v=nQnqYLUCGl0">the video</a> from <a href="http://llunak.blogspot.com/2021/04/clang-precompiled-headers-and-improving.html">the last time</a>, the original time of 4m39s with Clang11 LTO PCH goes down to 3m31s for Clang12 LTO+PGO PCH, which is 76%, which is consistent with the LTO->LTO+PGO change above.</p><p><br /></p><p></p><p> <br /></p>llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com3tag:blogger.com,1999:blog-5376217205198004095.post-62673120486652516932021-04-06T15:38:00.000+02:002021-04-06T15:38:15.382+02:00Clang precompiled headers and improving C++ compile times, conclusion<p> With Clang12 almost released, I guess it's high time to write a conclusion to the <a href="http://llunak.blogspot.com/2019/11/clang-precompiled-headers-and-improving.html">Clang11 changes</a> that improve <a href="http://llunak.blogspot.com/2019/05/why-precompiled-headers-do-not-improve.html">compilation times with PCHs</a>. 
I originally planned to do this after the Clang11 release, but the process of getting the changes reviewed and merged was so tedious that when it was finally over I couldn't muster the little extra effort to also write this down (I spent way more time repeatedly writing 'ping' and waiting for a possible reaction than writing the code, which was really demotivating). But although the new options are described in the <a href="https://releases.llvm.org/11.0.1/tools/clang/docs/ReleaseNotes.html#new-compiler-flags">Clang11 release notes</a>, I think it'd be useful to describe them in more detail.</p><p>First of all, I've already <a href="http://llunak.blogspot.com/2019/11/clang-precompiled-headers-and-improving.html">written why C++ developers might care</a>, but a thousand pictures can be worth more than a thousand words when it comes to showing how this can save you as much as 60% of the build time:<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="250" src="https://www.youtube.com/embed/nQnqYLUCGl0" width="480" youtube-src-id="nQnqYLUCGl0"></iframe></div><br /> In case you'd like to see a similar change in your LibreOffice compilation times, it should be as simple as this:<p></p>
And since files that compile fine with a PCH are sometimes missing #include's of headers and thus break non-PCH builds, it's a good idea to touch all your changed files and do '<i>make BLOCK_PCH=1</i>' before pushing your change to Gerrit. This is all you should need to know; the LO build system will take care of everything else.</p><p>As for the rest of the world, this boils down to the two <a href="https://releases.llvm.org/11.0.1/tools/clang/docs/ReleaseNotes.html#new-compiler-flags">PCH-related sections in the Clang11 release notes</a>.</p><p>The first one is using <i>-fpch-instantiate-templates</i>. It needs to be set when building the PCH, but it will work even if you just add it to the CXXFLAGS used for building everything. A <a href="https://discourse.cmake.org/t/cmake-3-19-0-available-for-download/2198/1">recent enough CMake version</a> should handle this option automatically; I have no idea about other build systems. It should be safe to enable the option for any build that uses a PCH. It's not enabled by default in Clang only because of really obscure corner cases with PCHs that are not self-contained. In other words, as long as your PCH works with an empty .cpp file, it's safe, and if your PCH is not self-contained, you'd be better off fixing that anyway.</p><p>The second part, using <i>-fpch-codegen -fpch-debuginfo</i>, is more complicated, as it requires build system support, and I'm not aware of any build system besides LibreOffice's providing it. <a href="https://gitlab.kitware.com/cmake/cmake/-/issues/21133#note_823725">This discussion in a CMake ticket</a> provides an example of how to use the options that seems rather simple. For other build systems, have a look at <a href="https://releases.llvm.org/11.0.1/tools/clang/docs/ReleaseNotes.html#new-compiler-flags">the description in the Clang11 release notes</a> for all the details and possible complications, and consider whether it'd be worth it. 
Which it sometimes may not be, as this graph from my previous post shows (<i>Clang</i> means normal PCH build, <i>Clang+</i> means only <i>-fpch-instantiate-templates</i>, <i>Clang++</i> means all 3 options).<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgga1YCstxMsbb7NZsICaScRbsEN8i0_1xPMUncu7l2AvviYh9cyLs7qXTNAZj-RkK9As5LcSdOM5HUvzsTDzaK8O9q_oMcs_HQpno-nWGZ9MSQMZ4zIq1HGpimUG4Sc-T1n8JMlde1stG1/s863/gr1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="280" data-original-width="863" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgga1YCstxMsbb7NZsICaScRbsEN8i0_1xPMUncu7l2AvviYh9cyLs7qXTNAZj-RkK9As5LcSdOM5HUvzsTDzaK8O9q_oMcs_HQpno-nWGZ9MSQMZ4zIq1HGpimUG4Sc-T1n8JMlde1stG1/w640-h208/gr1.png" width="640" /></a></div><p>Note that you may get rather different results depending on how much you put in
your PCHs. <a href="http://llunak.blogspot.com/2019/05/why-precompiled-headers-do-not-improve.html">Unlike before</a>, now the general rule should be that the more you add to your PCHs, even if it is shared by only several source files, the faster the builds usually will be. And since these options move building some code to the PCH-related object file, the improvement is usually even better for incremental builds than for full rebuilds. I've been using PCHs this way for slightly more than a year, and I already forgot quite some time ago how slow it used to be.</p><p><br /></p>llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com2tag:blogger.com,1999:blog-5376217205198004095.post-25717678895377301502020-01-21T17:19:00.001+01:002020-01-21T17:41:46.708+01:00Skipping functions from entire directories while debugging (e.g. skip all functions from system headers)So, today I got so tired of navigating (or explicitly stepping over) all the internal functions in gdb (you know, all the inline functions from STL containers, from Boost, from this pointer-wrapper class, that string class) that I finally googled 'gdb skip system functions'. And guess what, <a href="https://sourceware.org/gdb/onlinedocs/gdb/Skipping-Over-Functions-and-Files.html">it's been there since gdb 7.12</a>, from 3 years ago, and it's almost trivial, just a matter of adding something like this to <i>~/.gdbinit</i>:<br />
<br />
<blockquote class="tr_bq">
skip -gfi /usr/include/*<br />
skip -gfi /usr/include/*/*<br />
skip -gfi /usr/include/*/*/*<br />
skip -gfi /usr/include/*/*/*/* </blockquote>
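The same idea extends to other directories, for instance if your toolchain's headers live under a custom prefix. The path below is made up for illustration:

```shell
# Append skip globs for a hypothetical toolchain prefix (adjust the path).
cat >> ~/.gdbinit <<'EOF'
skip -gfi /opt/mytoolchain/include/*
skip -gfi /opt/mytoolchain/include/*/*
EOF
```

Inside a gdb session, <i>info skip</i> lists the active skips, and <i>skip disable</i>/<i>skip enable</i> toggle them without editing the file.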
<br />
I feel so stu^H^H^Hproud for catching up only 3 years late.<br />
<br />llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com1tag:blogger.com,1999:blog-5376217205198004095.post-39726169920099764892019-11-09T19:41:00.001+01:002019-11-09T21:01:56.319+01:00Clang precompiled headers and improving C++ compile times, take #2It's been almost half a year since I mentioned how <a href="http://llunak.blogspot.com/2019/05/why-precompiled-headers-do-not-improve.html">precompiled headers do (not) improve C++ compile times</a>. Quite a long time, filled with doing other things, life, occasionally working on getting my patch production-ready and, last but definitely not least, abandoning that patch and starting from scratch again.<br />
It turns out, the problems I mentioned last time had already been more or less solved in Clang. But only for C++ modules, not for precompiled headers. *sigh* I had really mixed feelings when I finally realized that. First of all, not knowing Clang internals that well, it took me quite a long time to get to this point figuring it all out, probably longer than it should have. Second, I've been using C++ modules when building Clang itself and while it's usable, I don't consider it ready (for example, sometimes it actually makes the build slower), not to mention that it's non-trivial to set up, not yet standardized, and other compilers (AFAIK) do not yet support C++ modules. And finally, WTH has nobody else yet noticed and done the work for precompiled headers too? After all the trouble with finding out how the relevant Clang parts work, the necessary patches mostly border on being trivial. Which, on the other hand, is at least the good news.<br />
And so, for building LibreOffice, I'm switching to my patched build of Clang. For the motivation, let's start with an updated picture from the last time:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZfexDFKsVHEF5nGJc8fwDHE6T-sH0HNO7daXUXbRcATP3A2bfMc_DZmSaXPAzcIdsnP7J1I38M-zvWgPX3afh3D-hdyZrUV5EOz4Gg6vgh8T0xI6_bX-lY_gOlmHyYkqBcIj-li-yh2y4/s1600/column2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1412" height="407" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZfexDFKsVHEF5nGJc8fwDHE6T-sH0HNO7daXUXbRcATP3A2bfMc_DZmSaXPAzcIdsnP7J1I38M-zvWgPX3afh3D-hdyZrUV5EOz4Gg6vgh8T0xI6_bX-lY_gOlmHyYkqBcIj-li-yh2y4/s640/column2.png" width="640" /></a></div>
This is again column2.cxx, a larger C++ file from Calc. The first row is again compilation without any PCH involved. The second row is unpatched Clang with <i>--enable-pch=full</i>, showing again that way too large PCHs do not really pay off (here it does, because the source file is large, but for small ones such as bcaslots.cxx <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuJDSiLkbKtkVBWFlnDkjOujFHhNhNK5XleM969qFlqiVZ7RV8_QOaD0t_jhwDq8L38O2oQJI9sjU7gcBKYsNdIVAiNI_zCAWyl2d_iLLYa1xuP4GzcDijjgWvMC-h2KHnutFRHOGZNRtM/s1600/bcaslot.png">shown last time</a> it makes things slower). In case you notice the orange 'DebugType' in the third row that looks like it should be in the second row too, it should be there, but that's one of these patches of mine that the openSUSE package does not have.<br />
The third row is with one patch that performs the <i>PerformPendingInstantiations</i> phase already while building the PCH. The patch is pretty much a one-liner, not counting the fallout from some Clang tests failing because stuff gets slightly reordered. Even now I still don't understand why PCH generation had to delay this phase until every single compilation using the PCH. The commit introducing this had a commit message that didn't make much sense to me, and the test it added works perfectly fine now. Presumably it's been fixed by the C++ modules work. Well, who cares, it apparently works.<br />
The last row adds also Clang options <i>-fmodules-codegen</i> <i>-fmodules-debuginfo</i>. They do pretty much what I was trying to achieve with my original patch, they just approach the problem from a different side (and they also do not have the technical problems that made me abandon my approach ...). They normally work only for C++ modules, so that needed another patch, plus a patch fixing some problems. Since this makes Clang emit all kinds of stuff from the PCH into one specific object file in the hopes that all the compilations using the PCH will need that too but will be able to reuse the shared code instead, LibreOffice now also needs to link with --gc-sections, which throws away all the possibly problematic parts where Clang guessed wrong. But hey, it works. Even with <a href="https://ccache.dev/">ccache</a> and <a href="https://github.com/icecc/icecream">Icecream</a> (if you have the latest Icecream, that is, and don't mind that it "implements" PCHs for remote compilations by simply throwing the PCH away ... it still pays off).<br />
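As background on the linking part mentioned above (this is generic toolchain behavior, not part of the patches): <i>--gc-sections</i> discards code only at section granularity, so compiling with <i>-ffunction-sections -fdata-sections</i> is what gives the linker something to collect. A self-contained sketch with a made-up file, using the generic <i>cc</i> driver:

```shell
# unused() gets its own section thanks to -ffunction-sections,
# and --gc-sections then drops it from the final binary.
cat > gcdemo.c <<'EOF'
int unused(void) { return 1; }   /* expected to be garbage-collected */
int main(void) { return 0; }
EOF
cc -ffunction-sections -fdata-sections -c gcdemo.c -o gcdemo.o
cc -Wl,--gc-sections gcdemo.o -o gcdemo
./gcdemo
```

Running <i>nm</i> on the result should show that <i>unused</i> is gone, which is exactly the effect relied on for throwing away the PCH-emitted code the compiler guessed wrong about.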
So, that's it for a single compilation. How much does it help with building in practice? Time for more pretty colorful pictures:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmkt7C9ez3R8bLZ1N8JMf10Ghw-ZmpXfoPhvEsisTmYEyPuodEvCG9zyN9SgrpE4ML7EQYbQ0iB1ZLz8xw2WIsUDTuwSGfTc667Y_uXAo4xhYUYhVscZO0schSnAGPCZjau1HAGTVmR0N1/s1600/gr1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="280" data-original-width="863" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmkt7C9ez3R8bLZ1N8JMf10Ghw-ZmpXfoPhvEsisTmYEyPuodEvCG9zyN9SgrpE4ML7EQYbQ0iB1ZLz8xw2WIsUDTuwSGfTc667Y_uXAo4xhYUYhVscZO0schSnAGPCZjau1HAGTVmR0N1/s1600/gr1.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
This is a debug LO build on my 4-core (8 HT) Ryzen laptop; <i>Library_sm</i> is relatively small (36 source files) and <i>Library_scfilt</i> is larger (154 source files). Plain 'Clang' means unpatched Clang(v9), 'Clang+' is with the <i>PerformPendingInstantiations</i> patch (i.e. the third row above), 'Clang++' is both patches (i.e. the fourth row above). The setting is either <i>--enable-pch=base</i> for including only system and base LO headers in the PCH, or <i>--enable-pch=full</i> for including everything that makes sense. It clearly shows that using large PCHs with GCC or unpatched Clang just doesn't make sense.<br />
Note that GCC(v9) and MSVC(v2017) are there more as a reference than a fair comparison. MSVC runs on a different OS and the build may possibly be slightly handicapped by some things taking longer in Cygwin/Windows. GCC comes from its openSUSE package, which AFAICT is built without <a href="https://en.wikipedia.org/wiki/Interprocedural_optimization">LTO</a> (unlike the Clang package, where it makes a noticeable difference).<br />
And in case the graphs don't seem impressive enough, here's one for <i>Library_sc</i>, which with its 598 source files is too big for me to bother measuring in all cases. This is the difference PCHs can make. That's 11:37 to 4:34, almost down to one third:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEja2QXwTLFPuvxKbVqSKtcmCkQYxSMd2h5cha7m0ysDF_G3hBNvRIdCbgZjHDwyiqaPKXI1eYnzHcEtae1zToMqldiq3GxPgw3yLj0qcOujECDWTulwbfMjKb2y0WhWG9Q6v5nv171Gd31T/s1600/gr2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="260" data-original-width="447" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEja2QXwTLFPuvxKbVqSKtcmCkQYxSMd2h5cha7m0ysDF_G3hBNvRIdCbgZjHDwyiqaPKXI1eYnzHcEtae1zToMqldiq3GxPgw3yLj0qcOujECDWTulwbfMjKb2y0WhWG9Q6v5nv171Gd31T/s1600/gr2.png" /></a></div>
As for building the entire LO from scratch, it can be like in the picture below (or even better). The effect there is smaller, because the build consists of other things than just building libraries, and some of the code built doesn't use PCHs. And it's even smaller than it could be, because I used <i>--enable-pch=base</i>, as that's what I've been using up to now (although now I'll switch to a higher level). That's about 1h42m without PCHs to 1h14m with unpatched Clang (27% saved), and 1h06m with patches (and the 8-minute difference is still 11% of the unpatched time). Not bad, given that this is the entire LO build. Those 6 minutes for ccache are there to show the maximum possible improvement (or rather nowhere near possible, since the compiler normally still has to do the work of actually compiling the code somehow).<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEig3MGD5OOVRlKqK5oLt412WwAAg_MdGiaQNaiO-fCE1qK-HM-wDt8-dfFRiynjYmCpts3S7Dp4eoc1qNluB3BA8-lybU8uLofb_QVrxs6eRzIU5dPgZgZpOFx3-8DN9obcJAEvbfcrpvm-/s1600/gr3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="255" data-original-width="432" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEig3MGD5OOVRlKqK5oLt412WwAAg_MdGiaQNaiO-fCE1qK-HM-wDt8-dfFRiynjYmCpts3S7Dp4eoc1qNluB3BA8-lybU8uLofb_QVrxs6eRzIU5dPgZgZpOFx3-8DN9obcJAEvbfcrpvm-/s1600/gr3.png" /></a></div>
In case you'd want to use this too, that's not up to me now. The patches are now sitting and waiting in the LLVM Phabricator. Hopefully somebody there still cares about PCHs too.<br />
<br />llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com2tag:blogger.com,1999:blog-5376217205198004095.post-8679929672378203952019-09-11T23:57:00.000+02:002019-09-21T12:48:18.086+02:00Icecream 1.3 and Icemon 3.3 releasedA new <a href="https://github.com/icecc/icecream/releases/tag/1.3">version 1.3</a> of the distributed C/C++ compilation tool <a href="https://github.com/icecc/icecream">Icecream</a> has been released. To accompany it, <a href="https://github.com/icecc/icemon/releases/tag/v3.3">version 3.3</a> of the GUI monitor <a href="https://github.com/icecc/icemon">Icemon</a> has been released as well.<br />
<br />
The changelogs are <a href="https://github.com/icecc/icecream/blob/1.3/NEWS">here</a> and <a href="https://github.com/icecc/icemon/blob/v3.3/CHANGELOG.md">here</a>. In a less changelog-y way, the changes are:<br />
<ul>
<li>Compiler locations are no longer hardcoded anywhere. Previously the compiler automatically packaged and sent to remote nodes was always <i>/usr/bin/gcc</i> (<i>g++</i>, <i>clang</i>, <i>clang++</i>). That might not match the actual compiler used, and the workaround was to manually package the proper one using <i>icecc-create-env</i>. But now it's possible to build even with e.g. <i>CXX=/my/own/build/of/clang</i> and it'll simply work. This should also mean that explicitly setting <i>$ICECC_VERSION</i> now should be needed only for cross-compiling.</li>
<li>Slightly better job scheduling, both for remote and local builds. For example, the local machine should no longer get overloaded by running way too many local preprocessor steps.</li>
<li>Better compression, both for sending data and packaged compilers. Compilation data is compressed using zstd if the other node supports it, and compiler environments can be compressed using zstd or xz. This improves performance by reducing both network and CPU usage. Note that while compilation compression falls back to the older method if not supported by the other side, for compiler environments this is more tricky and so it has to be set up manually. You can set e.g. <i>ICECC_ENV_COMPRESSION=xz</i>, but the daemon will not fall back to using any other mechanism. Which means it will use only nodes that are at least version 1.3, the scheduler should also be from 1.3 (run another one if needed, the newest one wins) and the remote node needs to support the compression (1.3 newly uses libarchive, which supports zstd only in its relatively recent releases). So this is mainly useful if you have full control over the Icecream cluster, but by default the compression is the old gzip, for backwards compatibility.</li>
<li>Speaking of which, the maximum cache size for compiler environments now defaults to 256MiB. Use the <i>--cache-size</i> option of <i>iceccd</i> for different sizes. </li>
<li>Objective C/C++ support has been fixed.</li>
<li>Some special workarounds for GCC's <i>-fdirectives-only</i> option that is used when sending sources to remote nodes, as it breaks in some corner cases.</li>
<li>The <i>--interface</i> option of the daemons (and scheduler) now allows binding only to a specific network interface, if needed. Note that Icecream still assumes it runs in a trusted network and if that's not so it's up to you to ensure it by using tools such as a firewall.</li>
<li>Icemon now displays in the detailed host view what protocol a node supports (1.3 has protocol version 42, <i>env_xz</i>/<i>env_zstd</i> mean it supports compiler environments compressed using xz/zstd). </li>
<li>And various other fixes. </li>
</ul>
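Putting a few of the items above together, a hypothetical 1.3 setup could look like this (the compiler path is the example from above, the job count is illustrative, and this assumes an icecream cluster where daemons and scheduler are all 1.3):

```shell
# Compiler environments will be compressed with xz; only honored by >=1.3 nodes.
export ICECC_ENV_COMPRESSION=xz
# icecream >=1.3 automatically packages whatever compiler is invoked through icecc.
make -j32 CXX="icecc /my/own/build/of/clang"
```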
llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com0tag:blogger.com,1999:blog-5376217205198004095.post-25323076012071764372019-05-23T23:22:00.000+02:002019-05-23T23:22:40.665+02:00Why precompiled headers do (not) improve C++ compile timesWould you like your C++ code to compile twice as fast (or more)?<br />
<br />
Yeah, so would I. Who wouldn't. C++ is notorious for taking its sweet time to get compiled. I never really cared about PCHs when I worked on KDE; I think I might have tried them once for something and it didn't seem to do a thing. In 2012, while working on LibreOffice, I noticed its build system used to have PCH support, but it had been nuked, with the usual poor OOo/LO style of a commit message stating the obvious (what) without bothering to state the useful (why). For whatever reason, that caught my attention; reportedly PCHs saved a lot of build time with MSVC, so I tried it, and they did. And my having brought PCH support back from the graveyard means that e.g. the Calc module does not take 5:30m to build on a (very) powerful machine, but only 1:45m. That's only one third of the time.<br />
<br />
In line with my previous experience, on Linux that did nothing. I made the build system also support PCH with GCC and Clang, because the support was already there and it was simple to extend, but there was no point. I don't think anybody has ever used that for real.<br />
<br />
Then, about a year ago, I happened to be working on a relatively small C++ project that used an obscure build system called <a href="https://premake.github.io/">Premake</a> that I had never heard of before. While fixing something in it I noticed it also had PCH support, so guess what, I of course enabled it for the project. It again made the project build faster on Windows. And, on Linux, it did too. Color me surprised.<br />
<br />
The idea must have stuck with me, because a couple of weeks back I got the idea to look at LO's PCH support again and see if it can be made to improve things. See, the point is, the PCH for that small project was rather small: it just included all the std stuff like <i><vector></i> and <i><string></i>, which seemed like it shouldn't make much of a difference, but it did. Those standard C++ headers aren't exactly small or simple. So I thought that maybe if LO on Linux used PCHs just for those, it would also make a difference. And it does. It's not breathtaking, but passing <i>--enable-pch=system</i> to configure reduces Calc module build time from 17:15m to 15:15m (that's a less powerful machine than the Windows one). Adding LO base headers containing stuff like OUString makes it go down to 13:44m, and adding more LO headers except for Calc's own leads to 12:50m. And adding even Calc's own headers results in 15:15m again. WTH?<br />
<br />
It turns out, there's some limit where PCHs stop making things faster and either don't change anything, or even make things worse. Trying with the Math module, <i>--enable-pch=system</i> and then <i>--enable-pch=base</i> again improve things in a similar fashion, and then <i>--enable-pch=normal</i> or <i>--enable-pch=full</i> just doesn't do a thing. Where is that 2/3 time reduction that <i>--enable-pch=full</i> achieves with MSVC?<br />
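For reference, the measurements above map to LibreOffice configure switches like these (a sketch; <i>autogen.sh</i> is LO's configure wrapper, and any other options are whatever your build already uses):

```shell
# Pick one PCH level when configuring LibreOffice:
./autogen.sh --enable-pch=system   # only system/std headers in the PCH
./autogen.sh --enable-pch=base     # + base LO headers (OUString and the like)
./autogen.sh --enable-pch=normal   # + most other LO headers
./autogen.sh --enable-pch=full     # + the module's own headers
```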
<br />
Clang has recently received <a href="https://aras-p.info/blog/2019/01/16/time-trace-timeline-flame-chart-profiler-for-Clang/">a new option, -ftime-trace</a>, which shows in a really nice and simple way where the compiler spends the time (take that, -ftime-report). And since things related to performance simply do catch my attention, I ended up building the latest unstable Clang just to see what it does. And it does:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuJDSiLkbKtkVBWFlnDkjOujFHhNhNK5XleM969qFlqiVZ7RV8_QOaD0t_jhwDq8L38O2oQJI9sjU7gcBKYsNdIVAiNI_zCAWyl2d_iLLYa1xuP4GzcDijjgWvMC-h2KHnutFRHOGZNRtM/s1600/bcaslot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="1283" height="167" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuJDSiLkbKtkVBWFlnDkjOujFHhNhNK5XleM969qFlqiVZ7RV8_QOaD0t_jhwDq8L38O2oQJI9sjU7gcBKYsNdIVAiNI_zCAWyl2d_iLLYa1xuP4GzcDijjgWvMC-h2KHnutFRHOGZNRtM/s320/bcaslot.png" width="320" /></a></div>
So, this is bcaslots.cxx, a smaller .cxx file in Calc. The first graph is without PCH, the second one is with <i>--enable-pch=base</i>, the third one is <i>--enable-pch=full</i>. This exactly confirms what I can see. Making the PCH bigger should result in something like the 4th graph, as it does with MSVC, but instead it results in things actually taking longer. And it can be seen why. The compiler does spend less and less time parsing the code, so the PCH works, but it spends more time in this 'PerformPendingInstantiations', which is handling templates. So, yeah, in case you've been living under a rock, templates make compiling C++ slow. Every C++ developer feeling really proud about themselves after having written a complicated template, raise your hand (... that includes me too, so let's put them back down, typing with one hand is not much fun). The bigger the PCH, the more headers each C++ file ends up including, so it ends up having to cope with more templates. With the largest PCH, the compiler needs to spend only one second parsing code, but then it spends 3 seconds sorting out all kinds of templates, most of which the small source file does not need.<br />
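Reproducing such graphs is easy to try yourself; <i>-ftime-trace</i> makes Clang write a Chrome-tracing JSON file next to the object file (the source file name here is just the example from this post):

```shell
# Needs Clang 9 or newer; produces bcaslots.json next to bcaslots.o.
clang++ -ftime-trace -c bcaslots.cxx -o bcaslots.o
# Load the .json in Chrome's chrome://tracing viewer to get a flame chart.
```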
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEje22PVqnYaOXugmGgS3GAelLoMXqKh0zUOF3LFmg_yGSn9RcL8U9CY88K75EDUZDb-AuXtr6H6BN-t6_AepZgwbX94f_JVjjI1uHAdjpQ0um3geTXmG9EJxOd2mX4Oym2PeY0V2HA2jTN4/s1600/column2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="696" data-original-width="1300" height="171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEje22PVqnYaOXugmGgS3GAelLoMXqKh0zUOF3LFmg_yGSn9RcL8U9CY88K75EDUZDb-AuXtr6H6BN-t6_AepZgwbX94f_JVjjI1uHAdjpQ0um3geTXmG9EJxOd2mX4Oym2PeY0V2HA2jTN4/s320/column2.png" width="320" /></a></div>
This one is column2.cxx, a larger .cxx file in Calc. Here, the biggest PCH mode leads to some improvement, because this file includes pretty much everything under the sun and then some more, so less parsing makes some savings, while the compiler has to deal with a load of templates again, PCH or not. And again, one second for parsing code, 4 seconds for templates. And, if you look carefully, 4 seconds more to generate code, most of it for those templates. And after the compiler spends all this time on templates in all the source files, it all gets passed to the linker, which will shrug and then throw most of it away (and that too will take a load of time, if you still happen to use the BFD linker instead of <a href="https://lists.freedesktop.org/archives/libreoffice/2018-July/080484.html">gold/lld</a> with <a href="https://lists.freedesktop.org/archives/libreoffice/2018-June/080437.html">-gsplit-dwarf -Wl,--gdb-index</a>). What a marvel.<br />
<br />
Now, in case there seems to be something fishy about the graphs, the last graph indeed isn't from MSVC (after all, its reporting options are as "useful" as -ftime-report). It is from Clang. I still know how to do performance magic ...<br />
<br />
<br />
<br />llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com7tag:blogger.com,1999:blog-5376217205198004095.post-76953395663629711352019-05-20T16:18:00.001+02:002019-05-29T11:59:11.904+02:00Linux perf and KCachegrindIf you occasionally do performance profiling as I do, you probably know Valgrind's Callgrind and the related UI KCachegrind. While Callgrind is a pretty powerful tool, running it takes quite a while (not exactly fun to do with something as big as e.g. LibreOffice).<br />
<br />
Recently I finally gave <a href="https://perf.wiki.kernel.org/index.php/Tutorial">Linux perf</a> a try. I'm not quite sure why I didn't use it before; IIRC, when I tried it at some point long ago, it was probably difficult to set up or something. Using <i>perf record</i> has very little overhead, but I wasn't exactly thrilled by <i>perf report</i>. I mean, it's a text UI, and it just gives a list of functions, so if I want to see anything close to a call graph, I have to manually expand one function, expand another function inside it, expand yet another function inside that, and so on. Not that it wouldn't work, but compared to just looking at what KCachegrind shows and seeing ...<br />
<br />
When figuring out how to use perf, while watching a <a href="https://www.kdab.com/linux-perf-for-qt-developers/">talk from Milian Wolff</a>, I noticed a mention of a Callgrind script on one of the slides. Of course I had to try it. It was a bit slow, but hey, I could finally look at perf results without feeling like it was an effort. Well, and then I improved the part of the script that was slow, so I guess I've just put the effort elsewhere :).<br />
<br />
And I thought this little script might be useful for others. After I mailed Milian, it turned out he had created the script just as a proof of concept and wasn't interested in it anymore, having moved on to developing <a href="https://github.com/KDAB/hotspot">Hotspot</a> as a UI for perf. Fair enough, but I think I still prefer KCachegrind; I'm used to it, and I don't have to switch UIs when switching between perf and Callgrind. So, with his agreement, I've <a href="https://phabricator.kde.org/D21306">submitted</a> the script to KCachegrind. If you find it useful, just <a href="https://phabricator.kde.org/file/data/ekcwtkoa4drvbg6tjx3f/PHID-FILE-psttnd5c7exjppr6mzvt/converters_perf2calltree.py">download this</a> and do something like:<br />
<br />
<code>$ perf record -g ...</code><br />
<code>$ perf script -s perf2calltree.py > perf.out</code><br />
<code>$ kcachegrind perf.out</code><br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBasMXPZmJro72WXpqEHAkLhH3g5-brvJ-C3pFhpgsmwElbSHmcGzJ21wFL1GWGH3CYs-NgLy28CwX54Urmq0kria0UiWp_xVngm2N7MJKKjX_sjn1aH-80Xk8IAh6g98B-o1DTbd9ulzy/s1600/Screenshot_20190520_161100.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="918" data-original-width="1494" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBasMXPZmJro72WXpqEHAkLhH3g5-brvJ-C3pFhpgsmwElbSHmcGzJ21wFL1GWGH3CYs-NgLy28CwX54Urmq0kria0UiWp_xVngm2N7MJKKjX_sjn1aH-80Xk8IAh6g98B-o1DTbd9ulzy/s320/Screenshot_20190520_161100.png" width="320" /></a></div>
<br />llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com1tag:blogger.com,1999:blog-5376217205198004095.post-24537518935334715452014-02-18T09:31:00.000+01:002014-02-18T09:31:29.995+01:00$ICECC_VERSION<p>It's been brought to my attention that the <a href="http://github.com/icecc/icecream">Icecream</a> documentation more or less suggests it is necessary to manually set up $ICECC_VERSION (which also involves creating the environment tarball with the compiler and so on). That is incorrect. I've already updated the documentation to say that, like with pretty much everything, Icecream simply figures it out itself by default.</p>
<p>
So if you happen to use $ICECC_VERSION, unless you know why you do that (e.g. cross-compilation), don't. It's not only simpler but also better to leave it up to Icecream to package the system compiler as necessary, as it simply works, and avoids possible problems (such as updating the system compiler and forgetting to update the tarball).</p>
llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com0tag:blogger.com,1999:blog-5376217205198004095.post-1429622442889522102013-07-12T09:48:00.000+02:002013-07-12T09:48:05.037+02:00Are you enjoying your Icecream?A bug has crept into the <a href="http://github.com/icecc/icecream">Icecream</a> 1.0 release that makes the daemon crash on its first start after boot if /var/run is cleared (e.g. when it is tmpfs-mounted by systemd). In practice this means that on such systems Icecream does not work unless started manually. So in case your compiles have felt a tad slower recently, upgrade to the just released <a href="ftp://ftp.suse.com/pub/projects/icecream/icecc-1.0.1.tar.bz2">version 1.0.1</a>. If you run openSUSE, an online update with the fix has already been published.
llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com0tag:blogger.com,1999:blog-5376217205198004095.post-84546951520963245652013-07-11T10:19:00.002+02:002013-07-11T12:38:03.241+02:00Clang's AST dump AKA 'WTH is the compiler doing???'It's no secret that I use the <a href="http://clang.llvm.org/">Clang</a> compiler for development. Although GCC is still somewhat better when things like the performance of the resulting code matter, there are other features that matter more during development. And although, again, competition helps (it's not difficult to guess where the inspiration for the new error reporting in GCC 4.8 comes from), there are features where I expect it'd be hard for GCC to match Clang. The capabilities and ease of writing <a href="https://wiki.documentfoundation.org/Clang_plugins">Clang plugins</a> is one thing, but there are more hidden secrets, like the <a href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">AST</a> dump.<br />
<br />
If Clang is also invoked with the <i>-Xclang -ast-dump</i> options, it'll dump its internal representation of the compiled source. That can be pretty useful when the source code doesn't actually mean what one expects, or when something unexpected interferes from elsewhere. Consider the following (simple, for clarity) example:<br />
<br />
<code>
#include <iostream><br />
using namespace std;<br />
<br />
class A<br /> {<br /> };<br />
<br />
class B<br /> {<br /> public:<br /> operator A() const { return A(); }<br /> };<br />
<br />
class C<br /> : public B<br /> {<br /> };<br />
<br />
void foo( const A& )<br /> {<br /> cout << "A" << endl;<br /> }<br />
<br />
void foo( B& )<br /> {<br /> cout << "B" << endl;<br /> }<br />
<br />
int main()<br /> {<br /> foo( C());<br /> } </code><br />
<br />
Looking only at class C, it may come as a surprise to some that this prints "A" and not "B". Overlooking the missing const, or not knowing that it prevents binding the temporary to the non-const reference, certainly adds to the surprise; but even so, what is actually going on? In a larger codebase, finding that out can take a lot of time. But finding out what the compiler thinks about the code can help:<br />
<br />
<code>
$ clang++ -Wall a.cpp -Xclang -ast-dump</code><br />
<code>...<br />
`-FunctionDecl 0x90ffb90 <line:29:1, line:32:5> main 'int (void)'<br />
`-CompoundStmt 0x9100078 <line:30:5, line:32:5><br />
`-CallExpr 0x90fffb8 <line:31:5, col:13> 'void'<br />
|-ImplicitCastExpr 0x90fffa8 <col:5> 'void (*)(const class A &)' <FunctionToPointerDecay><br />
| `-DeclRefExpr 0x90fff74 <col:5> 'void (const class A &)' lvalue Function 0x90fec80 'foo' 'void (const class A &)'<br />
`-MaterializeTemporaryExpr 0x9100068 <col:10, col:12> 'const class A' lvalue<br />
`-ImplicitCastExpr 0x9100058 <col:10, col:12> 'const class A' <NoOp><br />
`-ImplicitCastExpr 0x9100048 <col:10, col:12> 'class A' <UserDefinedConversion><br />
`-CXXMemberCallExpr 0x9100028 <col:10, col:12> 'class A'<br />
`-MemberExpr 0x9100008 <col:10, col:12> '<bound member function type>' .operator A 0x90fe740<br />
`-ImplicitCastExpr 0x90ffff8 <col:10, col:12> 'const class B' <UncheckedDerivedToBase (B)><br />
`-CXXTemporaryObjectExpr 0x90ffdd8 <col:10, col:12> 'class C' 'void (void)' zeroing<br />
</code><br />
<br />
Knowing a bit about how compilers work helps a lot, but even without it this is quite simple to read. Reading from the bottom up, a temporary object of class C is created and cast to its base class B. That's the expected part; the unexpected part is three AST nodes up, which show that the object is converted to class A by a user-defined conversion using operator A(). Which, as the rest of this part of the AST dump shows, results in calling foo( const A& ). Mystery solved.<br />
<br />
(Fun trivia: I once helped a GCC developer to disentangle a problem in a complex C++ testsuite using this. But don't tell anyone ;) . )<br />
<br />
<br />
<br />llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com2tag:blogger.com,1999:blog-5376217205198004095.post-69765498291311606902013-04-08T17:29:00.001+02:002013-04-08T17:32:02.442+02:00Icecream 1.0.0 releasedAfter almost 10 years since the first version, <a href="ftp://ftp.suse.com/pub/projects/icecream/icecc-1.0.0.tar.bz2">version 1.0.0</a> of the Icecream distributed build tool has been released.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://imgs.xkcd.com/comics/compiling.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="278" src="http://imgs.xkcd.com/comics/compiling.png" width="320" /></a></div>
<br />
Yes, it's been almost a decade with us. And as it usually is with versions 1.0.0 released after such a long time, it doesn't actually bring anything breathtaking. But there have of course been some fixes and improvements since version 0.9.7, and in fact even one larger feature found its way in: out-of-the-box support for the <a href="http://clang.llvm.org/">Clang</a> compiler, including support for its plugins. Written by yours truly, after discovering this compiler and finding it was pretty difficult to get it to work with Icecream 0.9.7 in any reasonable way. And also being the reason for repeatedly bugging Coolo about another Icecream release :) .<br />
<br />
And, on the way to 1.0.0, the development repository has been moved to <a href="http://github.com/icecc/icecream">GitHub</a>. Which should be good, as IMO the fact that it used to be developed in the KDE SVN helped to create the false impression that it is somehow specific to KDE, limiting its use among developers of other projects. This is probably actually the biggest feature of the 1.0.0 release.<br />
<br />
So, thanks to everyone who has helped to make compiling a much more pleasant experience (possibly even with the added colors in the icecream monitor ;) ).<br />
<br />
<br />
<br />
<br />llunakhttp://www.blogger.com/profile/14371981942658278925noreply@blogger.com4