Thursday, May 23, 2019

Why precompiled headers do (not) improve C++ compile times

Would you like your C++ code to compile twice as fast (or more)?

Yeah, so would I. Who wouldn't. C++ is notorious for taking its sweet time to get compiled. I never really cared about PCHs when I worked on KDE, I think I might have tried them once for something and it didn't seem to do a thing. In 2012, while working on LibreOffice, I noticed its build system used to have PCH support, but it had been nuked, with the usual poor OOo/LO style of a commit message stating the obvious (what) without bothering to state the useful (why). For whatever reason, that caught my attention, reportedly PCHs saved a lot of build time with MSVC, so I tried it and it did. And me having brought the PCH support back from the graveyard means that e.g. the Calc module does not take 5:30m to build on a (very) powerful machine, but only 1:45m. That's only one third of the time.

In line with my previous experience, on Linux that did nothing. I made the build system support also PCH with GCC and Clang, because it was there and it was simple to support it too, but there was no point. I don't think anybody has ever used that for real.

Then, about a year ago, I happened to be working on a relatively small C++ project that used some kind of an obscure build system called Premake I had never heard of before. While fixing something in it I noticed it also had PCH support, so guess what, I of course enabled it for the project. It again made the project build faster on Windows. And, on Linux, it did too. Color me surprised.

The idea must have stuck with me, because a couple weeks back I got the idea to look at LO's PCH support again and see if it can be made to improve things. See, the point is, PCHs for that small project were rather small, it just included all the std stuff like <vector> and <string>, which seemed like it shouldn't make much of a difference, but it did. Those standard C++ headers aren't exactly small or simple. So I thought that maybe if LO on Linux used PCHs just for those, it would also make a difference. And it does. It's not breath-taking, but passing --enable-pch=system to configure reduces Calc module build time from 17:15m to 15:15m (that's a less powerful machine than the Windows one). Adding LO base headers containing stuff like OUString makes it go down to 13:44m and adding more LO headers except for Calc's own leads to 12:50m. And, adding even Calc's headers, results in 15:15m again. WTH?

It turns out, there's some limit when PCHs stop making things faster and either don't change anything, or even make things worse. Trying with the Math module, --enable-pch=system and then --enable-pch=base again improve things in a similar fashion, and then --enable-pch=normal or --enable-pch=full just doesn't do a thing. Where it that 2/3 time reduction --enable-pch=full does with MSVC?

Clang has recently received a new option, -ftime-trace, which shows in a really nice and simple way where the compiler spends the time (take that, -ftime-report). And since things related to performance simply do catch my attention, I ended up building the latest unstable Clang just to see what it does. And it does:
So, this is bcaslots.cxx, a smaller .cxx file in Calc. The first graph is without PCH, the second one is with --enable-pch=base, the third one is --enable-pch=full. This exactly confirms what I can see. Making the PCH bigger should result in something like the 4th graph, as it does with MSVC, but it results in things actually taking longer. And it can be seen why. The compiler does spend less and less time parsing the code, so the PCH works, but it spends more time in this 'PerformPendingInstantiations', which is handling templates. So, yeah, in case you've been living under a rock, templates make compiling C++ slow. Every C++ developer feeling really proud about themselves after having written a complicated template, raise your hand (... that includes me too, so let's put them back down, typing with one hand is not much fun). The bigger the PCH the more headers each C++ file ends up including, so it ends up having to cope with more templates. With the largest PCH, the compiler needs to spend only one second parsing code, but then it spends 3 seconds sorting out all kinds of templates, most of which the small source file does not need.

This one is column2.cxx, a larger .cxx file in Calc. Here, the biggest PCH mode leads to some improvement, because this file includes pretty much everything under the sun and then some more, so less parsing makes some savings, while the compiler has to deal with a load of templates again, PCH or not. And again, one second for parsing code, 4 seconds for templates. And, if you look carefully, 4 seconds more to generate code, most of it for those templates. And after the compiler spends all this time on templates in all the source files, it gets all passed to the linker, which will shrug and then throw most of it away (and that will too take a load of time, if you still happen to use the BFD linker instead of gold/lld with -gsplit-dwarf -Wl,--gdb-index). What a marvel.

Now, in case there seems to be something fishy about the graphs, the last graph indeed isn't from MSVC (after all, its reporting options are as "useful" as -ftime-report). It is from Clang. I still know how to do performance magic ...



7 comments:

  1. > "Now, in case there seems to be something fishy about the graphs, the last graph indeed isn't from MSVC (after all, its reporting options are as "useful" as -ftime-report). It is from Clang. I still know how to do performance magic ..."

    Please tell us the magic incantation!

    ReplyDelete
  2. I think currently the only incantation for you is to chant "find the time to describe the solution and send the patch to Clang devs" :). The 4th graph is indeed my patched Clang, but it fails in some corner cases[*] and my Clang Fu is not up to that. So this blog post is basically a lame glorified way to host the two images for when I send the patch to Clang devs asking for help.

    [*] LibreOffice - Proudly breaking your toolchain(TM).

    ReplyDelete
  3. Okay, let's try the chanting: please do find the time to describe the solution and send the patch to Clang devs!

    PS may your Clang Fu make it to the next level :-)

    ReplyDelete
  4. Interesting benchmark. I wonder if Link Time Code Generation would help with the "let's-generate-the-code-for-the-same-templates-over-and-over-again-and-then-throw-it-all-away-while-linking" issue.

    ReplyDelete
    Replies
    1. Interesting thought. But given that this is about the build time, LTO is not the solution.

      Delete
  5. Hi Luboš, thanks for posting also on planetkde, these kind of blogs are the reason why it's still worth reading it.

    ReplyDelete
    Replies
    1. Thank you for telling me. I was actually a bit hesitant about that, so it'd good for me to know.

      Delete