Wednesday, April 20, 2022

Improving text layout performance

So I've been working on improving LO text layout performance, as specified by the TDF tender. As it says, text layout in LO can be rather slow, because of problems like repeated text layout calls for the same text.

Let's have a look at a perf profile for PDF export of the document from bug#116400 :

There are two major costs here:

The first one is splitting text into script runs (separating runs of e.g. latin text from RTL text). About 61% of time is spent in vcl::text::TextLayoutCache. Which is rather strange for something called 'cache'. But this is one of the cases of poor naming, as the class is actually not a cache, it is the result of the script run splitting. It is called TextLayoutCache probably because callers are supposed to cache it and pass the same item to several OutputDevice calls ... which mostly does not happen. So whenever a text is to be drawn, this gets recreated. To make things even worse, it is done for the entire string, even if only a part of it is drawn, so this supposed cache actually makes things slower.

This can be fairly easily fixed by turning the class into an actual cache. Each TextLayoutCache instance depends only on the string, so it's easy to keep a reasonable number of them in one global cache.

The second problem is breaking text into multiple lines at suitable places. In this case the problem was in the ICU library we use. The common scenario when breaking text is finding the first break and then continuing with the same text to find the following break, and so on. ICU code tries to cache this if the position in the text is close enough to the previous one, and if it's not close enough but after the last position, it tries to only walk back a bit instead of repeating the entire work from the beginning of the string. But for whatever strange reason this walking back didn't work, and it walked back until the very beginning. And the test handling the result of the walking back didn't check what the result was and reset the position to whatever the result was. So in practice the breaking almost always started from the beginning, even if the last position was a way more reasonable place to start breaking from. So 26% of time is spent breaking the same text over and over (and it's only 26% because script run splitting is even more expensive).

I've reported this to ICU together with a suggested fix to not reset position to beginning if the last position is better, they've confirmed the problem, but apparently want to look at why the walking back doesn't work in the first place, and there has not been an actual fix from them yet. So I've at least pushed my patch for now.

The resulting perf profile now looks much better:

As can be seen from the number of cycles at the bottom, this is now almost 10x faster (well, ok, 8x to be more precise). The script run splitting can't be seen anymore (it's ~0.1% now), text breaking is still there, but way smaller (6%, and that's 6% of a total that's 8x smaller, so it would be 0.75% compared to the original 26%). Not bad. The PDF generation still takes a couple of seconds (it's 400 pages after all), but it's way faster.

Other problem I noticed while working on this was the related bugreport #144515 (and a couple more that I've closed as its duplicates):

The primary cost here is OutputDevice::ImplLayout(), which lays out text into glyphs and their positions. It is possible to cache this using the SalLayoutGlyphs class, and e.g. Writer has already started caching that for repeated calls, but in this case it's Calc using the EditEngine class, which does no caching.

So as a fix I've moved the caching code to VCL and turned it into a generic SalLayoutGlyphsCache class, and then made this place use that cache ... which didn't really help that much. After investigation it turned out that EditEngine tries to fit the given text into the given paper size (Calc's cell in this case), and so it repeatedly asks to lay out the entire long string in the cell, then breaks the line at the needed width, and then it repeats the same with the rest of the string for the next line, and so on. Which again results in horrible O(N^2) performance that mostly repeats the same over and over again.

But that should be possible to avoid, right? If it's repeatedly the same text, just a subset with increasing starting index, then presumably the glyphs are the same, and their positions are also the same, just at an offset. Well, it's not that simple actually, as in some cases it's not possible to cut glyphs for text at a random place and hope it'll be exactly the same as text layout would give, since text layout may try e.g. to position several adjacent spaces more nicely. But after a couple of attempts Noel pointed out to me that Harfbuzz actually provides information about places where it's not safe to break. Finally, I noticed that the problem with a number of those bugreports is people having small Calc cells with long text, where most of the text is not seen. So for the default case it can be made to show only the start of the text that fits, and so I made it possible for EditEngine to stop breaking lines when the given size is filled in instead of trying to pointlessly process lines that won't be needed.

Again, the resulting profile looks much better:

The operation is now almost 100x times faster (*cough*, ok, 62x) and is just a relatively small part in the total picture. Let's call that good enough.

In other somewhat related news, the fontconfig library has a new stable release that finally includes some performance improvemens when looking up fonts (not updated on its webpage, but available for download in release list).

BTW, since this was a TDF tender work, these improvements have been funded by TDF donations.


1 comment:

  1. It sounds like you have been modest in the 7.4 release notes, the improvements sound amazing!

    ReplyDelete