Thursday, May 12, 2022

Improving Calc support for 16384 columns

So I enabled support for up to 16384 columns in Calc by default some time ago, but just getting it to work was not necessarily the end of the work. Making Calc have 16 times more columns means that any operation that works on entire columns is suddenly 16 times slower, or even worse. Similarly this could easily lead to 16x more memory used. So the support not only needs to work, but it also needs to be usable.

It theory adding a number of empty columns to the end of a spreadsheet should not make a difference, but in practice it does. With 1024 columns it is not as necessary to ignore those empty columns as it is with 16k, and a lot of the code dates back to the times when Calc supported even fewer colums (256?), where a being little inefficient here or there didn't show. But now it suddently did.

For example, if you protect or hide all unused columns until the end of the spreadsheet, then hitting the right arrow key on the last accessible cell makes Calc check all cells to the right for whether it's possible to go into them. And checking whether a column is hidden requires searching the list of column information, which is not trivial (it's compacted in order not to waste memory). The barely noticeable cost of this with 1024 columns got large enough to cause noticeable delays. Fortunately the ColHidden() function is smart enough to return the first and last column in the compacted range where the flag is equal, the code doing the cursor navigation just up until now didn't bother using that information, but now it needed to do so.

Another example, and that's quite a large topic, is allocating columns. If most of those new columns are not actually used, then it makes sense to allocate them only when needed, right? That will save memory, and it will make things faster too, because there is no need to check those empty columns. That idea got implemented back when this 16k work was started by others, adding e.g. function GetColumnsRange() that clamped the range to the allocated columns, but the problem is that in practice this is not as simple as that.

One of the issues here is let's say the case of selecting an entire row (clicking the row number to the left of the table does that easily) and then hitting Ctrl+B to make the entire row bold. That should not clamp the column range to the allocated columns, because if I later enter something into cells in those columns, I expect that to be bold. But if Calc allocates all columns for this, maybe I do not intend to enter values anywhere else except the first rows, so allocating all columns will be a waste. The solution to this is having default column data. The ScTable class now, besides having a list of allocated ScColumn's also has a ScColumnData member that stores some data for all not-yet allocated columns. Set the bold flag for all allocated columns and also in the default, and problem solved.

Except then, GetColumnsRange() clamping to allocated columns becomes incorrect, because now it's possible to have set data even beyond allocated columns, such as this bold flag. So I changed GetColumnsRange() to simply return the given range, without any adjustments, and then added the better-named GetAllocatedColumnsRange() for cases where the code knows it wants only the allocated range.

Somewhat similarly to the bold case, merely showing or going to an unallocated column should not allocate it. Otherwise hit e.g. Ctrl+Right one time too many and the cursor going to column XFD would make all columns get allocated. But that causes yet another complication - I am now at an unallocated column and all operations should either detect the column is not allocated and return, or allocate the column if needed. The initial 16k work added CreateColumnIfNotExists() exactly to protect such accesses and allocate the column if needed. It's just that this needed adding to quite many places, and some were still missing it, and others were calling it unnecessarily causing unnecessary column allocations. So I needed to work on these over time. I eventually went as far as change Calc to initially allocate just one column. Since before that Calc used to allocate 64 columns by default, a number of places missing such checks kept working because normally people didn't work with more than 64 columns (and so this 64 default was a reasonable choice at the time, as there was really a lot to check and fix). Now that I have changed this to just one column and fixed all tests, it looks like I've rooted them all out (at least I'm still getting only very few bugreports about something breaking :) ).

Drawing, somewhat unexpectedly, turned out to be a possible performance problem too. There are few ways in which cells to the left can affect drawing of cells to the right. If you enter a too-long text into a cell, it will overflow to the right, into the space of the next cell, or possibly even several cells. So when Calc is drawing let's say a couple of cells around the 10000th column, it actually needs to check also all the 10000 columns before. Somebody back in the day thought about optimizing it, and so before Calc draws cells, function FillInfo() first collects information about all the cells to draw and also all the cells to the left. What possibly(?) was an optimization with 256 or 1024 column is a problem with 16384 columns. Even allocating and clearing all the memory actually had a noticeable performance impact. Sadly, as sometimes happens to be the case with optimizations from the OpenOffice.org times, whoever wrote this made it slow. Function FillInfo() collects all data necessary for drawing a cell into struct CellInfo, and all that info is collected also for all the cells to the left, even though most of it is not used for them. So I had to find out what was necessary and split that out (and provide proper abstraction, because real programmers back in the day used direct data access, right).


 Some of the problems can be even a bit funny. Have you created e.g. a named range called DAY1, NUM1, LOG10 or even DOG10? Well, now you can't, since now those are valid cell addresses, going up to XFD1. So Calc now needed special backwards compatibility code for this.

I expect the real test of this comes when it becomes part of the LibreOffice 7.4 release or Collabora Online. But so far it seems to work rather well.

This work is funded/sponsored by DEVxDAO as part of its mission to support open source and transparent research and development of emerging technologies and frameworks.


Wednesday, April 20, 2022

Improving text layout performance

So I've been working on improving LO text layout performance, as specified by the TDF tender. As it says, text layout in LO can be rather slow, because of problems like repeated text layout calls for the same text.

Let's have a look at a perf profile for PDF export of the document from bug#116400 :

There are two major costs here:

The first one is splitting text into script runs (separating runs of e.g. latin text from RTL text). About 61% of time is spent in vcl::text::TextLayoutCache. Which is rather strange for something called 'cache'. But this is one of the cases of poor naming, as the class is actually not a cache, it is the result of the script run splitting. It is called TextLayoutCache probably because callers are supposed to cache it and pass the same item to several OutputDevice calls ... which mostly does not happen. So whenever a text is to be drawn, this gets recreated. To make things even worse, it is done for the entire string, even if only a part of it is drawn, so this supposed cache actually makes things slower.

This can be fairly easily fixed by turning the class into an actual cache. Each TextLayoutCache instance depends only on the string, so it's easy to keep a reasonable number of them in one global cache.

The second problem is breaking text into multiple lines at suitable places. In this case the problem was in the ICU library we use. The common scenario when breaking text is finding the first break and then continuing with the same text to find the following break, and so on. ICU code tries to cache this if the position in the text is close enough to the previous one, and if it's not close enough but after the last position, it tries to only walk back a bit instead of repeating the entire work from the beginning of the string. But for whatever strange reason this walking back didn't work, and it walked back until the very beginning. And the test handling the result of the walking back didn't check what the result was and reset the position to whatever the result was. So in practice the breaking almost always started from the beginning, even if the last position was a way more reasonable place to start breaking from. So 26% of time is spent breaking the same text over and over (and it's only 26% because script run splitting is even more expensive).

I've reported this to ICU together with a suggested fix to not reset position to beginning if the last position is better, they've confirmed the problem, but apparently want to look at why the walking back doesn't work in the first place, and there has not been an actual fix from them yet. So I've at least pushed my patch for now.

The resulting perf profile now looks much better:

As can be seen from the number of cycles at the bottom, this is now almost 10x faster (well, ok, 8x to be more precise). The script run splitting can't be seen anymore (it's ~0.1% now), text breaking is still there, but way smaller (6%, and that's 6% of a total that's 8x smaller, so it would be 0.75% compared to the original 26%). Not bad. The PDF generation still takes a couple of seconds (it's 400 pages after all), but it's way faster.

Other problem I noticed while working on this was the related bugreport #144515 (and a couple more that I've closed as its duplicates):

The primary cost here is OutputDevice::ImplLayout(), which lays out text into glyphs and their positions. It is possible to cache this using the SalLayoutGlyphs class, and e.g. Writer has already started caching that for repeated calls, but in this case it's Calc using the EditEngine class, which does no caching.

So as a fix I've moved the caching code to VCL and turned it into a generic SalLayoutGlyphsCache class, and then made this place use that cache ... which didn't really help that much. After investigation it turned out that EditEngine tries to fit the given text into the given paper size (Calc's cell in this case), and so it repeatedly asks to lay out the entire long string in the cell, then breaks the line at the needed width, and then it repeats the same with the rest of the string for the next line, and so on. Which again results in horrible O(N^2) performance that mostly repeats the same over and over again.

But that should be possible to avoid, right? If it's repeatedly the same text, just a subset with increasing starting index, then presumably the glyphs are the same, and their positions are also the same, just at an offset. Well, it's not that simple actually, as in some cases it's not possible to cut glyphs for text at a random place and hope it'll be exactly the same as text layout would give, since text layout may try e.g. to position several adjacent spaces more nicely. But after a couple of attempts Noel pointed out to me that Harfbuzz actually provides information about places where it's not safe to break. Finally, I noticed that the problem with a number of those bugreports is people having small Calc cells with long text, where most of the text is not seen. So for the default case it can be made to show only the start of the text that fits, and so I made it possible for EditEngine to stop breaking lines when the given size is filled in instead of trying to pointlessly process lines that won't be needed.

Again, the resulting profile looks much better:

The operation is now almost 100x times faster (*cough*, ok, 62x) and is just a relatively small part in the total picture. Let's call that good enough.

In other somewhat related news, the fontconfig library has a new stable release that finally includes some performance improvemens when looking up fonts (not updated on its webpage, but available for download in release list).

BTW, since this was a TDF tender work, these improvements have been funded by TDF donations.


Thursday, March 31, 2022

Clang 14 faster at building LibreOffice

 So I've updated my Clang to the newly released version 14, and while doing the LibreOffice rebuild afterwards I noticed that it seemed to build faster. I didn't measure it, but the build finished sooner than I expected. So of course I've measured it.

As a simple reference I used my year-old post about Clang 11 building faster with PCH, where Calc'c Library_sc built in 4 minutes and 39 seconds. And indeed now it's faster:

 1105.84user 88.35system 2:55.07elapsed 682%CPU (0avgtext+0avgdata 1666576maxresident)k
 180904inputs+2272520outputs (41390major+23529938minor)pagefaults 0swaps

Just to make sure it's not something else, I went back to Clang 13 to test it:

 1581.76user 93.84system 4:14.75elapsed 657%CPU (0avgtext+0avgdata 1916380maxresident)k
 153912inputs+2316696outputs (41891major+27245569minor)pagefaults 0swaps

So yes, it's real, and it's the compiler. It also suggests that there was an improvement also between Clang 11 and Clang 13, although not as noticeable as this.

I have no idea why that is, it seems too big of a difference to be just something random, but I see nothing relevant in the release notes (and it's not DWARF5, I tested that one). I also have no idea if it's a code change or if it's the compiler being faster because it generates faster code and is self-built. But hey, it's nice. I still vaguely remember the times when I was trying to avoid full Calc rebuilds like a plague, but that seems like a long time ago.


Tuesday, March 8, 2022

Enabling Calc support for 16384 columns by default

Last couple of weeks I have been working on the 16k columns support in Calc. There's been a lot of work on this already by Noel and others, but so far this has been hidden behind the experimental option, and normally documents open only with the "normal" 1024 columns support. The goal of this work is to finish the 16k support stable enough for it to be the default, so that people who need this many columns can finally get them without any complications.

As of now all Calc tests pass with the default switched to 16k, and I've also dealt with all the known problems from tdf#133764 (minus few rare corner cases that I can deal with later). But I'm pretty sure there are more hidden problems lurking, either crashes because of incorrect bounds checking, or performance problems when some code suddenly deals with 16x more columns. So the next step is to enable this by default in master and collect compl... feedback from  guin... testers :).

I have the change already in Gerrit and expect to push it later today. If you'll be lucky unlucky noticing any problems, please report them in bugzilla and set your bugreport to block tdf#133764, and I'll sort it out. One of the things that remains to be decided is how to handle this from the users point of view. So far it seems to it'll be fine to just load everything in 16k without saying anything, but a part of collecting feedback is checking whether and how much backwards compatibility handling would be needed. If all goes well, and so far I don't see why it shouldn't, 7.4 will ship with 16k columns being the default.

Just to be clear, this is only about columns, not rows. The experimental option also changes rows from 1048576 to 16777216, but that's not focus of the work, so the default will stay at 1048576. First of all, Excel also only supports 1m rows, so there's not(?) such a demand for it, and it'll also require more work (16m rows is a lot of pixels, and it doesn't fit into 32bits).

This work is funded/sponsored by DEVxDAO as part of its mission to support open source and transparent research and development of emerging technologies and frameworks.

Fun meaningless fact: Column 16384 is called XFD.


Friday, October 22, 2021

Optimizing LibreOffice for a larger number of users

Have you ever edited a document in LibreOffice in more than one window? Right, neither have I. Who'd think about LibreOffice and more than one user at the same time, right? Except ... somebody did and that's how collaborative editing based on LibreOffice works. For whatever strange reason, somewhen in the past somebody thought that implementing multiple views for one document in OpenOffice (StarOffice?) was a good idea. Just select Window->New Window in the menu and you can edit your favourite document in 50 views that each show a different part of the document and update in real time. And that, in fact, is how collaborative editing such as with Collabora Online works - open a document, create a new view for every user, and there you go.

But, given that this has never really been used that much, how well did the original relevant code perform and scale for more users? Well, not much, it turns out. Not a big surprise, considering that presumably back when that code was written nobody thought the same document could be edited by numerous users at the same time. But I've been looking exactly into this recently as part of optimizing Collabora Online performance, and boy, are there were gems in there. You thought that showing the same document in more views would just mean more painting also in those views? Nah, think again, this is OpenOffice code, the land of programming wonders.

Profiling the code

When running Online's perf-test, which simulates several users typing in the same document, most of the time is actually spent in SwEditShell::EndAllAction(). It's called whenever Writer finishes an action such as adding another typed characters, and one of the things it does is telling other views about the change. So here LO spends a little time adding the character and then the rest of the time is spent in various code parts "talking" about it. A good part of that is that whenever an action is finished, that view tells the others about it happening, and then all those views tell all other views about how they reacted to it, making every change O(n^2) with the number of views. That normally does not matter, since on the desktop n generally tends to be 1, but hey, add few more views, and it can be a magnitude slower or more.

Redrawing, for example, is rather peculiar. When a part of the document changes, relevant areas of the view need redrawing. So all views get told about the rectangles that need repainting. In the desktop case those can be cropped by the window area, but for tiled rendering used by Online the entire document is the "window" area, so every view gets told about every change. And each view collects such rectangles, and later on it processes them and tells all other views about the changes. Yes, again. And it seems that in rare cases each view really needs its own repaint changes (even besides the cropping, as said before). So there's a lot of repeated processing of usually the same rectangles over and over again.

One of the functions prominently taking place in CPU costs is SwRegionRects::Compress(), which ironically is supposed to make things faster by compressing a group of rectangles into a smaller set of rectangles. I guess one of the cases in OpenOffice where the developer theoretically heard about optimizations being supposed to make things faster, but somehow the practical side of things just wasn't there. What happens here is that the function compares each rectangle with each other, checking if they can be merged ... and then, if yes and that changes the set of rectangles, it restarts the entire operation. Which easily makes the entire thing O(n^3). I have not actually found out why the restarting is there. I could eventually think of a rather rare case where restarting makes it possible to compress the rectangles more, but another possibility is that the code dates back to the time when it was not safe to continue after modifying whichever container SwRegionRects was using back then, and it has stayed there even though that class has been using to std::vector since a long time ago.

Another kind of interesting take on things is the SwRegionRects::operator-= in there. Would you expect that rectangles would be collected by simply, well, collecting them and then merging them together? Maybe you would, but that's not how it's done here. See, somebody apparently thought that it'd be better to use the whole area, and then remove rectangles to paint from it, and then at the end invert the whole thing. The document area is limited, so maybe this was done to "easily" crop everything by the area? It works, except, of course, this is way slower. Just not slow enough to really notice when n is 1.

Other code that works fine with small numbers but fails badly with larger ones is VCLEventListeners, a class for getting notified about changes to VCL objects such as windows. It's simply a list of listener objects, and normally there aren't that many of those. But if LO core gets overloaded, this may grow. And since each listener may remove itself from the list at any point, the loop calling all of them always checks for each of them if the listener is still in the list. So, again, O(n^2). And, of course, it's only rarely that any listener removes itself, so the code spends a lot of time doing checks just in case.

But so that I do not talk only about old code, new code can do equally interesting things. Remote rendering uses LOK (LibreOfficeKit), which uses text-based messages to send notifications about changes. And the intuitive choice for writing text are C++ iostreams, which are flexible, and slow. So there will be a lot of time spent in creating text messages, because as said above, there are many changes happening, repeatedly. And since there are so many repeated messages, it makes sense to write extra class CallbackFlushHandler that collects these messages and drops duplicates. Except ... for many of the checks it first needs to decode text mesages back to binary data, using C++ iostreams. And in most cases, it will find out that it can drop some message duplicates, so all these string conversions were done for nothing. Oops.

And there are more ways in which things can get slower rather than faster. CallbackFlushHandler uses an idle timer to first process all data in bulk and flush the data at once only when idle. Except if it gets too busy to keep up, and it can easily get too busy because of all the things pointed out above, it may take a very long time before any data is flushed. To make things even worse, the queue of collected messages will be getting longer and longer, which means searching for duplicates and compressing it will get longer. Which in turn will make everything even slower, which again in turn will make everything even slower. Bummer.

All in all, if unlucky, it may not take that much for everything to slow down very noticeably. Online's perf-test, which simulates only 6 users typing, can easily choke itself for a long time. Admitedly, it simulates them all typing at the same time and rather fast, which is not very a realistic scenario, but typing hitting the keyboard randomly and quickly is exactly how we all test things, right? So I guess it could be said that Collabora Online's perf-test simulates users testing Collabora Online performance :). Realistic scenarios are not going to be this bad.

Anyway. In this YT video you can see in the top part how perf-test performs without any optimizations. The other 5 simulated users are typing elsewhere in the document, so it's not visible, but it affects performance.

Improved performance

But as you can see in the other two parts, this poor performance is actually already a thing of the past. The middle part shows how big a difference can even one change make. In this specific case, the only difference is adding an extra high-priority timer to CallbackFlushHandler, which tries to flush the message queue before it becomes too big.

The bottom part is all the improvements combined, some of them already in git, some of them I'm still cleaning up. That includes changes like:

  • SwRegionRects::Compress() is now roughly somewhere at O(n*log(n)) I think. I've fixed the pointless restarts on any change, and implemented further optimizations such as Noel's idea to first sort the rectangles and not compare ones that cannot possibly overlap.
  • I have also changed the doubly-inverted paint rectangles handling to simply collecting them, cropping them at the end and compressing them.
  • One of the things I noticed when views collect their paint rectangles is that often they are adjacent and together form one large rectangle. So I have added a rather simple optimization of detecting this case and simply growing the previous rectangle.
  • Since it seems each Writer view really needs to collect its own paint rectangles, I have at least changed it so that they do not keep telling each other about them all the time in LOK mode. Now they collect them, and only once at end they are all combined together and compressed, and often thousands of rectangles become just tens of them.
  • Another thing Writer views like to announce all the time in LOK mode are cursor and selection positions. Now they just set a flag and compute and send the data only once at the end if needed.
  • VCLEventListeners now performs checks only if it knows that a listener has actually removed itself. Which it knows, because it manages the list.
  • Even though LOK now uses tiled rendering to send the view contents to clients, LO core was still rendering also to windows, even though those windows are never shown. That's now avoided by ignoring window invalidations in LOK mode.
  • Noel has written a JSON writer which is faster then the Boost one, and made non-JSON parts use our OString, which is faster and better suited for the fixed-format messages.
  • I have converted the from-message conversions to also use our strings, but more importantly I have changed internal LOK communications to be binary rather than text based, so that they usually do not even have to be converted. Only at the end those relatively few non-duplicated text messages are created.
  • Noel has optimized some queue handling in CallbackFlushHandler and then I have optimized it some more. Including the high-priority timer mentioned above.
  • There have been various other improvements from others from Collabora as part of the recent focus on improving performance.
  • While working on all of this, I noticed that even though we have support for Link Time Optimization (LTO), we do not use it, probably because it was broken on Windows. I've fixed this and sorted out few other small problems, and releases in the future should get a couple percent better performance across the board from this.

This is still work in progress, but it already looks much better, as now most of the time is actually spent doing useful things like doing the actual document changes or drawing and sending document tiles to clients. LOK and Collabora Online performance should improve noticeably, recent (Collabora) versions 6.4.x should include already some improvements, and the upcoming Collabora Online 2021 should have all of them.

And even though this's been an exercise in profiling LibreOffice performance for something nobody thought of back when the original OpenOffice code was written, some of these changes should matter even for desktop LibreOffice and will be included starting with LO 7.3.



Wednesday, August 25, 2021

Skia on Mac

 So, current LibreOffice git version now has support for drawing based on Skia also for the Mac. Both Raster and Metal (the Mac GPU framework). Below is the obligatory screenshot, and here is the hey-it-can-be-faster video.



Sunday, April 18, 2021

The effect of CPU, link-time (LTO) and profile-guided (PGO) optimizations on the compiler itself

 In other words, how much faster will a compiler be after it's been built with various optimizations?

Given the recent Clang12 release, I've decided to update my local build of Clang11 that I've been using for building LibreOffice. I switched to using my own Clang build instead of openSUSE packages somewhen in the past because it was faster. I've meanwhile forgot how much faster :), and openSUSE packages now build with LTO, so I've built Clang12 in several different ways to test the effect and this is it:


The file compiled is LO Calc's document.cxx, a fairly large source file, in a debug LO build. The compilation of the file is always the same, the only thing that differs is the compiler used and whether LO's PCH support is enabled. And the items are:

  1. Base - A release build of Clang12, with (more or less) the default options.
  2. CPU - As above, with -march=native -mtune=native added.
  3. LTO - As above, with link-time optimization used. Building Clang this way takes longer.
  4. LTO+PGO - As above, also with profile-guided optimization used. Building Clang this way takes even longer, as it needs two extra Clang builds to collect the PGO data.
  5. Base PCH - As Base, and the file is built with PCH used.
  6. LTO+PGO PCH - As LTO+PGO, again with PCH used.

Or, if you want this as numbers, then with Base being 100%, CPU is 85%, LTO is 78%, LTO+PGO is 59%, Base PCH is 37% and LTO+PGO PCH is 25%. Not bad.

Mind you, this is just for one randomly selected file. YMMV. For the build from the video from the last time, the original time of 4m39s with Clang11 LTO PCH goes down to 3m31s for Clang12 LTO+PGO PCH, which is 76%, which is consistent with the LTO->LTO+PGO change above.