Friday, October 22, 2021

Optimizing LibreOffice for a larger number of users

Have you ever edited a document in LibreOffice in more than one window? Right, neither have I. Who'd think about LibreOffice and more than one user at the same time, right? Except ... somebody did and that's how collaborative editing based on LibreOffice works. For whatever strange reason, somewhen in the past somebody thought that implementing multiple views for one document in OpenOffice (StarOffice?) was a good idea. Just select Window->New Window in the menu and you can edit your favourite document in 50 views that each show a different part of the document and update in real time. And that, in fact, is how collaborative editing such as with Collabora Online works - open a document, create a new view for every user, and there you go.

But, given that this has never really been used that much, how well did the original relevant code perform and scale for more users? Well, not much, it turns out. Not a big surprise, considering that presumably back when that code was written nobody thought the same document could be edited by numerous users at the same time. But I've been looking exactly into this recently as part of optimizing Collabora Online performance, and boy, are there were gems in there. You thought that showing the same document in more views would just mean more painting also in those views? Nah, think again, this is OpenOffice code, the land of programming wonders.

Profiling the code

When running Online's perf-test, which simulates several users typing in the same document, most of the time is actually spent in SwEditShell::EndAllAction(). It's called whenever Writer finishes an action such as adding another typed characters, and one of the things it does is telling other views about the change. So here LO spends a little time adding the character and then the rest of the time is spent in various code parts "talking" about it. A good part of that is that whenever an action is finished, that view tells the others about it happening, and then all those views tell all other views about how they reacted to it, making every change O(n^2) with the number of views. That normally does not matter, since on the desktop n generally tends to be 1, but hey, add few more views, and it can be a magnitude slower or more.

Redrawing, for example, is rather peculiar. When a part of the document changes, relevant areas of the view need redrawing. So all views get told about the rectangles that need repainting. In the desktop case those can be cropped by the window area, but for tiled rendering used by Online the entire document is the "window" area, so every view gets told about every change. And each view collects such rectangles, and later on it processes them and tells all other views about the changes. Yes, again. And it seems that in rare cases each view really needs its own repaint changes (even besides the cropping, as said before). So there's a lot of repeated processing of usually the same rectangles over and over again.

One of the functions prominently taking place in CPU costs is SwRegionRects::Compress(), which ironically is supposed to make things faster by compressing a group of rectangles into a smaller set of rectangles. I guess one of the cases in OpenOffice where the developer theoretically heard about optimizations being supposed to make things faster, but somehow the practical side of things just wasn't there. What happens here is that the function compares each rectangle with each other, checking if they can be merged ... and then, if yes and that changes the set of rectangles, it restarts the entire operation. Which easily makes the entire thing O(n^3). I have not actually found out why the restarting is there. I could eventually think of a rather rare case where restarting makes it possible to compress the rectangles more, but another possibility is that the code dates back to the time when it was not safe to continue after modifying whichever container SwRegionRects was using back then, and it has stayed there even though that class has been using to std::vector since a long time ago.

Another kind of interesting take on things is the SwRegionRects::operator-= in there. Would you expect that rectangles would be collected by simply, well, collecting them and then merging them together? Maybe you would, but that's not how it's done here. See, somebody apparently thought that it'd be better to use the whole area, and then remove rectangles to paint from it, and then at the end invert the whole thing. The document area is limited, so maybe this was done to "easily" crop everything by the area? It works, except, of course, this is way slower. Just not slow enough to really notice when n is 1.

Other code that works fine with small numbers but fails badly with larger ones is VCLEventListeners, a class for getting notified about changes to VCL objects such as windows. It's simply a list of listener objects, and normally there aren't that many of those. But if LO core gets overloaded, this may grow. And since each listener may remove itself from the list at any point, the loop calling all of them always checks for each of them if the listener is still in the list. So, again, O(n^2). And, of course, it's only rarely that any listener removes itself, so the code spends a lot of time doing checks just in case.

But so that I do not talk only about old code, new code can do equally interesting things. Remote rendering uses LOK (LibreOfficeKit), which uses text-based messages to send notifications about changes. And the intuitive choice for writing text are C++ iostreams, which are flexible, and slow. So there will be a lot of time spent in creating text messages, because as said above, there are many changes happening, repeatedly. And since there are so many repeated messages, it makes sense to write extra class CallbackFlushHandler that collects these messages and drops duplicates. Except ... for many of the checks it first needs to decode text mesages back to binary data, using C++ iostreams. And in most cases, it will find out that it can drop some message duplicates, so all these string conversions were done for nothing. Oops.

And there are more ways in which things can get slower rather than faster. CallbackFlushHandler uses an idle timer to first process all data in bulk and flush the data at once only when idle. Except if it gets too busy to keep up, and it can easily get too busy because of all the things pointed out above, it may take a very long time before any data is flushed. To make things even worse, the queue of collected messages will be getting longer and longer, which means searching for duplicates and compressing it will get longer. Which in turn will make everything even slower, which again in turn will make everything even slower. Bummer.

All in all, if unlucky, it may not take that much for everything to slow down very noticeably. Online's perf-test, which simulates only 6 users typing, can easily choke itself for a long time. Admitedly, it simulates them all typing at the same time and rather fast, which is not very a realistic scenario, but typing hitting the keyboard randomly and quickly is exactly how we all test things, right? So I guess it could be said that Collabora Online's perf-test simulates users testing Collabora Online performance :). Realistic scenarios are not going to be this bad.

Anyway. In this YT video you can see in the top part how perf-test performs without any optimizations. The other 5 simulated users are typing elsewhere in the document, so it's not visible, but it affects performance.

Improved performance

But as you can see in the other two parts, this poor performance is actually already a thing of the past. The middle part shows how big a difference can even one change make. In this specific case, the only difference is adding an extra high-priority timer to CallbackFlushHandler, which tries to flush the message queue before it becomes too big.

The bottom part is all the improvements combined, some of them already in git, some of them I'm still cleaning up. That includes changes like:

  • SwRegionRects::Compress() is now roughly somewhere at O(n*log(n)) I think. I've fixed the pointless restarts on any change, and implemented further optimizations such as Noel's idea to first sort the rectangles and not compare ones that cannot possibly overlap.
  • I have also changed the doubly-inverted paint rectangles handling to simply collecting them, cropping them at the end and compressing them.
  • One of the things I noticed when views collect their paint rectangles is that often they are adjacent and together form one large rectangle. So I have added a rather simple optimization of detecting this case and simply growing the previous rectangle.
  • Since it seems each Writer view really needs to collect its own paint rectangles, I have at least changed it so that they do not keep telling each other about them all the time in LOK mode. Now they collect them, and only once at end they are all combined together and compressed, and often thousands of rectangles become just tens of them.
  • Another thing Writer views like to announce all the time in LOK mode are cursor and selection positions. Now they just set a flag and compute and send the data only once at the end if needed.
  • VCLEventListeners now performs checks only if it knows that a listener has actually removed itself. Which it knows, because it manages the list.
  • Even though LOK now uses tiled rendering to send the view contents to clients, LO core was still rendering also to windows, even though those windows are never shown. That's now avoided by ignoring window invalidations in LOK mode.
  • Noel has written a JSON writer which is faster then the Boost one, and made non-JSON parts use our OString, which is faster and better suited for the fixed-format messages.
  • I have converted the from-message conversions to also use our strings, but more importantly I have changed internal LOK communications to be binary rather than text based, so that they usually do not even have to be converted. Only at the end those relatively few non-duplicated text messages are created.
  • Noel has optimized some queue handling in CallbackFlushHandler and then I have optimized it some more. Including the high-priority timer mentioned above.
  • There have been various other improvements from others from Collabora as part of the recent focus on improving performance.
  • While working on all of this, I noticed that even though we have support for Link Time Optimization (LTO), we do not use it, probably because it was broken on Windows. I've fixed this and sorted out few other small problems, and releases in the future should get a couple percent better performance across the board from this.

This is still work in progress, but it already looks much better, as now most of the time is actually spent doing useful things like doing the actual document changes or drawing and sending document tiles to clients. LOK and Collabora Online performance should improve noticeably, recent (Collabora) versions 6.4.x should include already some improvements, and the upcoming Collabora Online 2021 should have all of them.

And even though this's been an exercise in profiling LibreOffice performance for something nobody thought of back when the original OpenOffice code was written, some of these changes should matter even for desktop LibreOffice and will be included starting with LO 7.3.



6 comments:

  1. Thank you for doing all this incredible, significant, and helpful work!

    ReplyDelete
  2. How can a LibreOffice plug-in for Jami for serverless collaborative real-time editing be implemented??

    https://git.jami.net/savoirfairelinux/jami-project/-/issues/1010

    Thank you

    ReplyDelete
    Replies
    1. Easy, just implement it yourself or find a developer, pay some money him/her and developer implements it for you

      Delete
  3. It's just a brilliant work, Luboš!

    ReplyDelete
  4. Thanks Luboš, great work and well written article!

    ReplyDelete
  5. What an interesting read, feeling the achievements!

    ReplyDelete