Thursday, May 12, 2022

Improving Calc support for 16384 columns

So I enabled support for up to 16384 columns in Calc by default some time ago, but just getting it to work was not necessarily the end of the work. Making Calc have 16 times more columns means that any operation that works on entire columns is suddenly 16 times slower, or even worse. Similarly this could easily lead to 16x more memory used. So the support not only needs to work, but it also needs to be usable.

It theory adding a number of empty columns to the end of a spreadsheet should not make a difference, but in practice it does. With 1024 columns it is not as necessary to ignore those empty columns as it is with 16k, and a lot of the code dates back to the times when Calc supported even fewer colums (256?), where a being little inefficient here or there didn't show. But now it suddently did.

For example, if you protect or hide all unused columns until the end of the spreadsheet, then hitting the right arrow key on the last accessible cell makes Calc check all cells to the right for whether it's possible to go into them. And checking whether a column is hidden requires searching the list of column information, which is not trivial (it's compacted in order not to waste memory). The barely noticeable cost of this with 1024 columns got large enough to cause noticeable delays. Fortunately the ColHidden() function is smart enough to return the first and last column in the compacted range where the flag is equal, the code doing the cursor navigation just up until now didn't bother using that information, but now it needed to do so.

Another example, and that's quite a large topic, is allocating columns. If most of those new columns are not actually used, then it makes sense to allocate them only when needed, right? That will save memory, and it will make things faster too, because there is no need to check those empty columns. That idea got implemented back when this 16k work was started by others, adding e.g. function GetColumnsRange() that clamped the range to the allocated columns, but the problem is that in practice this is not as simple as that.

One of the issues here is let's say the case of selecting an entire row (clicking the row number to the left of the table does that easily) and then hitting Ctrl+B to make the entire row bold. That should not clamp the column range to the allocated columns, because if I later enter something into cells in those columns, I expect that to be bold. But if Calc allocates all columns for this, maybe I do not intend to enter values anywhere else except the first rows, so allocating all columns will be a waste. The solution to this is having default column data. The ScTable class now, besides having a list of allocated ScColumn's also has a ScColumnData member that stores some data for all not-yet allocated columns. Set the bold flag for all allocated columns and also in the default, and problem solved.

Except then, GetColumnsRange() clamping to allocated columns becomes incorrect, because now it's possible to have set data even beyond allocated columns, such as this bold flag. So I changed GetColumnsRange() to simply return the given range, without any adjustments, and then added the better-named GetAllocatedColumnsRange() for cases where the code knows it wants only the allocated range.

Somewhat similarly to the bold case, merely showing or going to an unallocated column should not allocate it. Otherwise hit e.g. Ctrl+Right one time too many and the cursor going to column XFD would make all columns get allocated. But that causes yet another complication - I am now at an unallocated column and all operations should either detect the column is not allocated and return, or allocate the column if needed. The initial 16k work added CreateColumnIfNotExists() exactly to protect such accesses and allocate the column if needed. It's just that this needed adding to quite many places, and some were still missing it, and others were calling it unnecessarily causing unnecessary column allocations. So I needed to work on these over time. I eventually went as far as change Calc to initially allocate just one column. Since before that Calc used to allocate 64 columns by default, a number of places missing such checks kept working because normally people didn't work with more than 64 columns (and so this 64 default was a reasonable choice at the time, as there was really a lot to check and fix). Now that I have changed this to just one column and fixed all tests, it looks like I've rooted them all out (at least I'm still getting only very few bugreports about something breaking :) ).

Drawing, somewhat unexpectedly, turned out to be a possible performance problem too. There are few ways in which cells to the left can affect drawing of cells to the right. If you enter a too-long text into a cell, it will overflow to the right, into the space of the next cell, or possibly even several cells. So when Calc is drawing let's say a couple of cells around the 10000th column, it actually needs to check also all the 10000 columns before. Somebody back in the day thought about optimizing it, and so before Calc draws cells, function FillInfo() first collects information about all the cells to draw and also all the cells to the left. What possibly(?) was an optimization with 256 or 1024 column is a problem with 16384 columns. Even allocating and clearing all the memory actually had a noticeable performance impact. Sadly, as sometimes happens to be the case with optimizations from the times, whoever wrote this made it slow. Function FillInfo() collects all data necessary for drawing a cell into struct CellInfo, and all that info is collected also for all the cells to the left, even though most of it is not used for them. So I had to find out what was necessary and split that out (and provide proper abstraction, because real programmers back in the day used direct data access, right).

 Some of the problems can be even a bit funny. Have you created e.g. a named range called DAY1, NUM1, LOG10 or even DOG10? Well, now you can't, since now those are valid cell addresses, going up to XFD1. So Calc now needed special backwards compatibility code for this.

I expect the real test of this comes when it becomes part of the LibreOffice 7.4 release or Collabora Online. But so far it seems to work rather well.

This work is funded/sponsored by DEVxDAO as part of its mission to support open source and transparent research and development of emerging technologies and frameworks.

1 comment:

  1. This problem has long ago been solved by writers of spreadsheets. It has been a long time if ever, since worksheet programs actually allocated memory to cells in parts of the worksheet that were beyond the range of cells that were active. Having an enormous number of unused columns should not have a noticeable effect on calculation times or memory usage. It is helpful to understand that modern worksheets are not stored in memory as enormously large arrays with a lot of empty elements. That's just not how it's done.

    Navigating the focus cell through a large number unused columns does not allocate memory to those unused columns. The navigation makes it appear that the user is moving through a large space, but, internally, no such space actually exists. Otherwise, the operation of the worksheet would be unacceptably ponderous.

    I suspect the writer of this article is encountering symptoms of other kinds of problems. Particularly, I would focus on the speed of the graphics-rendering chip of the computer. If the graphics chip is unable to keep up with the navigation through the worksheet, the user will get the impression that the computer as a whole is hopelessly clogged up. It isn't.