Friday, July 15, 2022

Making unsorted lookups in Calc fast

 The VLOOKUP spreadsheet function by default requires the searched data to be sorted, and in that case it performs a fast binary search. If the data is not sorted (for example if it would be impractical to have the data that way), it is possible to explicitly tell VLOOKUP that the data is not sorted, in which case Calc did a linear one-by-one lookup. And there are other functions such as COUNTIF or SUMIF that essentially do a lookup too, and those cannot even be told that the data is sorted and so they processed the data linearly. With large spreadsheets this can actually take a noticeably long time. Bugreports such as tdf#139444, tdf#144777 or tdf#146546 say operations in such spreadsheets take minutes to complete, or even "freeze".

I wanted to do something about those for quite a while, as with the right idea making those much faster should be actually fairly simple. And the simple idea I had was to let Calc to sort the data first and then use fast binary search. These documents usually do lookups in the same fixed range of cells, so the linear search in the same unsorted data was rather a waste when done repeatedly. Surely Calc should be able to sort the data just once, cache it and then use that cached sort order repeatedly. In fact VLOOKUP already had a cache for results of lookup in the same area, used when doing lookup in the same row but different columns.

I finally found the time to do something about this when SUSE filled a bug to Collabora about their internal documents freezing on load and then crashing after 10 minutes. The LO thread pool class has a 10 minutes timeout as a safety measure after which it aborts, and the large number of lookups in the documents actually managed to exceed that timeout. So I can't actually say how slow it was before :), but I can quote Gerald Pfeifer from SUSE reporting the final numbers:

BeforeAfter
76m:56s CPU time (crash)9s CPU time (6s clock time)
126m:23s CPU time (crash)25s CPU time (15s clock time)
160m+ CPU time (crash)38s CPU time (23s clock time)
8m:56s CPU time (8m:56s clock time)14s CPU time (5s clock time)

This work is available in LibreOffice 7.4, and the TDF bugreports show similar improvements. In fact tdf#144777, titled "countifs() in Calc is slower than Excel's countifs()", now has a final comment saying that MS Office 2021 can do a specific document in 26s and LO 7.4 can do it in 2s. Good enough, I guess :).



No comments:

Post a Comment