
Optimization of zim::Cache #385

Closed
veloman-yunkan opened this issue Jul 29, 2020 · 19 comments · Fixed by #405

Comments

@veloman-yunkan
Collaborator

The current implementation of the LRU cache in src/cache.h can be unacceptably slow for a large number of elements and certain access patterns, since its get() and put() operations contain branches with O(N) time complexity. Given the current usage of the cache in src/fileimpl.cpp, the linear complexity is due to the Cache::_getOldest() member function.

@veloman-yunkan veloman-yunkan self-assigned this Jul 29, 2020
@veloman-yunkan veloman-yunkan added this to To do in Speed-up via automation Jul 29, 2020
@MiguelRocha
Contributor

Great issue to explore. It is definitely the biggest performance degradation factor in libzim. I'll leave some follow-up questions:

  • Do we really need the cache to be LRU? _getOldest() is called because of this strategy.
  • Are we taking advantage of using an ordered std::map? Should we consider a priority queue?
  • For some use cases (e.g. zimcheck, zimdump) we only hit the cache once per item. We do not need the LRU mechanism at all; in fact, as I stated before, it is hurting performance a lot. Is it feasible to have some way to programmatically customize the cache strategy?

@kelson42
Contributor

@veloman-yunkan I'm not sure if this ticket is a question or a kind of plan to change something... What I believe is that this is really a topic worth a discussion, and I would like an agreement on a better solution before a PR is made.

@mgautierfr
Collaborator

Honestly, I have never had a look at the cache code.
And I probably should have done this before :/
I agree with you that the implementation could be greatly improved.

  • I don't understand why we have a winner/loser separation. A simple ordering by serial (age) should be enough.
  • A map is nice for quickly searching for a key, but it is useless for keeping track of the age of an entry.

We could :

  • keep a map<key, pair<value, serial>> (dataMap) to have quick access to the value in the cache.
  • have a map<serial, key> (ageMap) to have a quick access to the oldest element to drop.

The get(key) algorithm would be (pseudo-code):

if (key in dataMap) {
    value, serial = dataMap[key]; // O(log(N))
    ageMap.erase(serial); // O(log(N))
    serial = new_serial();
    dataMap[key] = value, serial; // O(1) (we already know where to insert)
    ageMap[serial] = key; // O(log(N)), amortized O(1) when inserting at the end with a hint
    return value;
} else {
    raise NotFound
}

The put(key, value) algorithm would be:

if (!enoughSpace()) {
    old_serial, old_key = ageMap.begin(); // smallest serial, O(1)
    ageMap.erase(old_serial); // O(log(N))
    dataMap.erase(old_key); // O(log(N))
}
serial = new_serial();
dataMap[key] = value, serial; // O(log(N))
ageMap[serial] = key; // O(log(N)), amortized O(1) when inserting at the end with a hint
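A rough C++ translation of the dataMap/ageMap scheme sketched above might look as follows (the class name and some details, such as throwing on a miss, are my own assumptions, not part of the proposal):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <utility>

// Sketch of the two-map LRU scheme: every operation is O(log N)
// instead of the O(N) eviction scan.
template <typename Key, typename Value>
class TwoMapLruCache {
  std::map<Key, std::pair<Value, uint64_t>> dataMap_; // key -> (value, serial)
  std::map<uint64_t, Key> ageMap_;                    // serial -> key, oldest first
  uint64_t nextSerial_ = 0;
  size_t maxSize_;

public:
  explicit TwoMapLruCache(size_t maxSize) : maxSize_(maxSize) {}

  Value get(const Key& key) {
    auto it = dataMap_.find(key);                  // O(log N)
    if (it == dataMap_.end())
      throw std::out_of_range("not found");
    ageMap_.erase(it->second.second);              // O(log N)
    const uint64_t serial = nextSerial_++;
    it->second.second = serial;                    // refresh the age in place
    ageMap_.emplace_hint(ageMap_.end(), serial, key); // amortized O(1) with hint
    return it->second.first;
  }

  void put(const Key& key, const Value& value) {
    if (dataMap_.size() >= maxSize_ && !dataMap_.count(key)) {
      auto oldest = ageMap_.begin();               // smallest serial, O(1)
      dataMap_.erase(oldest->second);              // O(log N)
      ageMap_.erase(oldest);                       // amortized O(1) via iterator
    }
    auto it = dataMap_.find(key);
    if (it != dataMap_.end())
      ageMap_.erase(it->second.second);            // drop the stale age entry
    const uint64_t serial = nextSerial_++;
    dataMap_[key] = { value, serial };
    ageMap_.emplace_hint(ageMap_.end(), serial, key);
  }

  size_t size() const { return dataMap_.size(); }
};
```

Since serials are strictly increasing, new age entries always go to the end of ageMap, which is why the hinted insert is amortized O(1).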

Would it solve our issue?
Can you think of other use cases?
Do you have a better idea?

It is, definitely, the biggest performance degradation factor in libzim

Do you have numbers for this?
In all my measurements, most of the time was spent in cluster decompression.
The cache size is 16 for the clusters and 512 for the dirents.
I agree that a linear search is definitely not the best algorithm, but the numbers are not so huge.

Is it feasible to have some way to programmatically customize the cache strategy?

What strategy are you thinking about?

In the case of zimdump/zimcheck, we iterate over the entries in cluster order. We don't need to keep a cluster in the cache: as soon as we need another cluster, we know we are finished with the old one (modulo the number of threads). We could simply limit the cache size to the number of threads to reduce the cache size (and speed up the lookup). But I'm not sure it's worth it if we reimplement the cache correctly.
Same for the dirent cache: it is useful for the binary search, but we never use it that way, as we directly use the entry index.

@veloman-yunkan
Collaborator Author

@mgautierfr @kelson42 Before optimizing (rather, reimplementing) our own cache, maybe it makes sense to use a 3rd-party cache?

@veloman-yunkan
Collaborator Author

My own plan was to implement something like https://github.com/lamerman/cpp-lru-cache, with the only difference of using an std::map instead of an std::unordered_map.
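The cpp-lru-cache approach keeps a doubly linked list in recency order plus a map pointing into the list, so refreshing an entry is an O(1) splice after the lookup. A sketch of that design with std::map substituted for std::unordered_map, as proposed above (names and interface here are my own illustration, not the actual cpp-lru-cache code):

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <stdexcept>
#include <utility>

// Sketch of a list+map LRU cache in the style of cpp-lru-cache,
// using std::map instead of std::unordered_map.
template <typename Key, typename Value>
class LruCache {
  using Item = std::pair<Key, Value>;
  std::list<Item> items_;                                   // front = most recent
  std::map<Key, typename std::list<Item>::iterator> index_; // key -> list node
  size_t maxSize_;

public:
  explicit LruCache(size_t maxSize) : maxSize_(maxSize) {}

  const Value& get(const Key& key) {
    auto it = index_.find(key);                        // O(log N)
    if (it == index_.end())
      throw std::range_error("no such key");
    items_.splice(items_.begin(), items_, it->second); // O(1) move-to-front
    return it->second->second;
  }

  void put(const Key& key, const Value& value) {
    auto it = index_.find(key);
    if (it != index_.end()) {                 // overwrite an existing entry
      it->second->second = value;
      items_.splice(items_.begin(), items_, it->second);
      return;
    }
    if (index_.size() >= maxSize_) {          // evict the least recently used
      index_.erase(items_.back().first);      // O(log N)
      items_.pop_back();                      // O(1)
    }
    items_.emplace_front(key, value);
    index_[key] = items_.begin();
  }

  bool exists(const Key& key) const { return index_.count(key) != 0; }
  size_t size() const { return index_.size(); }
};
```

Compared with the two-map/serial scheme, this avoids serial bookkeeping entirely: the list order is the age order, and eviction is simply popping the back of the list.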

@mgautierfr
Collaborator

I had a look at the implementation and it seems good.
And it is a header-only library (like mustache), so it should be relatively easy to compile on all platforms.

It should be pretty easy to test (at least) the cache implementation/efficiency and make a decision based on the results.

@veloman-yunkan
Collaborator Author

Benchmarking cpp-lru-cache against the current implementation of zim::Cache promises significant speed-up of zimcheck:

| ZIM file | size (MB) | article count | cluster count | zimcheck -A runtime (zim::Cache) | zimcheck -A runtime (cpp-lru-cache) |
| --- | --- | --- | --- | --- | --- |
| wikipedia_en_climate_change_nopic_2020-01.zim | 31 | 7646 | 51 | 9.5s | 6s |
| wikipedia_hy_all_mini_2020-08.zim | 563 | 509611 | 1526 | 575s | 229s |

@veloman-yunkan
Collaborator Author

veloman-yunkan commented Aug 19, 2020

Now how do we proceed from here? cpp-lru-cache needs a couple of minor modifications. On the other hand, it is so small and simple that if I implemented it on my own (instead of googling) we might spend less time overall than if we decide to integrate cpp-lru-cache into libzim.

@mgautierfr
Collaborator

The speed-up is good. Promising.
On top of that, zimcheck -A does a lot of checks that take time.
Can you do another benchmark with zimcheck only looping over all dirents/clusters but doing no checks?
(Only checking for internal links and patching the getLink function to do nothing and return an empty vector should do the trick.) This way we would really benchmark the cache speedup.

What kind of minor modifications?
If they can be pushed upstream, we should make a PR and use the project.

But indeed, the implementation is pretty simple. Maybe it is simpler to take the two methods and replace our put/get methods with the ones from cpp-lru-cache (depending on what we need to do).

@kelson42
Contributor

Two remarks:

@veloman-yunkan
Collaborator Author

One more datapoint. In the context of a different PR, I had performed a similar benchmark on a different machine using an earlier version of the "same" ZIM file, and observed about 2x shorter runtimes. The difference was unlikely to be attributable to the difference in hardware, so I also ran zimcheck on that earlier ZIM file on my desktop.

| ZIM file | size (MB) | article count | cluster count | zimcheck -A runtime (zim::Cache) | zimcheck -A runtime (cpp-lru-cache) |
| --- | --- | --- | --- | --- | --- |
| wikipedia_hy_all_mini_2020-07.zim | 560 | 509325 | 1518 | 258s | 105s |
| wikipedia_hy_all_mini_2020-08.zim | 563 | 509611 | 1526 | 575s | 229s |

The newer file contains only slightly more data than the old one, but is twice as costly to process. Any ideas what might have caused this?

@legoktm
Member

legoktm commented Aug 19, 2020

Thanks for the ping. It doesn't look like any distro has packaged cpp-lru-cache. It also seems inactive with no commits since 2017.

It's only about 70 lines, so given that we also need to modify it, I'd suggest just including that header file in the libzim code (with proper attribution/licensing of course), treating it as one of "our" files rather than trying to use it as a library.

@mgautierfr
Collaborator

Any ideas what might have caused this?

I don't know; the zim structures seem relatively identical.
Have you tried running the tests several times? Maybe the difference is because of the fs cache; it could be pretty significant.

@kelson42
Contributor

@veloman-yunkan Have you been able to explain the oddity you reported earlier?

@veloman-yunkan
Collaborator Author

@kelson42 No, I haven't looked into it yet. One hypothesis is cache overflow: if the new, slightly larger data requires a larger resident set size, this may lead to a significantly higher number of cache misses, provided that the RSS (in terms of utilized cache entries) of the earlier/smaller data was close to the cache size limit.

@veloman-yunkan
Collaborator Author

@kelson42 But the actual reason shouldn't prevent us from switching to the better cache. I will debug the noticed oddity later.

Can you do another benchmark with zimcheck only looping over all dirents/clusters but doing no checks?

@mgautierfr I am not going to spend time on this for the same reason. It is obvious that cpp-lru-cache is significantly faster and we should switch to it in any case.

What kind of minor modifications?

You can see them in #405. But I am going to improve the new cache a little more.

@veloman-yunkan
Collaborator Author

Have you tried running the tests several times? Maybe the difference is because of the fs cache; it could be pretty significant.

I repeated the slow runs twice in a row and the run-time figures stayed the same.

@veloman-yunkan
Collaborator Author

One hypothesis is cache overflow: if the new, slightly larger data requires a larger resident set size, this may lead to a significantly higher number of cache misses, provided that the RSS (in terms of utilized cache entries) of the earlier/smaller data was close to the cache size limit.

This hypothesis is likely wrong. Separately varying the dirent cache size from 512 (default) to 1024 and the cluster cache size from 16 (default) to 24 has a very weak effect (under 3%) on the run time of zimcheck -A. I think it makes sense to file a new issue and work on it separately. Should I file it under libzim or zim-tools?

@mgautierfr
Collaborator

I agree with @veloman-yunkan: zimcheck is slower with the new file whether we use the new or the old cache system.
Create the issue in zim-tools, because that is where the problem appears. If we need to move it to libzim, we will do so then.
