
Optimization of zim::Cache #385

Closed
veloman-yunkan opened this issue Jul 29, 2020 · 19 comments · Fixed by #405

Comments

@veloman-yunkan
Collaborator

The current implementation of the LRU cache in src/cache.h can be unacceptably slow for a large number of elements and certain access patterns, since its get() and put() operations contain branches with O(N) time complexity. Given the current usage of the cache in src/fileimpl.cpp, the linear complexity is due to the Cache::_getOldest() member function.

@veloman-yunkan veloman-yunkan self-assigned this Jul 29, 2020
@veloman-yunkan veloman-yunkan added this to To do in Speed-up via automation Jul 29, 2020
@MiguelRocha
Contributor

Great issue to explore. It is definitely the biggest performance degradation factor in libzim. I'll leave some follow-up questions:

  • Do we really need the cache to be LRU? _getOldest() is called because of this strategy.
  • Are we taking advantage of using an ordered std::map? Should we consider a priority queue?
  • For some use cases (e.g. zimcheck, zimdump) we only hit the cache once per item. We do not need the LRU mechanism at all; in fact, as I stated before, it is hurting performance a lot. Is it feasible to have some way to programmatically customize the cache strategy?

@kelson42
Contributor

@veloman-yunkan I'm not sure if this ticket is a question or a kind of plan to change something... What I believe is that this is really a topic worth a discussion, and I would like an agreement on a better solution before a PR is made.

@mgautierfr
Collaborator

Honestly, I have never had a look at the cache code.
And I probably should have done this before :/
I agree with you that the implementation could be greatly improved.

  • I don't understand why we have a winner/loser separation. A simple ordering by serial (age) should be enough.
  • A map is nice for quickly searching for a key, but it is useless for keeping track of the age of an entry.

We could :

  • keep a map<key, pair<value, serial>> (dataMap) to have quick access to the value in the cache.
  • have a map<serial, key> (ageMap) to have a quick access to the oldest element to drop.

The get(key) algorithm would be (pseudo-code):

if (key in dataMap) {
    value, serial = dataMap[key]; // O(log(N))
    ageMap.erase(serial); // O(log(N))
    serial = new_serial();
    dataMap[key] = value, serial; // O(1) (we already know where to insert)
    ageMap[serial] = key; // O(log(N)), amortized O(1) when inserting at the end with a hint
    return value;
} else {
    raise NotFound
}

The put(key, value) algorithm would be:

if (!enoughSpace()) {
    old_serial, old_key = ageMap.begin(); // smallest serial, O(1)
    ageMap.erase(old_serial); // O(log(N))
    dataMap.erase(old_key); // O(log(N))
}
serial = new_serial();
dataMap[key] = value, serial; // O(log(N))
ageMap[serial] = key; // O(log(N)), amortized O(1) when inserting at the end with a hint
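A rough C++ translation of the dataMap/ageMap scheme sketched above might look as follows (the class name and some details, such as throwing on a miss, are my own assumptions, not part of the proposal):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <utility>

// Sketch of the two-map LRU scheme: every operation is O(log N)
// instead of the O(N) eviction scan.
template <typename Key, typename Value>
class TwoMapLruCache {
  std::map<Key, std::pair<Value, uint64_t>> dataMap_; // key -> (value, serial)
  std::map<uint64_t, Key> ageMap_;                    // serial -> key, oldest first
  uint64_t nextSerial_ = 0;
  size_t maxSize_;

public:
  explicit TwoMapLruCache(size_t maxSize) : maxSize_(maxSize) {}

  Value get(const Key& key) {
    auto it = dataMap_.find(key);                  // O(log N)
    if (it == dataMap_.end())
      throw std::out_of_range("not found");
    ageMap_.erase(it->second.second);              // O(log N)
    const uint64_t serial = nextSerial_++;
    it->second.second = serial;                    // refresh the age in place
    ageMap_.emplace_hint(ageMap_.end(), serial, key); // amortized O(1) with hint
    return it->second.first;
  }

  void put(const Key& key, const Value& value) {
    if (dataMap_.size() >= maxSize_ && !dataMap_.count(key)) {
      auto oldest = ageMap_.begin();               // smallest serial, O(1)
      dataMap_.erase(oldest->second);              // O(log N)
      ageMap_.erase(oldest);                       // amortized O(1) via iterator
    }
    auto it = dataMap_.find(key);
    if (it != dataMap_.end())
      ageMap_.erase(it->second.second);            // drop the stale age entry
    const uint64_t serial = nextSerial_++;
    dataMap_[key] = { value, serial };
    ageMap_.emplace_hint(ageMap_.end(), serial, key);
  }

  size_t size() const { return dataMap_.size(); }
};
```

Since serials are strictly increasing, new age entries always go to the end of ageMap, which is why the hinted insert is amortized O(1).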

Would it solve our issue?
Can you think of other use cases?
Do you have a better idea?

It is, definitely, the biggest performance degradation factor in libzim

Do you have numbers for this?
In all my measurements, most of the time was spent in cluster decompression.
The cache size is 16 for the clusters and 512 for the dirents.
I agree that a linear search is definitely not the best algorithm, but the numbers are not so huge.

Is it feasible to have some way to programmatically customize the cache strategy?

What strategy are you thinking about?

In the case of zimdump/zimcheck, we iterate over the entries in cluster order. We don't need to keep a cluster in the cache: as soon as we need another cluster, we know we are finished with the old one (modulo the number of threads). We could simply limit the cache size to the number of threads to reduce the cache size (and speed up the lookup). But I'm not sure it's worth it if we reimplement the cache correctly.
Same for the dirent cache: it is useful for the binary search, but we never use it that way, as we directly use the entry index.

@veloman-yunkan
Collaborator Author

@mgautierfr @kelson42 Before optimizing (rather, reimplementing) our own cache, maybe it makes sense to use a 3rd-party cache?

@veloman-yunkan
Collaborator Author

My own plan was to implement something like https://github.com/lamerman/cpp-lru-cache, with the only difference of using an std::map instead of an std::unordered_map.
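The cpp-lru-cache approach keeps a doubly linked list in recency order plus a map pointing into the list, so refreshing an entry is an O(1) splice after the lookup. A sketch of that design with std::map substituted for std::unordered_map, as proposed above (names and interface here are my own illustration, not the actual cpp-lru-cache code):

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <stdexcept>
#include <utility>

// Sketch of a list+map LRU cache in the style of cpp-lru-cache,
// using std::map instead of std::unordered_map.
template <typename Key, typename Value>
class LruCache {
  using Item = std::pair<Key, Value>;
  std::list<Item> items_;                                   // front = most recent
  std::map<Key, typename std::list<Item>::iterator> index_; // key -> list node
  size_t maxSize_;

public:
  explicit LruCache(size_t maxSize) : maxSize_(maxSize) {}

  const Value& get(const Key& key) {
    auto it = index_.find(key);                        // O(log N)
    if (it == index_.end())
      throw std::range_error("no such key");
    items_.splice(items_.begin(), items_, it->second); // O(1) move-to-front
    return it->second->second;
  }

  void put(const Key& key, const Value& value) {
    auto it = index_.find(key);
    if (it != index_.end()) {                 // overwrite an existing entry
      it->second->second = value;
      items_.splice(items_.begin(), items_, it->second);
      return;
    }
    if (index_.size() >= maxSize_) {          // evict the least recently used
      index_.erase(items_.back().first);      // O(log N)
      items_.pop_back();                      // O(1)
    }
    items_.emplace_front(key, value);
    index_[key] = items_.begin();
  }

  bool exists(const Key& key) const { return index_.count(key) != 0; }
  size_t size() const { return index_.size(); }
};
```

Compared with the two-map/serial scheme, this avoids serial bookkeeping entirely: the list order is the age order, and eviction is simply popping the back of the list.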

@mgautierfr
Collaborator

I had a look at the implementation and it seems good.
And it is a header-only library (like mustache), so it should be relatively easy to compile on all platforms.

It should be pretty easy to test (at least) the cache implementation/efficiency and make a decision based on the results.

@veloman-yunkan
Collaborator Author

Benchmarking cpp-lru-cache against the current implementation of zim::Cache promises significant speed-up of zimcheck:

| ZIM file | size (MB) | article count | cluster count | zimcheck -A runtime (zim::Cache) | zimcheck -A runtime (cpp-lru-cache) |
| --- | --- | --- | --- | --- | --- |
| wikipedia_en_climate_change_nopic_2020-01.zim | 31 | 7646 | 51 | 9.5s | 6s |
| wikipedia_hy_all_mini_2020-08.zim | 563 | 509611 | 1526 | 575s | 229s |

@veloman-yunkan
Collaborator Author

veloman-yunkan commented Aug 19, 2020

Now how do we proceed from here? cpp-lru-cache needs a couple of minor modifications. On the other hand, it is so small and simple that if I implemented it on my own (instead of googling) we might spend less time overall than if we decide to integrate cpp-lru-cache into libzim.

@mgautierfr
Collaborator

The speed-up is good. Promising.
On top of that, zimcheck -A does a lot of checks that take time.
Can you do another benchmark with zimcheck only looping over all dirents/clusters but doing no checks?
(Only checking for internal links and patching the getLink function to do nothing and return an empty vector should do the trick.) This way we would really benchmark the cache speedup.

What kind of minor modifications?
If they can be pushed upstream, we should make a PR and use the project.

But indeed, the implementation is pretty simple. Maybe it is simpler to take the two methods and replace our put/get methods with the ones from cpp-lru-cache (depending on what we need to do).

@kelson42
Contributor

Two remarks:

@veloman-yunkan
Collaborator Author

One more datapoint. In the context of a different PR, I had performed a similar benchmark on a different machine using an earlier version of the "same" ZIM file, and observed about 2x shorter runtimes. The difference was unlikely to be attributable to the difference in hardware, so I also ran zimcheck on that earlier ZIM file on my desktop.

| ZIM file | size (MB) | article count | cluster count | zimcheck -A runtime (zim::Cache) | zimcheck -A runtime (cpp-lru-cache) |
| --- | --- | --- | --- | --- | --- |
| wikipedia_hy_all_mini_2020-07.zim | 560 | 509325 | 1518 | 258s | 105s |
| wikipedia_hy_all_mini_2020-08.zim | 563 | 509611 | 1526 | 575s | 229s |

The newer file contains only slightly more data than the old one, but is twice as costly to process. Any ideas what might have caused this?

@legoktm
Member

legoktm commented Aug 19, 2020

Thanks for the ping. It doesn't look like any distro has packaged cpp-lru-cache. It also seems inactive with no commits since 2017.

It's only about 70 lines, so given that we also need to modify it, I'd suggest just including that header file in the libzim code (with proper attribution/licensing of course), treating it as one of "our" files rather than trying to use it as a library.

@mgautierfr
Collaborator

Any ideas what might have caused this?

I don't know; the zim structures seem relatively identical.
Have you tried running the tests several times? Maybe the difference is because of the fs cache; it could be pretty significant.

@kelson42
Contributor

@veloman-yunkan Have you been able to explain the oddity you reported earlier?

@veloman-yunkan
Collaborator Author

@kelson42 No, I haven't looked into it yet. One hypothesis is cache overflow: if the new, slightly larger data requires a larger resident set size, this may lead to a significantly higher number of cache misses, provided that the RSS (in terms of utilized cache entries) of the earlier/smaller data was close to the cache size limit.

@veloman-yunkan
Collaborator Author

@kelson42 But the actual reason shouldn't prevent us from switching to the better cache. I will debug the noticed oddity later.

Can you do another benchmark with zimcheck only looping over all dirents/clusters but doing no checks?

@mgautierfr I am not going to spend time on this for the same reason. It is obvious that cpp-lru-cache is significantly faster and we should switch to it in any case.

What kind of minor modifications?

You can see them in #405. But I am going to improve the new cache a little more.

@veloman-yunkan
Collaborator Author

Have you tried running the tests several times? Maybe the difference is because of the fs cache; it could be pretty significant.

I repeated the slow runs twice in a row and the run-time figures stayed the same.

@veloman-yunkan
Collaborator Author

One hypothesis is cache overflow: if the new, slightly larger data requires a larger resident set size, this may lead to a significantly higher number of cache misses, provided that the RSS (in terms of utilized cache entries) of the earlier/smaller data was close to the cache size limit.

This hypothesis is likely wrong. Separately varying the dirent cache size from 512 (default) to 1024 and the cluster cache size from 16 (default) to 24 has a very weak effect (under 3%) on the run time of zimcheck -A. I think it makes sense to file a new issue and work on it separately. Should I file it under libzim or zim-tools?

@mgautierfr
Collaborator

I agree with @veloman-yunkan: zimcheck is slower with the new file whether we use the new or the old cache system.
Create the issue in zim-tools, because that is where the problem appears. If we need to move it to libzim, we will do so then.
