Comparison of approaches to tackle index memory usage #2523
I'd like to start with the setup I used for testing during the implementation: I created test data with the following script.
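The original script is not reproduced above; purely as a hypothetical reconstruction matching the description below (1000 directories with 100 small, unique files each), a generator could look roughly like this Go program:

```go
// Hypothetical test-data generator: 1000 directories, 100 small unique files each,
// i.e. 100,000 data blobs plus a fair number of tree blobs. Paths and counts are
// illustrative and can be increased to simulate even more files.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	for d := 0; d < 1000; d++ {
		dir := filepath.Join("testdata", fmt.Sprintf("dir%04d", d))
		if err := os.MkdirAll(dir, 0o755); err != nil {
			panic(err)
		}
		for f := 0; f < 100; f++ {
			// unique content per file so that every file produces its own data blob
			content := fmt.Sprintf("file %d in directory %d\n", f, d)
			name := filepath.Join(dir, fmt.Sprintf("file%04d.txt", f))
			if err := os.WriteFile(name, []byte(content), 0o644); err != nil {
				panic(err)
			}
		}
	}
}
```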
This creates 100,000 small, distinct files in 1000 directories, so it should give a lot of data blobs and still quite a few tree blobs. It can be changed to simulate even more files. I ran the initial backup
and did a repeated backup run with unchanged data for all index strategies:
Besides measuring with `time`, I also added memory profiling.

Standard index

Here, the index is already the largest part of memory. You can also see that `time` reports 90 MB used; this is IMO due to the garbage collector setting that has already been discussed.

Reload index

This seems to work. However, it displayed an ETA of about 20 minutes while constantly using 100% CPU... This is easily explained, as it constantly rereads (and decrypts!) the index files.

Low-Memory index
Total memory reported by the profiler is halved! Moreover, the 12.5 MB Index data set is replaced by an IDSet of less than 5 MB. I would estimate that with this approach the index memory requirement can be reduced to about 40%!

Bolt index - first run
It also creates a 72 MB `my.db` file where the index is saved. As expected, the memory consumption is basically gone!

Bolt index - second run

As the db is not deleted after the run in the bolt setting, I reran with an already present db:
The memory profile is almost identical to the first run, which is as expected. Generally, this may indicate that with a storage-based index we should think about a permanently stored index, because it can boost performance quite a lot!
Maybe worth trying an LSM-based key/value store too. LSMs have quite different performance and disk-footprint tradeoffs compared to B-trees, and may work better for the persistent index use case. I'd personally use RocksDB, but there are probably good-enough Go-native stores too.
Thank you for prototyping this! Looks very encouraging. The index files are currently encrypted, so whatever disk store we use should be encrypted too. I looked at https://github.com/dgraph-io/badger - it's Go native, LSM-based and supports encryption. The low-memory index looks good for backups. I'm not sure if it will scale for restores from repos with a huge number of blobs, though.
Thanks for the discussion so far. This is why I started thinking about another possibility for optimizing the in-memory index. The result is already in my branch index-alternative, named low-mem2. The idea is to store all index data in sorted tables (saving the overhead of the map) and try not to store anything duplicated. The pack ID, for instance, is saved in a separate table and only referenced from the main table (see the sketch below).

So I would suggest the following: I will try to make a PR to replace (or optionally replace) the standard index with the memory-optimized one (low-mem2). This will hopefully give a quick improvement on the memory issue. What is your opinion on that? Maybe some of the core maintainers can give me a hint about the right direction to work in. Thank you!
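A rough sketch of that idea (the type and field names here are illustrative, not the ones used in the low-mem2 branch): index entries live in one densely packed slice sorted by blob ID, and each pack ID is stored only once and referenced by a small integer.

```go
package index

import (
	"bytes"
	"sort"
)

type blobID [32]byte // SHA-256 hash of the blob
type packID [32]byte

type indexEntry struct {
	id      blobID
	packIdx uint32 // index into packs instead of repeating the 32-byte pack ID
	offset  uint32
	length  uint32
}

type lowMemIndex struct {
	packs   []packID     // every pack ID is stored exactly once
	entries []indexEntry // kept sorted by blob ID
}

// lookup finds an entry by blob ID using binary search over the sorted table.
func (idx *lowMemIndex) lookup(id blobID) (indexEntry, bool) {
	i := sort.Search(len(idx.entries), func(i int) bool {
		return bytes.Compare(idx.entries[i].id[:], id[:]) >= 0
	})
	if i < len(idx.entries) && idx.entries[i].id == id {
		return idx.entries[i], true
	}
	return indexEntry{}, false
}
```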
Cutting the memory usage in half sounds like a reasonable short-term solution to me. I'm currently a bit worried about a possible performance hit from keeping the data in a sorted array. The current implementation of the MasterIndex simply iterates over a list containing the indexes. For larger repositories with a few thousand index files, restic spends a large fraction of the time for backups/checks iterating over the different indexes, see #2284.

Regarding your low-mem2 branch: I'm a bit confused by the SortByBlob and SortByPack functions. Shouldn't it be possible to just sort by blobID and then simply iterate over the whole index for lookups based on the packID (as is done by the current implementation)? (My preference would be to get rid of the locking completely once an index has been finished.)
@MichaelEischer: Thanks for pointing out this issue in masterindex. As far as I can see, there are three possible ways the index is used:
As 1. is the primary use case and 2. is quite similar, I agree that we should optimize for this. I would also propose changing the way the master index works and allowing an index structure to be filled with the index data of many index files that have been read (and likewise for index files that were created and are then only used read-only). About low-mem2:
So how do we continue? As already mentioned, I can prepare a PR to replace the current index/masterindex implementation with an optimized one.
@aawsome Sorry for the late reply. I could review early parts of the work; however, it would also be best to get some comments on the overall design from the core maintainers. For testing I have an 8 TB repo with 40 million files at hand; for my index performance tests I've so far used the

The MasterIndex is used for cases 1 and 2. Merging all index files in memory into a static index, which would suffice for case 1, should be rather easy to implement. A single sorted list will definitely be faster than the current index once a certain number of index files is reached.

The alternatives to a sorted list that come to my mind right now would be some sort of tree or a hashmap. The former would probably require some additional pointers in the data structure, and the latter would be rather similar to the current implementation. I've thought about replacing the MasterIndex with a single large hashmap, but so far the growth strategy for the backing buffer of the hashmap has kept me from trying it: Go doubles the backing buffer size once a certain load factor is reached, which in the end might require temporarily keeping a rather empty new backing buffer in memory along with the old one. This scenario would probably require much more memory than the current implementation. This should be solvable by using two levels of hashmaps, where the first one is indexed using the first few bits of the blob ID (see the sketch after this comment). I'm just not sure how to efficiently handle blob IDs which are not uniformly distributed and could thus still lead to large imbalances in the second-level hashmaps. It might be worthwhile to look at the index data structures used by databases or borg-backup.

Case 3 (prune/rebuild-index/find) is handled by
Searching for packs in a sorted-by-blob list would work, but would probably be a bit slower than the current implementation. The current implementation iterates over all indexes and stops once an index containing the pack is found, which on average requires traversing half of the list. Scanning the sorted-by-blob list would always require a full scan. On the other hand, the scanning itself should be faster, so the overall performance should be more or less comparable.
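The two-level/partitioned hashmap idea mentioned above could look roughly like the following sketch (the names and the choice of 256 partitions keyed on the first byte of the blob ID are illustrative, not taken from any branch). Each partition grows and rehashes independently, so the temporary memory spike of growing one map stays small:

```go
package index

type blobID [32]byte

type location struct {
	packIdx uint32 // index into a shared pack ID table
	offset  uint32
	length  uint32
}

// partitionedIndex splits the index into 256 smaller hashmaps keyed by the
// first byte of the blob ID, limiting the rehashing spike to one partition.
type partitionedIndex struct {
	parts [256]map[blobID]location
}

func newPartitionedIndex() *partitionedIndex {
	p := &partitionedIndex{}
	for i := range p.parts {
		p.parts[i] = make(map[blobID]location)
	}
	return p
}

func (p *partitionedIndex) add(id blobID, loc location) {
	p.parts[id[0]][id] = loc
}

func (p *partitionedIndex) lookup(id blobID) (location, bool) {
	loc, ok := p.parts[id[0]][id]
	return loc, ok
}
```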
@aawsome So right now the focus of the changes would be to optimize the memory usage of the index along with the MasterIndex performance, while deferring fundamental changes like using an on-disk database until later? Regarding case 3: do you plan to also replace
Your "Optimize Index" merge request would add another usage which actually requires an index grouped by pack files. Regarding an optimized MasterIndex: My last comment mostly looked at using a single large index datastructure that would probably need to use partitioning by key to avoid large spikes in memory usage when adding new entries. That way of partitioning has the downside that adding a new index likely requires touching every partition. In addition imbalances in the partitions size could cause further problems. However, there is also another option (which is slightly inspired by log structured merge trees): Keep the number of entries in the MasterIndex roughly constant at a value c. This would mean that a new index would after finalizing be merged into one of the existing index parts. That way only a part of the index is modified at a time. For hash maps this could be used to keep the load factor rather high as each part would be filled up before using the next one. For sorted arrays some other strategy sounds more reasonable: Merge index parts into larger ones with roughly 1k, 2k, 4k, 8k, ... entries while only merging the currently largest size class once a certain constant of index parts with that size exists. This should provide |
Yes, that is my focus. We can discuss adding options for the backup command to only save tree blobs in the index (and thus naively saving each new blob in the snapshot), or to maybe only load the data blobs used in the previous snapshot into the index (and thus only having deduplication against blobs in the previous snapshots). This would allow low-memory clients to back up to large repositories, followed by prune operations on big-memory machines to do the full deduplication.
I do think that this is possible and that it should be the goal of the new implementation. I didn't, however, analyze it in detail - thanks for your hints! About the implementation details: I'll need a bit of time to read and think about it. Thank you very much for your ideas and support!
As announced, I prepared an early version of the refactored index, see refactor-index. It's still WIP, but almost all tests already pass, and from my tests it looks pretty good.
The implementation uses one big (sorted) array to store the index and the

Hence, the crucial function is

About your comments, @MichaelEischer: I cannot really estimate the impact of appending to the indextable, see line 329 in

Also, after inserting, the table needs to be sorted again (at least before searching it the first time). I guess that the current implementation should avoid unnecessary sort operations, but this needs performance tests. Can anybody help with these tests?
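As a rough illustration of the deferred-sorting behaviour described above ("sort only before searching the first time"); names and layout here are hypothetical and not taken from the refactor-index branch:

```go
package index

import (
	"bytes"
	"sort"
)

type entry struct {
	id                      [32]byte
	packIdx, offset, length uint32
}

// indexTable appends cheaply and re-sorts lazily: the sort is only done when
// a lookup actually needs the table to be sorted.
type indexTable struct {
	entries []entry
	sorted  bool
}

func (t *indexTable) add(e entry) {
	t.entries = append(t.entries, e)
	t.sorted = false // defer sorting until the next lookup
}

func (t *indexTable) lookup(id [32]byte) (entry, bool) {
	if !t.sorted {
		sort.Slice(t.entries, func(i, j int) bool {
			return bytes.Compare(t.entries[i].id[:], t.entries[j].id[:]) < 0
		})
		t.sorted = true
	}
	i := sort.Search(len(t.entries), func(i int) bool {
		return bytes.Compare(t.entries[i].id[:], id[:]) >= 0
	})
	if i < len(t.entries) && t.entries[i].id == id {
		return t.entries[i], true
	}
	return entry{}, false
}
```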
I'm willing to help with performance testing. What exactly do you want to have tested?
@dimejo: I would prefer 1:1 comparisons (i.e. the identical run with both branches) between the master branch (or the last release) and my branch. I think the following commands should at least be tested:
[edit: added list, cat and find commands] Please write if you don't understand what I mean or if you need more detailed instructions on how to run these tests!
I'll try to do all the tests this weekend, but I need some help enabling profiling. I'm not a programmer and following advice on the internet didn't work out that well :/
You can start with

To enable profiling, first compile restic with the

This creates a
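The build tag and flag names are missing from the quoted text above, so only as general background: restic's profiling support ultimately builds on Go's standard runtime/pprof machinery, and the resulting profile files can be inspected with `go tool pprof`. A minimal, hypothetical stand-alone example of producing such files:

```go
// Minimal sketch of Go's standard CPU/heap profiling. File names are illustrative.
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// CPU profile: samples the program while it runs.
	cpu, err := os.Create("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer cpu.Close()
	if err := pprof.StartCPUProfile(cpu); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// ... run the workload to be measured here ...

	// Heap profile: snapshot of live allocations at this point.
	heap, err := os.Create("mem.pprof")
	if err != nil {
		panic(err)
	}
	defer heap.Close()
	if err := pprof.WriteHeapProfile(heap); err != nil {
		panic(err)
	}
}
```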
Awesome! I thought it would be much more complex to enable profiling. Will try it out and report back.
@aawsome Thanks for your work, that's really a lot of changes. I did some test runs with the

In my opinion your commit with the index changes is far too large to be reviewable; a thousand lines of added code along with two thousand deletions is really a lot.
I have some (incomplete) comments on the code itself, but I haven't thought much yet about the overall code architecture: struct
Race condition in
Ah, I've found the problem: right now the call to PrepareCache is commented out. PrepareCache also configures the PerformReadahead function of the Cache. Without this, Check has to load every single tree data blob separately from the backend.
@MichaelEischer: Thank you very much for your very valuable comments on this early stage of the implementation 👍 I do know that it is quite a lot of code change - however, I think it is also a simplification of the current index implementation, hence the large number of deleted files/code lines. Sorry about the

About your correct comments on the

About the sorting: right now the code is supposed to do the sorting only when needed, i.e. within the first call

I'll take care of the other issues - thanks especially for finding the race conditions!
This.
This is when you should split your changes into separate individual PRs, so they're manageable and have clear intent.
@rawtaz: IMO the change is not that big considering that it is a complete re-implementation of the index (and please note: it also re-implements some legacy stuff, see

However, this is WIP and has some issues I'm willing to fix before thinking about how to best integrate this work into the master branch.
This should be fixed now.
Is changed now.
The whole byPack logic was unused, and you correctly pointed out that it also contained some errors. I cleaned up this dead code.
I hope I found a nice way to handle this. I basically test
This is now done using the new atomic
Is fixed now.
This is fixed now.
This issue is still open. I did not find a nice way to handle it with just one unfinished index.
I'll give it a try in the next few days.
This still leaves the possibility of a collision between
It seems like you've missed
Hmm, the main blocker here is that you can't get the index pack ID before the upload call has completed. It would be rather easy to solve by allowing multiple indexes to be unfinished, while only the last one may be written to and the predecessors are currently uploading.
Oops - I was in fact so focused on SortByBlobs that I didn't consider the methods from which it is called :-(
Is fixed now.
I also fixed this. I now encode/encrypt the index and write it to the main index while locked, but write the file to the backend after releasing the lock.

Moreover, I found another optimization that reduces memory consumption quite a lot (and also fixes the append problem): I replaced the large array with a structure that implements a "paged" list of blobs. A list of "pages" is maintained, and if it needs to be enlarged, just a new page is allocated. This allows appending without recopying all old elements. Also, the overhead is fixed and at most one page in size, whereas with append you get quite a large overhead (the array gets an increasing "capacity reserve" to ensure it doesn't need to recopy too often). This optimization seems to save another ~15-20% of memory.

I'm happy to get feedback and would even more like to see performance tests with large repositories.
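As a rough illustration of the "paged" list idea (the page size and names here are made up, not the branch's actual values): growth allocates a new fixed-size page instead of reallocating and copying one large slice, so old entries are never recopied and the unused reserve is bounded by a single page.

```go
package index

const pageSize = 4096 // entries per page; illustrative value

type entry struct {
	id                      [32]byte
	packIdx, offset, length uint32
}

// pagedList stores entries in fixed-size pages that are allocated on demand.
type pagedList struct {
	pages [][]entry
	count int
}

func (l *pagedList) add(e entry) {
	// allocate a new page when the list is empty or the last page is full
	if len(l.pages) == 0 || len(l.pages[len(l.pages)-1]) == pageSize {
		l.pages = append(l.pages, make([]entry, 0, pageSize))
	}
	last := len(l.pages) - 1
	l.pages[last] = append(l.pages[last], e)
	l.count++
}

func (l *pagedList) get(i int) entry {
	return l.pages[i/pageSize][i%pageSize]
}

func (l *pagedList) size() int { return l.count }
```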
Here are the results of my first comparison.
Then I backed up this data dir three times and ran
I ran these steps once with the master branch (compiled to

Here are the results:
time_backup2_old.txt
time_backup3_old.txt
time_check_old.txt

To summarize:
I'm finally done with my tests. It took longer than estimated because I had to re-run my tests due to some mysterious results. I wanted to compare results between 1 and multiple cores. Interestingly, the VM with only 1 vCPU was a lot faster for almost all operations. Maybe @fd0 knows what's going on.

Results for VM1 (1 vCPU)

Note that I've made 1 CPU and 1 memory profiling run for each restic version. The time shown (in seconds) in this table is the average of both runs.
Results for VM8 (8 vCPU)

Note that I've made 1 CPU and 1 memory profiling run for each restic version. The time shown (in seconds) in this table is the average of both runs.
Additional information
files changed/added for backup 1:
files changed/added for backup 2:
files changed/added for backup 3:
Logs

logs_vm1_cpu.zip
@dimejo Thank you very much for your tests!

First, regarding your question about the increased time for the 8 CPU setting: you are comparing the "user time", that is, the total CPU seconds used. With parallel processing this is (and should be) higher than with a single CPU, as there is always some overhead due to parallelization. In fact, your 8 CPU runs were much faster, see the elapsed time. You used the commit

In general I'm quite satisfied with these results! It seems my implementation uses much less memory while still having performance similar to the original index implementation. (BTW: I'll fix the obviously open issue with
@aawsome Thanks for writing that summary comment. One of the things I as a non-core developer (I'm just a maintainer of simple things) experience is that while core developer time is spent on other parts of the project, issues like this one just keep growing. Then whenever a core developer looks at it, it's a lot, or in some cases too much, to dig into at the time. The less there is to read for them, the higher the chances that we can make good use of their time. I took the liberty of editing the initial post in this issue to add a reference to your summary comment.

Regarding your questions, I personally cannot answer them. I will have to defer them a while, and I kindly ask for your patience. Our main focus right now with the available time we have is some other stuff, but I can tell you that this issue and your PR are on my radar and I am trying to get eyes on them eventually. I think that with your summary comment in place, besides fixing bugs (I'm slightly concerned that there might be some more small bugs lurking) and whatever you can find that doesn't deviate from what you wrote in your comment, the most productive thing is to let this rest a bit. The next step would be to get core developer eyes on it, and until that happens it's probably counter-productive to grow the discussion further. Let me know if my rationale doesn't seem to make sense, so I can clarify what I mean.
I finally found time to test the performance of the backup command on an 8 TB, 40M-files repository. To keep the overall runtime low I've just backed up a 6 GB directory. CPU usage of restic was limited using

current master: cpu-idx-master.pdf

With this PR restic uses around 6.5 GB of memory instead of 18 GB for the current master, which is a gigantic improvement. The index takes around 9 minutes to load in both cases (the host does not support AES-NI, so decryption is a bit slow). The backup takes 34 minutes for this PR and 18 minutes for the current master. Taking a look at the CPU profiling graphs shows the difference: the indexTable spends 16 minutes just sorting the in-memory index. The index is large enough that a single sort call for the in-memory index takes about 75 seconds (!) to complete. During this time the whole backup processing is blocked, as the index is locked exclusively. The in-memory index probably needs a way to merge a new index without resorting everything, maybe even implemented incrementally. This should be possible by sorting the new index beforehand and then merging it into the MasterIndex with a single scan over the indexTable.
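A sketch of that single-scan merge (illustrative types and names; both inputs are assumed to already be sorted by blob ID):

```go
package index

import "bytes"

type entry struct {
	id                      [32]byte
	packIdx, offset, length uint32
}

// mergeSorted merges the already sorted index table with the entries of a new,
// pre-sorted index file in a single linear pass, avoiding a full re-sort.
func mergeSorted(table, added []entry) []entry {
	out := make([]entry, 0, len(table)+len(added))
	i, j := 0, 0
	for i < len(table) && j < len(added) {
		if bytes.Compare(table[i].id[:], added[j].id[:]) <= 0 {
			out = append(out, table[i])
			i++
		} else {
			out = append(out, added[j])
			j++
		}
	}
	// append whatever remains of either input
	out = append(out, table[i:]...)
	out = append(out, added[j:]...)
	return out
}
```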
@MichaelEischer Thank you very much for doing your test! In fact, the current implementation needs a resort of the index table each time an index file is written. So if I calculated correctly, 13 new index files have been created, right? Actually my judgement is that resorting an almost sorted table with |
When I zoom into the profile data

The backup run actually created 16 new index files. The 75 seconds were a guess from a

I've rerun the test with the updated code and the performance is a lot better: cpu-idx-table-split.pdf

However, using a separate index for new index entries merely hides, but does not solve, the underlying problem: just imagine a backup run that adds a few million new files, or a large initial backup. The current implementation has

I wonder whether there is a data structure that would perfectly fit our needs:
When dropping the requirement for a really compact in-memory representation, hashmaps are a perfect fit, as their queries and updates happen in amortized constant time. One solution would be to combine both approaches: collect new entries in a hashmap and merge them into the sorted array from time to time (e.g. when the hashmap's size reaches 10% of the sorted array). That said, it is probably best to wait for feedback from @fd0 before making any extensive changes. (Sorry for growing the discussion again.)
I'd say there are "many" data structures more or less satisfying these needs, but I don't know how much effort it would be to switch to them in restic. Basically any B-tree, or tree structures generally, should satisfy these needs (hash maps are really inefficient for such scenarios, as you also found out yourself in your measurements).
If I'm not mistaken, a B-tree only guarantees that its nodes are at least half full. And it would require pointers, which would more or less add an 8-byte pointer for every 48 bytes of data. If a B-tree guarantees 70% space usage on average, then it would end up with a memory overhead similar to my hashmap-based implementation. There might be some variants like B*-trees which are more efficient, though.
@michaeldorner yeah, that's the "old traditional B-tree" - as you wrote, there are many (I've seen at least ten quite different ones so far) B-tree-like structures, each having different properties. Other well-known tree-like structures include e.g. red-black trees (popular for maps which must guarantee consistent insertion/deletion time & space; this is in contrast with hash-based maps, which might suffer from "jumps" or other excessive peaks in both time & space, eventually leading to severe security/performance problems). There are also various "compressed" maps - feel free to search for them. So there really are a number of options. Though I'm not against using a small temporary file as discussed in this topic, like e.g. sqlite does - btw. sqlite is highly efficient, and I've used it as storage just for the purpose of extremely low memory usage while offering quite high throughput despite using hard disks as a temporary cache - feel free to try it out, it might surprise you. Just my 2 cents 😉.
The sorted, densely packed array of index entries is basically the gold standard for memory usage. Yes, you could apply some compression, but that would only help with the blob size/offset and pack ID index (approx. 16 bytes); the remaining 32 bytes are the blob ID (a SHA-256 hash), which I'd assume to be basically incompressible. I'm not sure whether the potential for another 20-30% lower memory usage would justify the even higher complexity. Trees always require one pointer (plus some unused space, as with B-trees) or two pointers (red-black trees and others) per element, which would already add 8 or 16 bytes of overhead to the approx. 48 bytes of the index entry itself. I'm also not sure how well the Go garbage collector would handle millions or even billions of index entries.

To save the honor of hash tables: cuckoo hashing (constant lookup time) together with 4 entries per bucket is said to achieve over 90% space utilization. If you just use ten of these hash tables then the total memory overhead would also stay below 33%, even while growing a hashmap.

I think the main question is more whether we want a single data structure (tree/hashmap) with a certain memory overhead, or whether we want the lower overhead of a sorted packed array, which will need an auxiliary data structure to provide a reasonable insertion speed. If that auxiliary data structure just contains 10-20% of the overall index data, then it won't matter much whether it has 30% or 70% overhead, and in that case Go's builtin hashmap would be the simplest solution.

Using an on-disk index has its own share of problems: if the index is too large to be kept in memory, then the index performance will probably implode on HDDs (SSDs should be much better off) unless the index is able to benefit from locality between the blobs of e.g. one folder/file (in a fully sorted list of blob IDs, the access pattern would basically be uniformly random).
@MichaelEischer thanks for the wrap-up. I'd still, though, encourage you to try a disk cache (just to get a glimpse of possible performance, not to be incorporated in restic) using sqlite3 with different setups (e.g. both with and without an index, with higher max-memory limits than the very low defaults, using just one table as key-value map-like storage, etc.) while leveraging some sqlite3 peculiarities like the built-in integer primary key, multithreaded access without explicit locking (sqlite handles it itself), using one huge transaction for all accesses, and not using WAL, as it's significantly slower for reads (but slightly faster for writes). Using a disk cache might also benefit quite significantly from stream-oriented processing, but I don't know whether it's a good fit for a quick tryout. Just an idea based on my good experience with sqlite3-based key-value storage 😉.
@aawsome Could you split your "list blobs" optimization into a separate PR? I don't plan to merge the

That leaves

Regarding the major changes, these also have to stay on hold for now.

@dumblob The problem with using sqlite or any other generic database for an on-disk index is that, without optimizations that make use of locality between blobs in a file (i.e. blobs within a file end up in nearby packs and are usually accessed together), we'll end up with pseudo-random access to all parts of an index that is larger than main memory and therefore has to load data from random locations on disk. That is, a generic database implementation has no chance at all of providing good performance.
Sure. PR will come soon. |
In fact, the actual implementation of

As the usual case is that there are no duplicates within one index file, this actually does not return duplicates even in cases where they might be needed. I will prepare a PR to change
NOTE TO CODE REVIEWERS / CORE CONTRIBUTORS
Please read this comment for a summary of changes and structured information about each of those changes. This might be a better starting point than reading through all of the comments in this issue.
Describe the issue
As stated in #1988, the memory consumption of the index is one of the biggest contributors to restic's overall memory consumption. The main point is the actual implementation of the in-memory index storage in
internal/repository/index.go
I would therefore like to start a new issue to discuss strategies that could remedy the high memory consumption.
As a starting point I opened a branch index-alternatives which implements the following strategies:
The index strategy can be chosen with
restic -i <default|reload|low-mem|bolt>
The branch is considered WIP. I added a general interface `FileIndex` which should be fulfilled by new index strategies. Also, the actual implementation just loads index files (file-per-file) into the standard `Index` struct and then reloads them into a new data structure, allowing the GC to clean up. Newly created index files are not affected, so there should only be an effect for repositories that already contain data.

Please note that most implementations are quick-and-dirty and miss serious error handling, clean-up, etc.

More index strategies can easily be added by adding a new implementation for `FileIndex` and adding it in `repository.go` (where loading takes place) and `cmd/restic/global.go` (where the flag `-i` is interpreted).

I appreciate feedback on the proposed index strategies.
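The exact method set of `FileIndex` is not shown here; purely as a hypothetical illustration, an interface along the following lines would let the different strategies (default, reload, low-mem, bolt) be swapped behind one API:

```go
package index

type blobID [32]byte

type blobLocation struct {
	packID [32]byte
	offset uint32
	length uint32
}

// FileIndex is a hypothetical sketch of the interface a new index strategy
// would have to satisfy; the actual method set in the branch may differ.
type FileIndex interface {
	// Lookup returns where a blob is stored, if the index knows about it.
	Lookup(id blobID) (blobLocation, bool)
	// Has reports whether a blob is present without returning its location.
	Has(id blobID) bool
	// Each calls fn for every index entry, e.g. for prune or rebuild-index.
	Each(fn func(id blobID, loc blobLocation))
}
```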
In this task I would also like to collect good test settings where different index strategies can be compared to each other.