Comparison of approaches to tackle index memory usage #2523

Closed · aawsome opened this issue Dec 20, 2019 · 55 comments · Fixed by #2818
@aawsome (Contributor) commented Dec 20, 2019

NOTE TO CODE REVIEWERS / CORE CONTRIBUTORS

Please read this comment for a summary of changes and structured information about each of those changes. This might be a better starting point than reading through all of the comments in this issue.

Describe the issue

As stated in #1988, the index is one of the biggest contributors to restic's memory consumption. The main culprit is the current implementation of the in-memory index storage in internal/repository/index.go.

I would therefore like to start a new issue to discuss strategies that could remedy the high memory consumption.

As a starting point I opened a branch index-alternatives which implements the following strategies:

  • Standard: the index as currently used in restic
  • Reload: Don't keep anything in memory. For every index operation, reload the index files and use the current implementation on the temporarily created index.
  • Low-memory index: Store full index data only for tree blobs. For data blobs, only keep an IDSet recording which data blobs are present. This covers most index operations; for the missing ones, fall back to Reload (at least for data blobs).
  • Bolt: Use an on-disk database via bbolt

The index strategy can be chosen with
restic -i <default|reload|low-mem|bolt>

The branch is considered WIP. I added a general interface FileIndex which new index strategies should implement. The current implementation simply loads the index files (file by file) into the standard Index struct and then converts them into the new data structure, allowing the GC to clean up the intermediate copies. Newly created index files are not affected, so there should only be an effect for repositories that already contain data.
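For illustration, such a strategy interface could cover roughly the following (illustrative sketch only; method names and signatures may differ from the actual FileIndex interface in the branch, restic.* types are from internal/restic):

// Illustrative sketch - not the actual FileIndex interface from the branch.
type FileIndex interface {
	// Has reports whether the index knows a blob of the given type.
	Has(id restic.ID, tpe restic.BlobType) bool
	// Lookup returns the pack(s), offset and length where the blob is stored.
	Lookup(id restic.ID, tpe restic.BlobType) []restic.PackedBlob
	// Store records the location of a newly written blob.
	Store(pb restic.PackedBlob)
}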

Please note that most implementations are quick-and-dirty and lack serious error handling, clean-up, etc.

More index strategies can easily be added by implementing FileIndex and registering the implementation in repository.go (where loading takes place) and cmd/restic/global.go (where the -i flag is interpreted).

I would appreciate feedback on the proposed index strategies.
In this issue I would also like to collect good test settings in which the different index strategies can be compared with each other.

@aawsome (Contributor, Author) commented Dec 20, 2019

I'd like to start with the setup I used for testing during the implementation:

I created test data with the following script

#!/bin/sh
mkdir data
for i in `seq 1 1000`; do
        echo $i
        mkdir data/$i
        for j in `seq 1 100`; do
                echo $i$j > data/$i/$j
        done
done

This creates 100,000 small, distinct files in 1,000 directories, so it should yield a lot of data blobs and still quite a few tree blobs. It can be adjusted to simulate even more files.

I ran the initial backup

restic -r /path/to/repo init
restic -r /path/to/repo backup /path/to/data

and did a repeated backup run with unchanged data for all index strategies:

/usr/bin/time restic -r /path/to/repo -i <strategy> backup /path/to/data

Besides measuring with time, I also enabled memory profiling.
Here are the results:

Standard index


Files:           0 new,     0 changed, 100000 unmodified
Dirs:            0 new,     0 changed,     2 unmodified
Added to the repo: 0 B  

processed 100000 files, 567.676 KiB in 0:08
snapshot d7a3f88b saved
13.14user 1.71system 0:11.16elapsed 133%CPU (0avgtext+0avgdata 90564maxresident)k
0inputs+64outputs (0major+1756minor)pagefaults 0swaps

(pprof) top  
Showing nodes accounting for 15.21MB, 97.56% of 15.59MB total
Dropped 84 nodes (cum <= 0.08MB)
Showing top 10 nodes out of 59
      flat  flat%   sum%        cum   cum%
   12.52MB 80.26% 80.26%    12.52MB 80.26%  github.com/restic/restic/internal/repository.(*Index).store
    1.01MB  6.46% 86.72%     1.01MB  6.50%  github.com/restic/chunker.NewWithBoundaries
    0.39MB  2.52% 89.23%     0.39MB  2.52%  reflect.New
    0.36MB  2.29% 91.53%     0.37MB  2.34%  github.com/restic/restic/internal/restic.NodeFromFileInfo
    0.29MB  1.89% 93.41%     1.10MB  7.05%  github.com/restic/restic/internal/archiver.(*Archiver).SaveDir
    0.28MB  1.80% 95.22%     0.28MB  1.80%  bytes.makeSlice
    0.15MB  0.94% 96.15%     0.15MB  0.94%  github.com/restic/restic/internal/archiver.(*TreeSaver).Save
    0.09MB   0.6% 96.76%     0.09MB   0.6%  strings.(*Builder).grow
    0.08MB   0.5% 97.26%     0.08MB   0.5%  github.com/restic/restic/internal/restic.BlobSet.Insert
    0.05MB   0.3% 97.56%     0.38MB  2.46%  encoding/json.Marshal

Here, the index is already the largest part of the memory usage. You can also see that time reports 90 MB used; IMO this is due to the garbage collector behavior that has already been discussed.

Reload index

This seems to work. However, it displayed an ETA of about 20 minutes while constantly using 100% CPU. This is easily explained: it constantly rereads (and decrypts!) the index files.
I therefore consider this index strategy not worth a deeper look.

Low-Memory index

<snip: same output as above>
processed 100000 files, 567.676 KiB in 0:08
snapshot f8822cee saved
13.47user 2.03system 0:11.90elapsed 130%CPU (0avgtext+0avgdata 109584maxresident)k
0inputs+72outputs (0major+1945minor)pagefaults 0swaps

(pprof) top
Showing nodes accounting for 7.68MB, 94.71% of 8.11MB total
Dropped 88 nodes (cum <= 0.04MB)
Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
    4.84MB 59.68% 59.68%     4.84MB 59.68%  github.com/restic/restic/internal/restic.IDSet.Insert
    1.01MB 12.50% 72.18%     1.02MB 12.58%  github.com/restic/chunker.NewWithBoundaries
    0.41MB  5.05% 77.23%     0.41MB  5.05%  reflect.New
    0.35MB  4.26% 81.48%     0.35MB  4.26%  github.com/restic/restic/internal/restic.NodeFromFileInfo
    0.28MB  3.47% 84.95%     0.28MB  3.47%  bytes.makeSlice
    0.27MB  3.31% 88.26%     1.01MB 12.46%  github.com/restic/restic/internal/archiver.(*Archiver).SaveDir
    0.19MB  2.33% 90.60%     0.19MB  2.33%  github.com/restic/restic/internal/archiver.(*TreeSaver).Save
    0.18MB  2.22% 92.81%     4.94MB 60.93%  github.com/restic/restic/internal/repository.(*IndexLowMem).store
    0.08MB  0.96% 93.78%     0.08MB  0.96%  github.com/restic/restic/internal/restic.BlobSet.Insert
    0.08MB  0.93% 94.71%     0.40MB  4.92%  encoding/json.Marshal

The total memory reported by the profiler is halved! Moreover, the 12.5 MB Index data set is replaced by an IDSet of less than 5 MB. I would estimate that with this approach the index memory requirement can be reduced to about 40% of the original!
The trade-off is a slightly higher total backup time (13.5 s compared to 13.1 s with the standard index).
I'm really interested in how this index strategy performs in other test settings!

Bolt index - first run

<snip: same output as above>
processed 100000 files, 567.676 KiB in 0:30
snapshot 9ff3d71f saved
37.01user 3.25system 0:34.54elapsed 116%CPU (0avgtext+0avgdata 282816maxresident)k
0inputs+114096outputs (0major+5436minor)pagefaults 0swaps

(pprof) top
Showing nodes accounting for 2592.31kB, 90.31% of 2870.50kB total
Dropped 79 nodes (cum <= 14.35kB)
Showing top 10 nodes out of 81
      flat  flat%   sum%        cum   cum%
 1030.83kB 35.91% 35.91%  1037.16kB 36.13%  github.com/restic/chunker.NewWithBoundaries
  390.02kB 13.59% 49.50%   390.02kB 13.59%  reflect.New
  278.60kB  9.71% 59.20%   278.60kB  9.71%  github.com/restic/restic/internal/restic.NodeFromFileInfo
  274.66kB  9.57% 68.77%   933.59kB 32.52%  github.com/restic/restic/internal/archiver.(*Archiver).SaveDir
  244.05kB  8.50% 77.27%   244.05kB  8.50%  bytes.makeSlice
  169.81kB  5.92% 83.19%   169.81kB  5.92%  github.com/restic/restic/internal/archiver.(*TreeSaver).Save
      80kB  2.79% 85.98%       80kB  2.79%  github.com/restic/restic/internal/restic.BlobSet.Insert
   44.17kB  1.54% 87.52%   292.20kB 10.18%  github.com/restic/restic/internal/archiver.(*TreeSaver).save
   40.16kB  1.40% 88.91%    40.16kB  1.40%  strings.(*Builder).grow
      40kB  1.39% 90.31%       40kB  1.39%  github.com/restic/restic/internal/restic.NewBlobBuffer

It also creates a 72 MB my.db file where the index is saved.

As expected, the memory consumption is basically gone!
This, however, comes with a big trade-off in backup time (37 s compared to 13.1 s with the standard index).
I think it would be interesting to see how this performs with bigger data sets. I have no clue whether this performance penalty is just a constant factor of about 3 or grows more than linearly.
Moreover, it seems to me that storage-based index strategies will need much more testing and fine-tuning.

Bolt index - second run

Since in the bolt setting the DB is not deleted after the run, I reran the backup with an already present DB:

processed 100000 files, 567.676 KiB in 0:08
snapshot b364fba4 saved
14.26user 1.94system 0:12.07elapsed 134%CPU (0avgtext+0avgdata 133464maxresident)k
0inputs+88outputs (0major+2601minor)pagefaults 0swaps

The memory profile is almost identical to the first run, which is as expected.
I really wonder why this second run is so much faster than the first. Is the filesystem cache doing the work now? Or does boltdb somehow use the already stored index to speed things up?

Generally this may indicate that with a storage-based index we should think about a permanently stored index, because it can boost performance quite a lot!

@ifedorenko (Contributor):

Maybe worth trying an LSM-based key/value store too. LSMs have quite different performance and disk-footprint trade-offs compared to B-trees, and may work better for the persistent index use case. I'd personally use RocksDB, but there are probably good-enough Go-native stores too.

@shtratos:

Thank you for prototyping this! Looks very encouraging.

The index files are currently encrypted, so whatever disk store we use should be encrypted too. I looked at https://github.com/dgraph-io/badger - it's Go native, LSM-based and supports encryption.

The low-memory index looks good for backups. I'm not sure if it will scale for restores from repos with a huge number of blobs though.

@aawsome (Contributor, Author) commented Jan 10, 2020

Thanks for the discussion so far.
I have to admit that I haven't spent much time trying other storage-based index options (badger is still on my to-do list).
The reason is the following: I expect quite a few issues to arise when seriously trying to bring this to a production-grade state - encryption, how to clean up files, how to deal with aborted operations, and so on.

This is why I started thinking about another possibility for optimizing the in-memory index. The result is already in my branch index-alternative and is named low-mem2. The idea is to store all index data in sorted tables (saving the overhead of the map) and to avoid storing anything twice. The pack ID, for instance, is kept in a separate table and only referenced from the main table.
I have not run many tests so far, but it seems that performance is just a little worse than with the standard index. On the other hand, memory consumption is comparable to low-mem, i.e. a saving of around 60% compared to the standard index in my test setting, see above.
Considering that most operations (e.g. backup) only need specific parts of the index, this can be improved further.
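For illustration, a compact, deduplicated layout of this kind could look roughly like the following sketch (names and field sizes are illustrative, not the actual low-mem2 code; restic.ID is the 32-byte hash type from internal/restic):

// Sketch of a compact, deduplicated index layout.
type blobEntry struct {
	id        restic.ID // blob ID; the blobs slice is kept sorted by this field
	offset    uint32    // offset of the blob inside its pack file
	length    uint32    // length of the blob inside the pack file
	packIndex uint32    // index into packIDs instead of a full 32-byte pack ID per blob
}

type lowMemIndex struct {
	packIDs []restic.ID // every pack ID is stored exactly once
	blobs   []blobEntry // sorted by blob ID and searched via binary search
}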

So I would suggest the following: I will try to make a PR to replace (or optionally replace) the standard index with the memory-optimized one (low-mem2). This will hopefully give a quick improvement on the memory issue.
The possibility of using a disk-based index then remains open for the future.

What is your opinion? Maybe some of the core maintainers can give me a hint about the right direction to work in. Thank you!

@MichaelEischer (Member):

Cutting the memory usage in half sounds like a reasonable short-term solution to me. I'm currently a bit worried about a possible performance hit from keeping the data in a sorted array. The current implementation of the MasterIndex simply iterates over a list containing the indexes. For larger repositories with a few thousand index files, restic spends a large fraction of the time for backups / checks iterating over the different indexes, see #2284.

Regarding your low-mem2 branch: I'm a bit confused by the SortByBlob and SortByPack functions. Shouldn't it be possible to just sort by blob ID and then iterate over the whole index for lookups based on the pack ID (as the current implementation does)? (My preference would be to get rid of the locking completely once an index has been finalized.)
The code is currently only able to convert an existing Index to the new format. It would be rather nice if it could also convert a finalized Index to the more efficient format, so that the improved memory consumption also applies during the initial backup.

@aawsome (Contributor, Author) commented Jan 13, 2020

@MichaelEischer: Thanks for pointing out this issue in the MasterIndex. As far as I can see, there are three ways the index is used:

  1. read from index files and then used read-only
  2. created during the run and then used read-only
  3. operations that modify the index (only purging/recovering)

As 1 is the primary use case and 2 is quite similar, I agree that we should optimize for these. I would propose to also change the way the master index works and allow a single index structure to be filled with the data of the many index files that have been read (and also with index files created during the run and then only used read-only).

About low-mem2:

  • I added SortByBlob and SortByPack as there is a use case for sorted packs. But I also agree we should aim for unblocked parallel operations on the index. As far as I can see, the choice of whether to search by blob or by pack is determined by the main command. Hence, even if we do block for resorting, it would effectively be non-blocking for the main operation. And, of course, we should use an RWMutex.
    Maybe searching for packs in a sorted-by-blob list is also OK; that just needs some testing.
  • About the performance considerations: the fact that restic at the moment just reads linearly through all index files makes me quite confident that one sorted list of the blobs of all index files plus binary search can even outperform the current implementation for large indices, i.e. for many index files ;-)

So how do we continue? As already mentioned, I can prepare a PR to replace the current index/masterindex implementation with an optimized one.
@MichaelEischer, I would appreciate early feedback and improvement suggestions - may I give you some early parts of the work for review? Maybe we'll also find some people to test the changes with respect to performance and memory usage in real-life scenarios...

@MichaelEischer (Member) commented Jan 25, 2020

@aawsome Sorry for the late reply. I could review early parts of the work; however, it would also be best to get some comments on the overall design from the core maintainers. For testing I have an 8 TB repo with 40 million files at hand; for my index performance tests I've so far used the check command, which stresses the index a lot.

The MasterIndex is used for cases 1 and 2. Merging all index files in memory into a static index, which would suffice for case 1, should be rather easy to implement. A single sorted list will definitely be faster than the current index once it reaches a certain number of index files (O(log n) vs. O(n)). Dynamically adding full/finalized index files (case 2) does not easily combine with a single sorted blob list, as it would need to be resorted after each added index. Fully resorting an index with hundreds of millions of entries several times sounds like a rather bad idea to me. [EDIT] Just merging two already sorted lists also works and basically amounts to a full copy of the existing list. With a hundred million entries this most likely still takes a lot of time. [/EDIT] Dynamically growing such a list would also require some way to make room for new entries without having to keep a full copy of both the old and the new list buffer in memory.

The alternatives to a sorted list that come to my mind right now would be some sort of tree or a hashmap. The former would probably require some additional pointers in the data structure, and the latter would be rather similar to the current implementation. I've thought about replacing the MasterIndex with a single large hashmap, but right now the growth strategy for the backing buffer of the hashmap kept me from trying: Go doubles the backing buffer size once a certain load factor is reached, which in the end might require temporarily keeping a rather empty new backing buffer in memory along with the old one. This scenario would probably require much more memory than the current implementation. This should be solvable by using two levels of hashmaps where the first one is indexed using the first few bits of the blob ID. I'm just not sure how to efficiently handle blob IDs which are not uniformly distributed and thus could still lead to large imbalances in the second-level hashmaps. It might be worthwhile to look at the index data structures used by databases or borg-backup.
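For illustration, a minimal sketch of the two-level idea, partitioning by the first byte of the blob ID (purely hypothetical, not restic code; restic.ID and restic.PackedBlob are from internal/restic):

// Partitioned map: the first level is indexed by the first byte of the blob
// ID, so each second-level map only grows to roughly 1/256 of the total size,
// which limits the memory spike when Go resizes a map's backing buffer.
type partitionedIndex struct {
	parts [256]map[restic.ID]restic.PackedBlob
}

func (p *partitionedIndex) insert(pb restic.PackedBlob) {
	i := pb.ID[0]
	if p.parts[i] == nil {
		p.parts[i] = make(map[restic.ID]restic.PackedBlob)
	}
	p.parts[i][pb.ID] = pb
}

func (p *partitionedIndex) lookup(id restic.ID) (restic.PackedBlob, bool) {
	pb, ok := p.parts[id[0]][id]
	return pb, ok
}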

Case 3 (prune/rebuild-index/find) is handled by index/Index.New right now, so this shouldn't collide with changes to the MasterIndex.

The ListPack methods on Index/MasterIndex look like dead code to me; restic still compiles when I remove them (only the Index test needs some tweaks). The only usages of a ListPack method I could find are on the Repository struct. This would make SortByPack unnecessary and avoid the complexity associated with resorting.

Searching for packs in a sorted-by-blob list would work, but would probably be a bit slower than the current implementation. The current implementation iterates over all indices and stops once an index containing the pack is found, which on average requires traversing half of the list. Scanning the sorted-by-blob list would always require a full scan. On the other hand, that scan should be faster, so the overall performance should be more or less comparable.

@MichaelEischer (Member) commented Jan 26, 2020

@aawsome So right now the focus of the changes would be to optimize the memory usage of the index along with the masterIndex performance, while deferring fundamental changes like using an on-disk database to later?

Regarding case 3: Do you plan to also replace index/Index, which is grouped by pack ID? My current understanding of the restic code is that this would be the only potential user of ListPack.
Right now I can only find two active uses of index/Index.Packs:

  • cmd_list/runList, which just needs some kind of blob list and doesn't really care about sorting
  • cmd_prune/pruneRepository, which uses it for statistics, to iterate over all blobs to find duplicates, and to decide which packs to rewrite or remove. Only the last usage is a bit harder to avoid, as it would require some additional hashmaps to keep track of used blobs per pack.

Your "Optimize Index" merge request would add another usage which actually requires an index grouped by pack files.

Regarding an optimized MasterIndex: My last comment mostly looked at using a single large index data structure, which would probably need partitioning by key to avoid large spikes in memory usage when adding new entries. That way of partitioning has the downside that adding a new index likely requires touching every partition. In addition, imbalances in the partition sizes could cause further problems.

However, there is also another option (slightly inspired by log-structured merge trees): keep the number of index parts in the MasterIndex roughly constant at a value c. This would mean that a new index would, after finalizing, be merged into one of the existing index parts. That way only a part of the index is modified at a time. For hash maps this could be used to keep the load factor rather high, as each part would be filled up before using the next one. For sorted arrays some other strategy sounds more reasonable: merge index parts into larger ones with roughly 1k, 2k, 4k, 8k, ... entries, while only merging the currently largest size class once a certain number of index parts of that size exists. This should provide O(log n) complexity for incrementally adding new index files. Depending on the constant c, either adding new index files is rather cheap (for small index parts) or lookup is fast (low number of index parts).

@aawsome (Contributor, Author) commented Jan 26, 2020

@aawsome So right now the focus of the changes would be to optimize the memory usage of the index along with the masterIndex performance, while deferring fundamental changes like using an on-disk database to later?

Yes, that is my focus. We can discuss adding options to the backup command to only keep tree blobs in the index (and thus naively save every new data blob of the snapshot), or to only load the data blobs used in the previous snapshot into the index (and thus only deduplicate against blobs of the previous snapshot). This would allow low-memory clients to back up to large repositories, followed by prune operations on machines with plenty of memory to do the full deduplication.
I will postpone my work on a disk-based index.

Regarding case 3: Do you plan to also replace index/Index, which is grouped by pack ID?

I do think that this is possible and it should be the goal of the new implementation. However, I haven't analyzed it in detail yet - thanks for your hints!

About the implementation details: I'll need a bit of time to read and think about them. Thank you very much for your ideas and support!

@aawsome (Contributor, Author) commented Jan 28, 2020

As announced, I prepared an early version of the refactored index, see refactor-index.
Some internals are also slightly changed to make use of the new index. For instance, check should now require much less memory than before.

It's still WIP, but almost all tests already pass and from my tests it looks pretty good.
Here are the open points:

  • Fix tests for index (this part doesn't compile yet)
  • refactor rebuild-index and prune to use the new MasterIndex and the new in-memory index
  • performance tests

The implementation uses one big (sorted) array to store the index and the IndexJSON format to keep items created in the current run.
When backing up, the IndexJSON is regularly stored to the repository and added to the main in-memory index.
When loading the index from the repository, each index file is loaded into the IndexJSON format and then added to the main in-memory index.

Hence, the crucial function is AddJSONIndex in internal/index/index_new.go, starting at line 278.

About your comments, @MichaelEischer: I cannot really estimate the impact of appending to the indexTable, see line 329 in internal/index/index_new.go. Does this mean that the whole table has to be copied regularly (you mentioned the "memory spikes")? Or is this handled nicely by the memory management of Go and the operating system?

Also, after inserting, the table needs to be sorted again (at least before it is searched for the first time). I guess that the current implementation should avoid unnecessary sort operations, but this needs performance tests. Can anybody help with these tests?

@dimejo (Contributor) commented Jan 28, 2020

@aawsome:

Can anybody help with these tests?

I'm willing to help with performance testing. What exactly do you want to have tested?

@aawsome (Contributor, Author) commented Jan 28, 2020

@dimejo:
Thank you very much for your help!

I would prefer 1:1 comparisons (i.e. identical runs with both branches) between the master branch (or the last release) and my branch.
I'm interested in differences in memory usage and performance (total time and CPU usage) for large repositories (many files, many snapshots, etc.). Using /usr/bin/time -v is OK; maybe enabling profiling (memory and CPU profiling) can help to explain differences better.
The backend doesn't matter at all - a local backend is fine but not a must.

I think at least the following commands should be tested:

  • backup: 1. an initial run, 2. another run with (almost) unchanged files, 3. another run with --force
  • check
  • restore
  • maybe mount
  • list blobs
  • cat blob
  • find --blob

[edit: added list, cat and find commands]

Please write if you don't understand what I mean or if you need more detailed instructions on how to run these tests!

@dimejo (Contributor) commented Jan 29, 2020

I'll try to do all the tests this weekend, but I need some help enabling profiling. I'm not a programmer, and following advice on the internet didn't work out that well :/

@aawsome (Contributor, Author) commented Jan 29, 2020

You can start with
/usr/bin/time -v restic <command>

To enable profiling, first compile restic with the debug option:
go run build.go --tags debug
and then use the flag --mem-profile or --cpu-profile, e.g.:
restic --mem-profile . <command>

This creates a .pprof file that can be analyzed with go tool pprof, e.g.:
go tool pprof -pdf mem.pprof restic
returns a nice pdf.

@dimejo (Contributor) commented Jan 29, 2020

Awesome! I thought it would be much more complex to enable profiling. Will try it out and report back.

@MichaelEischer (Member):

@aawsome Thanks for your work, that's really a lot of changes.

I did some test runs with the check command and something strange happens. Using the restic master the command completes in 6 minutes (327.25 real 613.08 user 208.48 sys), with the index changes it takes more than an hour (4516.77 real 896.17 user 531.71 sys). The time is spent somewhere after printing "check snapshots, trees and blobs", but didn't have time to take a closer look yet.

In my opinion your commit with the index changes is far too large to be reviewable; a thousand lines of added code along with two thousand deletions is really a lot.
I've noticed at least four individual changes:

  • Index.Lookup no longer returns a list, but just a blob instead
  • A lot of index handling code is extracted from the Repository
  • The masterIndex only keeps a single unfinished index
  • The actual low memory index

I have some (incomplete) comments on the code itself, but I haven't thought much about the overall code architecture yet:

struct IndexLowMem2.byFile: It took me quite some time to understand the variable name. Please change it to byPackFile or similar to be more descriptive.

IndexLowMem2/FindPack: My understanding of restic.IDs.Find is that it expects a sorted list of IDs. However, idx.packTable does not seem to be sorted as this would break the packIndex in blobWithoutType.

Race condition in index_new/FindBlob: Index might be resorted between the call to SortByBlob and reacquiring the read lock. Currently I don't see a nice way how this could be fixed when using a RWMutex. The same problem also exists in FindBlobByPack.
IndexLowMem2/Lookup: The call to FindBlob and accessing the blobTable is not atomic.

IndexLowMem2/AddJSONIndex: Calls to append for a slice allocate a new buffer when the old one is too small and then copy all data over to the new buffer. As the method first collects the new packTable/indexTable entries and then appends them in one go, this shouldn't be too inefficient.
The packTable slices in idx.byFile therefore will point to old versions of the packTable array when append allocates a new buffer. This will probably waste quite a lot of memory for large packCounts. You could just add an endPackIndex to FileInfo in addition to startPackIndex and then directly access the packTable.
Appending to it.blobTable looks a bit expensive as this requires sorting later on. It should be possible to presort the new table entries and then just merge the two sorted lists (however, merging in place might be a bit tricky).

MasterIndex.Lookup/Has/Count/... : After the lookup in mi.mainIndex, mi.idx might be saved and added to the mainIndex before the idxMutex is acquired.

MasterIndex.Save: This seems to block the masterIndex while the index upload is pending. The old implementation in Repository.SaveIndex seems to work without blocking.

@MichaelEischer (Member):

Ah, I've found the problem: right now the call to PrepareCache is commented out. PrepareCache also configures the PerformReadahead function of the Cache. Without this, check has to load every single tree blob separately from the backend.

@aawsome (Contributor, Author) commented Jan 31, 2020

@MichaelEischer: Thank you very much for your very valuable comments at this early stage of the implementation 👍

I do know that it is quite a lot of code change - however I think it is also a simplification of the current index implementation. Hence the large number of deleted files/code lines.

Sorry about the PrepareCache issue - I recall commenting it out to handle it later, but so far I had only done local backend tests and therefore the issue didn't come up.

About your comments on the byPack stuff: you are right, this is WIP and not used yet. I left it in because I intended to use it later for rebuild-index and prune. However, I will first focus on getting the main use case to work - maybe I'll remove this part before making a PR.

About the sorting: right now the code is supposed to sort only when needed, i.e. within the first call to Lookup or Has. This means that while loading many index files into the in-memory index (which is the usual operation at the beginning) no sorting takes place; it only happens once all index files are loaded. During backup - when more index files are written - entries are added to the in-memory index, and a resort is then needed each time. This, however, is a resort of a mostly sorted list. I don't know how sort.Sort handles almost-sorted lists, but usually sort algorithms are chosen that are very efficient for almost-sorted input.

I'll take care of the other issues - thanks especially for finding the race conditions!

@rawtaz (Contributor) commented Jan 31, 2020

In my opinion your commit with the index changes is far too large to be reviewable; a thousand lines of added code along with two thousand deletions is really a lot.

This.

I do know that it is quite a lot of code change - however I think it is also a simplification of the current index implementation. Hence the large number of deleted files/code lines.

This is when you should split your changes into separate, individual PRs, so they're manageable and have a clear intent.

@aawsome (Contributor, Author) commented Jan 31, 2020

@rawtaz:
The clear intent is a redesign of the index used in restic, to get rid of the problems with the current implementation.

IMO the change is not that big considering that it is a complete re-implementation of the index (and please note: it also re-implements some legacy stuff, see index/index_old_format.go).

However, this is WIP and has some issues I intend to fix before thinking about how to best integrate this work into the master branch.
I'm looking forward to ideas on how to split it into smaller parts - can you please give me directions on which changes could be accepted as PRs?

@aawsome (Contributor, Author) commented Jan 31, 2020

I did some test runs with the check command and something strange happens. Using the restic master the command completes in 6 minutes (327.25 real 613.08 user 208.48 sys), with the index changes it takes more than an hour (4516.77 real 896.17 user 531.71 sys). The time is spent somewhere after printing "check snapshots, trees and blobs", but didn't have time to take a closer look yet.

This should be fixed now.

struct IndexLowMem2.byFile: It took me quite some time to understand the variable name. Please change it to byPackFile or similar to be more descriptive.

Is changed now.

IndexLowMem2/FindPack: My understanding of restic.IDs.Find is that it expects a sorted list of IDs. However, idx.packTable does not seem to be sorted as this would break the packIndex in blobWithoutType.

The whole byPack logic was unused, and you correctly pointed out that it also contained some errors. I removed this dead code.

Race condition in index_new/FindBlob: Index might be resorted between the call to SortByBlob and reacquiring the read lock. Currently I don't see a nice way how this could be fixed when using a RWMutex. The same problem also exists in FindBlobByPack.

I hope I found a nice way to handle this: I basically test sortedByBlob twice if it is set to false. This issue should hence be fixed now.

IndexLowMem2/Lookup: The call to FindBlob and accessing the blobTable is not atomic.

This is now done using the new atomic indexTable.Get.

The packTable slices in idx.byFile therefore will point to old versions of the packTable array when append allocates a new buffer. This will probably waste quite a lot of memory for large packCounts. You could just add an endPackIndex to FileInfo in addition to startPackIndex and then directly access the packTable.

Is fixed now. packTable was so far not used...

MasterIndex.Lookup/Has/Count/... : After the lookup in mi.mainIndex, mi.idx might be saved and added to the mainIndex before the idxMutex is acquired.

This is fixed now.

MasterIndex.Save: This seems to block the masterIndex while the index upload is pending. The old implementation in Repository.SaveIndex seems to work without blocking.

This issue is still open. I did not find a nice way to handle it with just one unfinished index.
I also do not know the performance impact of this; it should only affect backup. I hope some tests will show whether this is relevant.

@MichaelEischer (Member):

I did some test runs with the check command and something strange happens. Using the restic master the command completes in 6 minutes (327.25 real 613.08 user 208.48 sys), with the index changes it takes more than an hour (4516.77 real 896.17 user 531.71 sys). The time is spent somewhere after printing "check snapshots, trees and blobs", but didn't have time to take a closer look yet.

This should be fixed now.

I'll give it a try in the next few days.

Race condition in index_new/FindBlob: Index might be resorted between the call to SortByBlob and reacquiring the read lock. Currently I don't see a nice way how this could be fixed when using a RWMutex. The same problem also exists in FindBlobByPack.

I hope I found a nice way to handle this: I basically test sortedByBlob twice if it is set to false. This issue should hence be fixed now.

This still leaves the possibility of a collision between SortByBlob and AddJSONIndex. The check that the index is sorted properly and the actual index search must happen in the same critical section. Unfortunately, the RWMutex does not support an atomic downgrade from a write lock to a read lock, which would be the easiest way to keep the lock after sorting. I currently see two possibilities:

  • Take the readlock, check that the index is sorted and run the search. If the index is not sorted, release the readlock and get the writelock, sort the index and complete just that search while holding the writelock.
  • Take the readlock, check that the index is sorted and run the search. If the index is not sorted, release the readlock and get the writelock, sort the index and retry from the start.

MasterIndex.Lookup/Has/Count/... : After the lookup in mi.mainIndex, mi.idx might be saved and added to the mainIndex before the idxMutex is acquired.

It seems like you've missed MasterIndex.Lookup.

MasterIndex.Save: This seems to block the masterIndex while the index upload is pending. The old implementation in Repository.SaveIndex seems to work without blocking.

This issue is still open. I did not find a nice way to handle it with just one unfinished index.
I also do not know the performance impact of this; it should only affect backup. I hope some tests will show whether this is relevant.

Hmm, the main blocker here is that you can't get the index pack id before the upload call completed. It would be rather easy to solve by allowing multiple indexes to be unfinished while only the last one may be written to and the predecessors are currently uploading.

@aawsome (Contributor, Author) commented Feb 3, 2020

This still leaves the possibility of a collision between SortByBlob and AddJSONIndex. The check that the index is sorted properly and the actual index search must happen in the same critical section. Unfortunately, the RWMutex does not support an atomic downgrade from a write lock to a read lock, which would be the easiest way to keep the lock after sorting. I currently see two possibilities:

* Take the readlock, check that the index is sorted and run the search. If the index is not sorted, release the readlock and get the writelock, sort the index and complete just that search while holding the writelock.

* Take the readlock, check that the index is sorted and run the search. If the index is not sorted, release the readlock and get the writelock, sort the index and retry from the start.

Oops - I was in fact so focused on SortByBlob itself that I didn't consider the methods it is called from :-(
Thanks for your suggestions! I used the second option now.
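For reference, a rough sketch of how that second option can look (illustrative code, not the actual commit; it assumes a sync.RWMutex, a sortedByBlob flag, a blobTable sorted by blob ID, and a hypothetical searchByBlob helper, and uses the standard sort and bytes packages):

// Check under the read lock; if the index is unsorted, upgrade to the write
// lock, sort, release, and retry from the start.
func (idx *Index) lookupSorted(id restic.ID) (blobEntry, bool) {
	for {
		idx.mutex.RLock()
		if idx.sortedByBlob {
			res, found := idx.searchByBlob(id) // binary search; we hold the read lock
			idx.mutex.RUnlock()
			return res, found
		}
		idx.mutex.RUnlock()

		idx.mutex.Lock()
		if !idx.sortedByBlob { // re-check: another goroutine may have sorted already
			sort.Slice(idx.blobTable, func(i, j int) bool {
				return bytes.Compare(idx.blobTable[i].id[:], idx.blobTable[j].id[:]) < 0
			})
			idx.sortedByBlob = true
		}
		idx.mutex.Unlock()
		// retry from the start with the read lock
	}
}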

It seems like you've missed MasterIndex.Lookup.

Is fixed now.

Hmm, the main blocker here is that you can't get the index pack id before the upload call completed. It would be rather easy to solve by allowing multiple indexes to be unfinished while only the last one may be written to and the predecessors are currently uploading.

I also fixed this. I now encode/encrypt the index and write it to the main index while holding the lock, but write the file to the backend after releasing the lock.

Moreover, I found another optimization that reduces memory consumption quite a lot (and also fixes the append problem): I replaced the large array with a structure that implements a "paged" list of blobs. A list of "pages" is maintained, and if it needs to be enlarged, just one new page is allocated. This allows appending without recopying all old elements. Also, the overhead is fixed and at most one page, whereas with append you get quite a large overhead (the array gets an increasing "capacity reserve" to ensure it doesn't need to recopy too often). This optimization seems to save another ~15-20% of memory.
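A minimal sketch of such a paged list (page size, names, and the blobEntry element type are illustrative, not the actual code):

const pageSize = 1 << 16 // entries per page; illustrative value

// pagedList grows by allocating one fixed-size page at a time, so appending
// never copies old entries and the unused overhead is at most one page.
type pagedList struct {
	pages [][]blobEntry
	n     int // total number of stored entries
}

func (l *pagedList) add(e blobEntry) {
	if l.n%pageSize == 0 {
		l.pages = append(l.pages, make([]blobEntry, 0, pageSize))
	}
	last := len(l.pages) - 1
	l.pages[last] = append(l.pages[last], e)
	l.n++
}

func (l *pagedList) get(i int) blobEntry {
	return l.pages[i/pageSize][i%pageSize]
}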

I'm happy to get feedback, and I would even more like to see performance tests with large repositories.

@aawsome (Contributor, Author) commented Feb 4, 2020

Here are the results of my first comparison.
I used a data dir with 550,000 files (resulting in 550,000 data blobs), created with the following script:

#!/bin/sh
mkdir data
for i in `seq 1 550`; do
        echo $i
        mkdir data/$i
        for j in `seq 1 1000`; do
                echo $i$j > data/$i/$j
        done
done

Then I backed up this data dir three times and ran check afterwards:

/usr/bin/time restic -r /path/to/repo --mem-profile . backup /path/to/data
/usr/bin/time restic -r /path/to/repo --mem-profile . backup /path/to/data
/usr/bin/time restic -r /path/to/repo --mem-profile . backup /path/to/data -f
/usr/bin/time restic -r /path/to/repo --mem-profile . check

I ran these steps once with the master branch (compiled to restic.old) and once with my branch (compiled to restic.new).
I just used my old laptop with a local SSD.

Here are the results:
time_backup1_old.txt
mem_backup1_old.pdf
time_backup1_new.txt
mem_backup1_new.pdf

time_backup2_old.txt
mem_backup2_old.pdf
time_backup2_new.txt
mem_backup2_new.pdf

time_backup3_old.txt
mem_backup2_old.pdf
time_backup3_new.txt
mem_backup3_new.pdf

time_check_old.txt
mem_check_old.pdf
time_check_new.txt
mem_check_new.pdf

To summarize:

  • The new index implementation uses only 22% of the memory used by the index implementation in master (!!!)
  • With check, index-related memory usage is decreased even more (the IDSet was removed)
  • With the new implementation, the index is no longer the single outstanding memory consumer, which makes optimizations of other parts worthwhile.
  • Speed and CPU usage were comparable. The only case where the new implementation was a bit slower was the third backup run, which relies heavily on index lookups. However, we are talking about a newly generated repository with only a few index files, so the main drawback of the master index implementation (looping over all index files) didn't play a big role here. I believe that things change completely with many index files and that the new implementation will then clearly outperform the current one.

@dimejo (Contributor) commented Feb 4, 2020

I'm finally done with my tests. They took longer than estimated because I had to re-run them due to some mysterious results. I wanted to compare results between 1 and multiple cores. Interestingly, the VM with only 1 vCPU was a lot faster for almost all operations. Maybe @fd0 knows what's going on.

results for VM1

1 vCPU
2GB RAM

Note that I've made 1 CPU and 1 memory profiling run for each restic version. Time shown (in seconds) in this table is the average for both runs.

Operation              restic v0.9.6   restic new-index   difference
init repo                       2.24               2.27       +1.12%
1st backup                   1574.36            1542.80       -2.00%
2nd backup                    556.95             541.71       -2.74%
3rd backup (--force)         1192.90            1195.53       +0.22%
forget                          0.51               0.52       +2.97%
prune                          43.87              44.02       +0.34%
check                          30.77              31.59       +2.68%
list blobs                      2.92               3.36      +14.90%
cat blob                        2.58               2.60       +0.97%
find blob                      22.86              21.17       -7.39%
restore                       895.01             883.57       -1.28%

results for VM8

8 vCPU
32GB RAM

Note that I've made 1 CPU and 1 memory profiling run for each restic version. Time shown (in seconds) in this table is the average for both runs.

Operation              restic v0.9.6   restic new-index   difference
init repo                       2.11               2.09       -0.95%
1st backup                   1894.47            1832.65       -3.26%
2nd backup                    827.95             776.38       -6.23%
3rd backup (--force)         1414.60            1411.98       -0.19%
forget                          0.60               0.56       -6.72%
prune                          93.10              89.84       -3.50%
check                          70.22              68.44       -2.53%
list blobs                      3.86               7.24      +87.68%
cat blob                        3.88               3.69       -4.90%
find blob                      30.95              29.11       -5.95%
restore                      1150.49            1089.66       -5.29%

Additional information

$ uname -r
5.4.15-200.fc31.x86_64
$ restic-orig version
debug enabled
restic 0.9.6 (v0.9.6-40-gd70a4a93) compiled with go1.13.6 on linux/amd64
$ restic-newindex version
debug enabled
restic 0.9.6 (v0.9.6-43-g5db7c80f) compiled with go1.13.6 on linux/amd64

Files changed/added for backup 1:

Files:       487211 new,     0 changed,     0 unmodified
Dirs:            2 new,     0 changed,     0 unmodified
Added to the repo: 62.069 GiB

Files changed/added for backup 2:

Files:       166940 new,     0 changed, 487211 unmodified
Dirs:            0 new,     2 changed,     0 unmodified
Added to the repo: 6.805 GiB

Files changed/added for backup 3:

Files:       654215 new,     0 changed,     0 unmodified
Dirs:            2 new,     0 changed,     0 unmodified
Added to the repo: 4.029 MiB

Logs

logs_vm1_cpu.zip
logs_vm1_mem.zip
logs_vm8_cpu.zip
logs_vm8_mem.zip

@aawsome (Contributor, Author) commented Feb 4, 2020

@dimejo Thank you very much for your tests!

First, regarding your question about the increased time for the 8 CPU setting: you are comparing the "user time", i.e. the total CPU seconds used. With parallel processing this is (and should be) higher than with a single CPU, as there is always some overhead due to parallelization. In fact, your 8 CPU runs were much faster - see the elapsed time.

You used commit 5db7c80f for your new-index tests. Would it be possible for you to retest with the latest commit 26ec33dd? There should be another decrease in memory usage and (if I did everything right) maybe even better performance.

In general I'm quite satisfied with these results! It seems my implementation uses much less memory while still having performance similar to the original index implementation. (BTW: I'll fix the obviously open issue with list blobs.)

@aawsome (Contributor, Author) commented Feb 14, 2020

@seqizz Thanks for sending the bug report. I have to admit that I only tested the last commit with go run build.go -T, which didn't show any issues... I fixed it in commit 5b4009f.

@rawtaz (Contributor) commented Feb 17, 2020

@aawsome Thanks for writing that summary comment. One of the things I, as a non-core developer (I'm just a maintainer of simple things), experience is that while core developer time is spent on other parts of the project, issues like this one just keep growing. Then, whenever a core developer looks at it, it's a lot, or in some cases too much, to dig into at the time. The less there is to read for them, the higher the chances that we can make good use of their time. I took the liberty of editing the initial post in this issue to add a reference to your summary comment.

Regarding your questions, I personally cannot answer them. I will have to defer them a while, and I kindly ask for your patience. Our main focus right now, with the available time we have, is some other stuff, but I can tell you that this issue and your PR are on my radar and I am trying to get eyes on them eventually.

I think that with your summary comment, besides fixing bugs (I'm slightly concerned that there might be some more small bugs lurking) and whatever you can find that doesn't deviate from what you wrote in your comment, the most productive thing is to let this rest a bit. The next step would be to get core developer eyes on it, and until that happens it's probably counter-productive to grow the discussion further. Let me know if my rationale doesn't seem to make sense, so I can clarify what I mean.

@MichaelEischer (Member):

I finally found time to test the performance of the backup command on an 8 TB repository with 40 million files. To keep the overall runtime low, I just backed up a 6 GB directory. Restic's CPU usage was limited using GOMAXPROCS=1.

current master: cpu-idx-master.pdf
this pr: cpu-idx-table.pdf

With this PR restic uses around 6.5 GB of memory instead of 18 GB for the current master, which is a gigantic improvement. The index takes around 9 minutes to load in both cases (the host does not support AES-NI, so decryption is a bit slow). The backup takes 34 minutes with this PR and 18 minutes with the current master. A look at the CPU profiling graphs shows the difference: the indexTable spends 16 minutes just sorting the in-memory index. The index is large enough that a single sort call for the in-memory index takes about 75 seconds (!) to complete. During this time the whole backup processing is blocked, as the index is locked exclusively.

The in-memory index probably needs a way to merge a new index without resorting everything, maybe even implemented incrementally. This should be possible by sorting the new index beforehand and then merging it into the MasterIndex with a single scan over the indexTable.
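For illustration, such a linear merge could look roughly like the following hypothetical helper, which operates on slices sorted by blob ID (restic.PackedBlob is from internal/restic; the bytes package is used for comparison):

// mergeSorted merges two slices that are each sorted by blob ID into one
// sorted slice with a single linear scan, avoiding a full re-sort.
func mergeSorted(a, b []restic.PackedBlob) []restic.PackedBlob {
	out := make([]restic.PackedBlob, 0, len(a)+len(b))
	for len(a) > 0 && len(b) > 0 {
		if bytes.Compare(a[0].ID[:], b[0].ID[:]) <= 0 {
			out = append(out, a[0])
			a = a[1:]
		} else {
			out = append(out, b[0])
			b = b[1:]
		}
	}
	out = append(out, a...)
	return append(out, b...)
}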

@aawsome (Contributor, Author) commented Mar 7, 2020

@MichaelEischer Thank you very much for doing this test! In fact, the current implementation needs a resort of the index table each time an index file is written. So if I calculated correctly, 13 new index files were created, right?
The good news is that the lookups in IndexJSON (which is unsorted and thus requires linear search) do not show up at all, so this seems to be a good way to handle "unfinished" index files.

Actually, my judgement is that resorting an almost-sorted table with sort.Sort should be close to optimal. (However, sort.Stable or a pure mergesort would still be worth a try...) I do think the main issue is this: if we want one sorted index array and insert things into it, we need to move ~6.5 GB of data around in memory.
So I propose to separate the index entries read from index files from those newly generated (and already saved). This makes three data structures in the MasterIndex implementation. I have already changed this in my refactor-index branch. Would you mind re-testing your setup?

@MichaelEischer (Member) commented Mar 8, 2020

When I zoom into the profile data, IndexJSON takes 1.22 seconds in total, so this seems fine for now. The worst case would be a lot of small files, in which case a map might be faster.

The backup run actually created 16 new index files. The 75 seconds were a guess from a check run, which only triggers re-sorting once. Both quicksort (used by sort.Sort) and mergesort have a best-case time complexity of O(n log n), so presorted data does not improve performance. As a side note, the in-memory index requires less space than the on-disk representation ^^.

I've rerun the test with the updated code and the performance is a lot better: cpu-idx-table-split.pdf

However, using a separate index for new index entries merely hides but does not solve the underlying problem: just imagine a backup run that adds a few million new files, or a large initial backup. The current implementation has O(n * n log n) complexity. The first n relates to the number of appends to the index, which scales roughly linearly with the amount of new backup data. The latter part, n log n, is the already mentioned complexity of sorting the index. By replacing quicksort with a linear merge this is reduced to O(n * n), which could be good enough for index sizes below several dozen gigabytes. Beyond that it won't scale, but I guess petabyte repositories are a problem of their own. The linear merge could even be implemented to work incrementally. I have an idea how to reduce the complexity down to O(n * log n), but that would increase the lookup costs to O(log^2 n) and require some extensive code changes.

I wonder whether there is a data structure that would perfectly fit our needs:

  • it should have low memory overhead
  • amortized sub-linear query speed, preferably log n or less
  • amortized sub-linear insertion costs, preferably log n or less. With n insertions this would then lead to costs of O(n * log n) for a whole backup run

When dropping the requirement for a really compact in-memory representation, hashmaps are a perfect fit, as their queries and updates happen in O(1). However, they require a lot more memory: my current prototype uses about 70% more memory than this PR.

One solution would be to combine both approaches and collect new entries in a hashmap and merge them into the sorted array from time to time (e.g. when the hashmap's size reaches 10% of the sorted array).
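A rough sketch of that combination (the 10% threshold and all names are illustrative; it reuses the mergeSorted helper sketched in an earlier comment and the standard sort and bytes packages):

// hybridIndex keeps the bulk of the entries in a sorted slice and collects new
// entries in a map; the map is merged into the slice with one linear merge
// once it reaches roughly 10% of the slice's size.
type hybridIndex struct {
	sorted []restic.PackedBlob             // sorted by blob ID
	recent map[restic.ID]restic.PackedBlob // newly added entries
}

func (h *hybridIndex) add(pb restic.PackedBlob) {
	h.recent[pb.ID] = pb
	if len(h.recent)*10 < len(h.sorted) {
		return // keep collecting until the map is large enough to merge
	}
	batch := make([]restic.PackedBlob, 0, len(h.recent))
	for _, v := range h.recent {
		batch = append(batch, v)
	}
	sort.Slice(batch, func(i, j int) bool {
		return bytes.Compare(batch[i].ID[:], batch[j].ID[:]) < 0
	})
	h.sorted = mergeSorted(h.sorted, batch) // linear merge as sketched above
	h.recent = make(map[restic.ID]restic.PackedBlob)
}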

That said, it is probably best to wait for feedback from @fd0 before making any extensive changes. (Sorry for growing the discussion again.)

@dumblob commented Apr 16, 2020

I wonder whether there is a data structure that would perfectly fit our needs:

I'd say there are "many" data structures more or less satisfying these needs, but I don't know how much effort it would be to switch to one of them in restic. Basically any B-tree, or tree structures in general, should satisfy these needs (hash maps are really inefficient for such scenarios, as you also found out yourself in your measurements).

@MichaelEischer (Member):

If I'm not mistaken, a B-tree only guarantees that its nodes are at least half full. And it would require pointers, which would more or less add an 8-byte pointer for every 48 bytes of data. If a B-tree guarantees 70% space usage on average, it would end up with a memory overhead similar to my hashmap-based implementation. There might be some variants like B*-trees which are more efficient, though.

@dumblob commented Apr 17, 2020

@MichaelEischer yeah, that's the "old traditional B-tree" - as you wrote, there are many (I've seen at least ten quite different ones so far) B-tree-like structures, each with different properties. Other well-known tree-like structures include e.g. red-black trees (popular for maps which must guarantee consistent insertion/deletion time & space; this is in contrast to hash-based maps, which might suffer from "jumps" or other excessive peaks in both time & space, eventually leading to severe security/performance problems). There are also various "compressed" maps - feel free to search for them. So there really are a number of options.

Though I'm not against using a small temporary file, as discussed in this topic and as e.g. sqlite does - BTW, sqlite is highly efficient; I've used it as storage precisely for its extremely low memory usage while offering quite high throughput despite using hard disks as a temporary cache - feel free to try it out, it might surprise you.

Just my 2 cents 😉.

@MichaelEischer (Member):

The sorted, densely packed array of index entries is basically the gold standard for memory usage. Yes, you could apply some compression, but that would only help with the blob size/offset and pack ID index (approx. 16 bytes); the remaining 32 bytes are the blob ID (a SHA-256 hash), which I'd assume to be basically incompressible. I'm not sure whether the potential for another 20-30% lower memory usage would justify the even higher complexity.

Trees always require one pointer (plus some unused space, as with B-trees) or two pointers (red-black trees and others) per element, which would already add 8 or 16 bytes of overhead to the approx. 48 bytes of the index entry itself. I'm also not sure how well the Go garbage collector would handle millions or even billions of individual index entries.

To save the honor of hash tables: cuckoo hashing (constant lookup time) with 4 entries per bucket is said to achieve over 90% space utilization. If you just use ten of these hash tables, then the total memory overhead would also stay below 33%, even while growing a hashmap.

I think the main question is whether we want a single data structure (tree/hashmap) with a certain memory overhead, or the lower overhead of a sorted, packed array, which will need an auxiliary data structure to provide reasonable insertion speed. If that auxiliary data structure only contains 10-20% of the overall index data, then it won't matter much whether it has 30% or 70% overhead, and in that case Go's builtin hashmap would be the simplest solution.

Using an on-disk index has its own share of problems: if the index is too large to be kept in memory, then index performance will probably implode on HDDs (SSDs should be much better off) unless the index is able to benefit from locality between the blobs of e.g. one folder/file (in a fully sorted list of blob IDs, the access pattern would basically be uniformly random).

@dumblob commented Apr 23, 2020

@MichaelEischer thanks for the wrap-up. I'd still encourage you, though, to try a disk cache (just to get a glimpse of the possible performance, not to be incorporated into restic) using sqlite3 with different setups (e.g. both with and without an index, with higher max-memory limits than the very low defaults, using just one table as a key-value map-like storage, etc.), while leveraging some sqlite3 peculiarities like the builtin integer primary key, multithreaded access without explicit locking (sqlite handles it itself), using one huge transaction for all accesses, and not using WAL, as it's significantly slower for reads (but slightly faster for writes).

Using a disk cache might also benefit quite significantly from stream-oriented processing, but I don't know whether that's a good fit for a quick tryout. Just an idea based on my good experience with sqlite3-based key-value storage 😉.

@MichaelEischer (Member):

@aawsome Could you split your "list blobs" optimization into a separate PR?

I don't plan to merge the blob IDSet removal from checker.go as #2328 also provides that optimization but without having to involve the master index.

That leaves LookupAll as the last minor change. That function would be useful when adding support for damaged repositories to prune. However, without some index performance optimizations first, it will probably just reduce the overall performance of restic. So this will have to wait.

Regarding the major changes, these also have to stay on hold for now.

@dumblob The problem with using sqlite or any other generic database for an on-disk index is that, without optimizations that make use of locality between the blobs of a file (i.e. blobs within a file end up in nearby packs and are usually accessed together), we'll end up with pseudo-random access to all parts of an index which is larger than main memory and therefore has to load data from random locations on disk. That is, a generic database implementation has no chance at all of providing good performance.

@aawsome (Contributor, Author) commented Jun 14, 2020

@aawsome Could you split your "list blobs" optimization into a separate PR?

Sure. PR will come soon.

@aawsome (Contributor, Author) commented Jun 14, 2020

That leaves LookupAll as the last minor change. That function would be useful when adding support for damaged repositories to prune. However, without some index performance optimizations first, it will probably just reduce the overall performance of restic. So this will have to wait.

In fact, the current implementation of Lookup is (partly) a LookupAll in the sense that it returns all results within one index file, but only from the first index file that has a match.

As the usual case is that there are no duplicates within one index file, this effectively does not return duplicates, even in cases where they might be needed.

I will prepare a PR to change Lookup and add LookupAll where it is useful. Then we can discuss whether this is helpful or not.
