
Expose hash over ChunkListEntry to borg list --format {chunk_ids_$HASH} #5167

Draft · hrehfeld wants to merge 6 commits into master
Conversation

hrehfeld (Contributor) opened this pull request:

  • First time messing with the code; no tests yet, can somebody help write one?
  • Existing tests and flake8 give no warnings.
  • Not sure about the format key naming, suggestions?

Rationale: computing checksums is slow while borg diff is much faster, so I tried to expose in borg list a hash over the data that borg diff seems to use to compare items.
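
For illustration, usage would look something like this (a hypothetical invocation; sha256 is assumed here to be among the supported hash algorithms):

borg list --format '{path} {chunk_ids_sha256}{NL}' /path/to/repo::archive

Two files that show the same chunk ids hash (and, per the discussion below, were chunked with the same chunker params) can then be assumed to have identical content.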

hrehfeld changed the title from "Expose hash over ChunkListEntry data to borg list --format {chunk_ids_$HASH}" to "Expose hash over ChunkListEntry to borg list --format {chunk_ids_$HASH}" on Apr 29, 2020
codecov-io commented Apr 29, 2020

Codecov Report

Merging #5167 into master will decrease coverage by 0.11%.
The diff coverage is 44.44%.


@@            Coverage Diff             @@
##           master    #5167      +/-   ##
==========================================
- Coverage   83.98%   83.86%   -0.12%     
==========================================
  Files          37       37              
  Lines        9839     9854      +15     
  Branches     1638     1640       +2     
==========================================
+ Hits         8263     8264       +1     
- Misses       1097     1110      +13     
- Partials      479      480       +1     
Impacted Files                    Coverage Δ
src/borg/helpers/parseformat.py   87.66% <44.44%> (-1.36%) ⬇️
src/borg/archive.py               82.62% <0.00%> (-0.62%) ⬇️
src/borg/repository.py            84.29% <0.00%> (-0.19%) ⬇️
src/borg/archiver.py              81.39% <0.00%> (+0.35%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ThomasWaldmann (Member) left a comment:

Thanks for the PR!

Yeah, I see the use case, but I am somewhat concerned that a lot of people might use it incorrectly, draw wrong conclusions, and file support tickets if the outcome is not what they expected...

The problem is that chunk lists (or hashes of chunk lists) can only be compared if can_compare_chunk_ids is True (see do_diff). As we do not know how people will use the chunk list hashes that borg list displays, we can't be sure they are doing valid comparisons.

OTOH, chunker params usually do not change that often, and false positives (different files showing the same hash) are unlikely. But false negatives could happen (same file showing different hashes due to different chunker params, or even a different hash algorithm, as you let people choose it).
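
For context, a simplified sketch of the precondition do_diff checks (not the verbatim borg code; in 1.x the flag is derived from the archives' chunker_params metadata):

def can_compare_chunk_ids(archive1_meta, archive2_meta):
    # Chunk id lists are only comparable when both archives were
    # chunked with the same parameters.
    return archive1_meta.get('chunker_params') == archive2_meta.get('chunker_params')

def items_identical(item1_chunks, item2_chunks, comparable):
    # Only meaningful when `comparable` is True; otherwise identical
    # content may still yield different chunk lists (a false negative).
    return comparable and item1_chunks == item2_chunks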

hrehfeld (Contributor, Author) replied:

> Yeah, I see the use case, but I am somewhat concerned that a lot of people might use it incorrectly, draw wrong conclusions, and file support tickets if the outcome is not what they expected...

Good point. I will try to add a hashsum of the chunker parameters themselves as well, and add a note in the format key documentation that it is only valid to compare these hashes if the chunker parameters are equal.

> OTOH, chunker params usually do not change that often, and false positives (different files showing the same hash) are unlikely.

Is there anything here that could cause false positives besides the usual hash algo collisions? Are those more likely here?

> But false negatives could happen (same file showing different hashes due to different chunker params, or even a different hash algorithm, as you let people choose it).

Hmm, why can the choice of hash algo cause false negatives?

One more question: this is doing the correct thing, isn't it? If the chunker parameters are equal, equal checksums of the chunk ids would mean the content of the item/file is identical?

Can you maybe write a test for this, or point me to a test that I can copy and modify that tests similar functionality?

ThomasWaldmann (Member) commented Apr 30, 2020:

  1. I don't think a hash over the chunker params is useful. These are only a few values, and looking at them "as is" is simpler than looking at a hash of them.

  2. I don't think so.

  3. If you first create a list with hash algo x and later accidentally create a list with hash algo y, the hashes for the same file will be different. If the hash bit length differs, the reason might be pretty obvious; but if only the algorithm differs and the length is the same, it is not that obvious.

Yes, I think it is correct. The borg diff code has identical = (item1.chunks == item2.chunks) if can_compare_chunk_ids is True. And comparing hashes of lists is about as good as comparing the complete lists, except for the minimal risk of a hash collision.
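
In code, the idea looks roughly like this (a minimal sketch; chunks is assumed to be a list of (id, size, csize) entries like borg's ChunkListEntry, with id being bytes):

import hashlib

def chunks_hash(chunks, algo='sha256'):
    # Hash the concatenated chunk ids; comparing two such digests is as
    # good as comparing the chunk lists themselves, up to the negligible
    # risk of a hash collision.
    h = hashlib.new(algo)
    for chunk_id, size, csize in chunks:
        h.update(chunk_id)
    return h.hexdigest()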

@@ -688,7 +688,8 @@ class ItemFormatter(BaseFormatter):
         ('size', 'csize', 'dsize', 'dcsize', 'num_chunks', 'unique_chunks'),
         ('mtime', 'ctime', 'atime', 'isomtime', 'isoctime', 'isoatime'),
         tuple(sorted(hash_algorithms)),
-        tuple(('chunk_ids_%s' % alg for alg in sorted(hash_algorithms))),
+        tuple(['chunk_ids_%s' % alg for alg in sorted(hash_algorithms)] +
+              ['chunker_params_%s' % alg for alg in sorted(hash_algorithms)]),
ThomasWaldmann (Member) commented on the diff:
The chunker params are the same within one archive, so I guess you do not want to show them for each file?

hrehfeld (Contributor, Author) replied:
Hmm, true, but that would make parsing the output of list harder. As it is now, you can just compare (item['chunk_ids_checksum'], item['chunker_params']) for each item after parsing the output, as sketched below.
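
For example, a consumer of borg list --json-lines output could group files like this (hypothetical key names following the comment above, not borg's actual JSON schema):

import json
from collections import defaultdict

def group_identical(json_lines):
    # Files sharing the same (chunker params, chunk ids checksum) pair
    # can be assumed to have identical content.
    groups = defaultdict(list)
    for line in json_lines:
        item = json.loads(line)
        key = (item['chunker_params'], item['chunk_ids_checksum'])
        groups[key].append(item['path'])
    return groups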

    def hash_item(self, hash_function, item):
        if 'chunks' not in item:
            return ""

    def prepare_hash_function(self, hash_function):
elho (Contributor) commented on the diff:
This does not "prepare" the hash function in any way; it just picks one by its name, which is what one passes in, rather than a function. Naming it something like get_hash_function(self, name) would be less misleading and more self-explanatory.

elho (Contributor) commented Jul 6, 2020:

> But false negatives could happen (same file showing different hashes due to different chunker params)

While ideally one would of course prefer a hash that can be generated quickly, without having to look at the file content, and that still guarantees the content to be identical (collisions aside), in practice that is not possible.

What is possible are the existing hashes/checksums, which are slow because they read the file content.

Also possible is this shortcut of quickly getting the list of chunks: once it is established (by other means) that the actual file content has been put into these chunks, and (by borg check) that the content of the chunks is unchanged, one can deduce that the file still has the same contents, as it is still composed of the same chunks in the same order.

If, however, one rechunks a repo, each chunk is read, the data is processed (encryption, compression) and written to a new, different chunk. Without a separate verification of the file content, there is then no guarantee that no bit error happened in that process, so to me it is the desired behaviour for a hash based on the list of chunks to be different, because it is a different set of chunks. One can quickly generate new lists of hashes of the new chunks, and after having somehow verified (still much more slowly) that all the new chunk contents are correct as well, the new hashes can be used like the old ones.

elho (Contributor) commented Jul 6, 2020:

> 1. I don't think a hash over the chunker params is useful. These are only a few values, and looking at them "as is" is simpler than looking at a hash of them.

Yes, my first thought as well. Plus, for people who do change them at some point, seeing which ones (i.e. the current or older ones) were used for a given file in the list output is a benefit as well.

> And comparing hashes of lists is about as good as comparing the complete lists, except for the minimal risk of a hash collision.

Minus the potential extra "collision" if two chunks of different data (due to different chunker params) end up with the same hash, which for a single-chunk file would already be enough. That could be prevented by computing the hash over the chunker params and the chunk IDs, calling it something like {chunks_fingerprint_*} to reflect that it is something abstract that characterizes the chunks the file consists of (see the sketch below).
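
A minimal sketch of that idea (a hypothetical helper; chunker_params is assumed to be a tuple of primitives and chunk ids to be bytes):

import hashlib

def chunks_fingerprint(algo, chunker_params, chunk_ids):
    # Mixing the chunker params into the digest ensures that the same
    # content chunked with different params can never produce the same
    # fingerprint.
    h = hashlib.new(algo)
    h.update(repr(chunker_params).encode())
    for chunk_id in chunk_ids:
        h.update(chunk_id)
    return h.hexdigest()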

> 3. If you first create a list with hash algo x and later accidentally create a list with hash algo y, the hashes for the same file will be different.

The exact same could be said of the various existing hashes of file contents, yet no one (I hope 😉) would pose this as a valid argument to remove them or substitute them with a single one, so I do not see it as a valid argument here.
With a fixed algorithm, should it at some point be deemed not good enough anymore, or worse, outright broken, there needs to be a transition, and people need to wait for a newer borg with the new hash to trickle down to the next stable release of their distro of choice, etc.
With the ability to choose any of the supported hash algorithms, people can generate lists using one or multiple hashes, either now (to be prepared for one of them becoming "bad") or right at the point where the news of the one they use being broken goes public. If they are more concerned about data corruption by bad hardware than about advanced attackers, they can also still use the old hash to verify integrity against old hash lists and subsequently redo them with new hashes, or whatever fits their use case.

elho (Contributor) commented Aug 25, 2020:

I meanwhile have the functionality I was after, i.e. a list of the actual chunks rather than a hash over them, implemented in shell using borg debug and jq:

borg debug dump-archive REPO::ARCHIVE - \
        | jq '._items[] | if .chunks != null then "\(.path)  \(.chunks | map(.[0] | ltrimstr("\u007f")) | join(","))" else "\(.path)  -" end' \
        | sed --expression='s/^"//;s/"$//'

Not sure yet why borg debug inserts that unicode DELETE char in front of each chunk ID, but the above command produces the same output as my own (hackish) patch adding a chunkids format key to list did. 😃

ThomasWaldmann (Member) commented:

See the comment at https://github.com/borgbackup/borg/blob/1.1-maint/src/borg/helpers.py#L1240 about the special character.

ThomasWaldmann (Member) commented Mar 12, 2023:

Due to @solrize asking about it on IRC, I have thought about such a "full file hash" (computed over the chunk ids list) again.

I guess, as it is now, the code here can only be used to compare files within the same repo, IF the chunker_params are the same.

With borg2, we'll get the concept of "related repositories" that share the same chunker secret and the same id-hash secret. Thus (if the chunker params are also the same), files will get chunked exactly the same way and the chunk ids will also be computed the same way. So we could quickly determine identical files in different related repos if these conditions are met.

conditions_fingerprint = H(chunker_secret, chunker_params, id_hash_secret, id_hash_algo)

chunks_fingerprint = H(chunks_id_list)

fingerprint = conditions_fingerprint + "-" + chunks_fingerprint

Note:

  • If the left part (conditions fingerprint) is already different, there is no point in comparing the fingerprints.
  • If the left part is identical and the right part (chunks fingerprint) is different, the file content is different.
  • If the left part is identical and the right part is also identical, the file content is identical.

H could be one specific cryptographic hash. Considering that sha256 is hardware accelerated on modern CPUs and the data isn't that much anyway, we could just take that.

Also: we could store the fingerprint into the item metadata at borg create time, if we like. Maybe not worth it though, as it should be rather quick to compute at any time.
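
The comparison rules from the note above could then be expressed like this (a sketch; the inputs are assumed to be the combined "conditions-chunks" strings described above):

def compare_fingerprints(fp1, fp2):
    # Returns True/False when the fingerprints are comparable, and None
    # when the conditions parts differ, i.e. no conclusion is possible.
    cond1, _, chunks1 = fp1.partition('-')
    cond2, _, chunks2 = fp2.partition('-')
    if cond1 != cond2:
        return None
    return chunks1 == chunks2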
