
Expose hash over ChunkListEntry to borg list --format {chunk_ids_$HASH} #5167

Draft · hrehfeld wants to merge 6 commits into master
Conversation

hrehfeld (Contributor) opened this pull request:

  • First time messing with the code; no tests yet, can somebody help write one?
  • Existing tests and flake8 give no warnings.
  • Not sure about the format key naming, suggestions?

Rationale: computing checksums is slow while borg diff is much faster, so I tried to expose in borg list a hash over the data that borg diff seems to use to compare items.
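
For illustration, usage would look something like this (a hypothetical invocation; sha256 is assumed here to be among the supported hash algorithms):

borg list --format '{path} {chunk_ids_sha256}{NL}' /path/to/repo::archive

Two files that show the same chunk ids hash (and, per the discussion below, were chunked with the same chunker params) can then be assumed to have identical content.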

hrehfeld changed the title from "Expose hash over ChunkListEntry data to borg list --format {chunk_ids_$HASH}" to "Expose hash over ChunkListEntry to borg list --format {chunk_ids_$HASH}" on Apr 29, 2020
codecov-io commented Apr 29, 2020

Codecov Report

Merging #5167 into master will decrease coverage by 0.11%.
The diff coverage is 44.44%.


@@            Coverage Diff             @@
##           master    #5167      +/-   ##
==========================================
- Coverage   83.98%   83.86%   -0.12%     
==========================================
  Files          37       37              
  Lines        9839     9854      +15     
  Branches     1638     1640       +2     
==========================================
+ Hits         8263     8264       +1     
- Misses       1097     1110      +13     
- Partials      479      480       +1     
Impacted Files                    Coverage Δ
src/borg/helpers/parseformat.py   87.66% <44.44%> (-1.36%) ⬇️
src/borg/archive.py               82.62% <0.00%> (-0.62%) ⬇️
src/borg/repository.py            84.29% <0.00%> (-0.19%) ⬇️
src/borg/archiver.py              81.39% <0.00%> (+0.35%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ThomasWaldmann (Member) left a comment:

Thanks for the PR!

Yeah, I see the use case, but I am somewhat concerned that a lot of people might use it incorrectly, draw wrong conclusions, and file support tickets if the outcome is not what they expected...

The problem is that chunk lists (or hashes of chunk lists) can only be compared if can_compare_chunk_ids is True (see do_diff). As we do not know how people will use the chunk list hashes that borg list displays, we can't be sure they are doing valid comparisons.

OTOH, chunker params usually do not change that often, and false positives (different files showing the same hash) are unlikely. But false negatives could happen (same file showing different hashes due to different chunker params, or even a different hash algorithm, as you let people choose it).
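
For context, a simplified sketch of the precondition do_diff checks (not the verbatim borg code; in 1.x the flag is derived from the archives' chunker_params metadata):

def can_compare_chunk_ids(archive1_meta, archive2_meta):
    # Chunk id lists are only comparable when both archives were
    # chunked with the same parameters.
    return archive1_meta.get('chunker_params') == archive2_meta.get('chunker_params')

def items_identical(item1_chunks, item2_chunks, comparable):
    # Only meaningful when `comparable` is True; otherwise identical
    # content may still yield different chunk lists (a false negative).
    return comparable and item1_chunks == item2_chunks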

hrehfeld (Contributor, Author) replied:

> Yeah, I see the use case, but I am somewhat concerned that a lot of people might use it incorrectly, draw wrong conclusions, and file support tickets if the outcome is not what they expected...

Good point. I will try to add a hashsum of the chunker parameters themselves as well, and add a note in the format key documentation that it is only valid to compare these hashes if the chunker parameters are equal.

> OTOH, chunker params usually do not change that often, and false positives (different files showing the same hash) are unlikely.

Is there anything here that could cause false positives besides the usual hash algo collisions? Are those more likely here?

> But false negatives could happen (same file showing different hashes due to different chunker params, or even a different hash algorithm, as you let people choose it).

Hmm, why can the choice of hash algo cause false negatives?

One more question: this is doing the correct thing, isn't it? If the chunker parameters are equal, equal checksums of the chunk ids would mean the content of the item/file is identical?

Can you maybe write a test for this, or point me to a test that I can copy and modify that tests similar functionality?

ThomasWaldmann (Member) commented Apr 30, 2020:

  1. I don't think a hash over the chunker params is useful. These are only a few values, and looking at them "as is" is simpler than looking at a hash of them.

  2. I don't think so.

  3. If you first create a list with hash algo x and later accidentally create a list with hash algo y, the hashes for the same file will be different. If the hash bit length differs, the reason might be pretty obvious; but if only the algorithm differs and the length is the same, it is not that obvious.

Yes, I think it is correct. The borg diff code has identical = (item1.chunks == item2.chunks) if can_compare_chunk_ids is True. And comparing hashes of lists is about as good as comparing the complete lists, except for the minimal risk of a hash collision.
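
In code, the idea looks roughly like this (a minimal sketch; chunks is assumed to be a list of (id, size, csize) entries like borg's ChunkListEntry, with id being bytes):

import hashlib

def chunks_hash(chunks, algo='sha256'):
    # Hash the concatenated chunk ids; comparing two such digests is as
    # good as comparing the chunk lists themselves, up to the negligible
    # risk of a hash collision.
    h = hashlib.new(algo)
    for chunk_id, size, csize in chunks:
        h.update(chunk_id)
    return h.hexdigest()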

@@ -688,7 +688,8 @@ class ItemFormatter(BaseFormatter):
         ('size', 'csize', 'dsize', 'dcsize', 'num_chunks', 'unique_chunks'),
         ('mtime', 'ctime', 'atime', 'isomtime', 'isoctime', 'isoatime'),
         tuple(sorted(hash_algorithms)),
-        tuple(('chunk_ids_%s' % alg for alg in sorted(hash_algorithms))),
+        tuple(['chunk_ids_%s' % alg for alg in sorted(hash_algorithms)] +
+              ['chunker_params_%s' % alg for alg in sorted(hash_algorithms)]),
ThomasWaldmann (Member) commented on the diff:
The chunker params are the same within one archive, so I guess you do not want to show them for each file?

hrehfeld (Contributor, Author) replied:
Hmm, true, but that would make parsing the output of list harder. As it is now, you can just compare (item['chunk_ids_checksum'], item['chunker_params']) for each item after parsing the output, as sketched below.
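
For example, a consumer of borg list --json-lines output could group files like this (hypothetical key names following the comment above, not borg's actual JSON schema):

import json
from collections import defaultdict

def group_identical(json_lines):
    # Files sharing the same (chunker params, chunk ids checksum) pair
    # can be assumed to have identical content.
    groups = defaultdict(list)
    for line in json_lines:
        item = json.loads(line)
        key = (item['chunker_params'], item['chunk_ids_checksum'])
        groups[key].append(item['path'])
    return groups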

    def hash_item(self, hash_function, item):
        if 'chunks' not in item:
            return ""

    def prepare_hash_function(self, hash_function):
elho (Contributor) commented on the diff:
This does not "prepare" the hash function in any way; it just picks one by its name, which is what one passes in, rather than a function. Naming it something like get_hash_function(self, name) would be less misleading and more self-explanatory.

elho (Contributor) commented Jul 6, 2020:

> But false negatives could happen (same file showing different hashes due to different chunker params)

While ideally one would of course prefer a hash that can be generated quickly, without having to look at the file content, and that still guarantees the content to be identical (collisions aside), in practice that is not possible.

What is possible are the existing hashes/checksums, which are slow because they read the file content.

Also possible is this shortcut of quickly getting the list of chunks: once it is established (by other means) that the actual file content has been put into these chunks, and (by borg check) that the content of the chunks is unchanged, one can deduce that the file still has the same contents, as it is still composed of the same chunks in the same order.

If, however, one rechunks a repo, each chunk is read, the data is processed (encryption, compression) and written to a new, different chunk. Without a separate verification of the file content, there is then no guarantee that no bit error happened in that process, so to me it is the desired behaviour for a hash based on the list of chunks to be different, because it is a different set of chunks. One can quickly generate new lists of hashes of the new chunks, and after having somehow verified (still much more slowly) that all the new chunk contents are correct as well, the new hashes can be used like the old ones.

elho (Contributor) commented Jul 6, 2020:

> 1. I don't think a hash over the chunker params is useful. These are only a few values, and looking at them "as is" is simpler than looking at a hash of them.

Yes, my first thought as well. Plus, for people who do change them at some point, seeing which ones (i.e. the current or older ones) were used for a given file in the list output is a benefit as well.

> And comparing hashes of lists is about as good as comparing the complete lists, except for the minimal risk of a hash collision.

Minus the potential extra "collision" if two chunks of different data (due to different chunker params) end up with the same hash, which for a single-chunk file would already be enough. That could be prevented by computing the hash over the chunker params and the chunk IDs, calling it something like {chunks_fingerprint_*} to reflect that it is something abstract that characterizes the chunks the file consists of (see the sketch below).
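
A minimal sketch of that idea (a hypothetical helper; chunker_params is assumed to be a tuple of primitives and chunk ids to be bytes):

import hashlib

def chunks_fingerprint(algo, chunker_params, chunk_ids):
    # Mixing the chunker params into the digest ensures that the same
    # content chunked with different params can never produce the same
    # fingerprint.
    h = hashlib.new(algo)
    h.update(repr(chunker_params).encode())
    for chunk_id in chunk_ids:
        h.update(chunk_id)
    return h.hexdigest()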

> 3. If you first create a list with hash algo x and later accidentally create a list with hash algo y, the hashes for the same file will be different.

The exact same could be said of the various existing hashes of file contents, yet no one (I hope 😉) would pose this as a valid argument to remove them or substitute them with a single one, so I do not see it as a valid argument here.
With a fixed algorithm, should it at some point be deemed not good enough anymore, or worse, outright broken, there needs to be a transition, and people need to wait for a newer borg with the new hash to trickle down to the next stable release of their distro of choice, etc.
With the ability to choose any of the supported hash algorithms, people can generate lists using one or multiple hashes, either now (to be prepared for one of them becoming "bad") or right at the point where the news of the one they use being broken goes public. If they are more concerned about data corruption by bad hardware than about advanced attackers, they can also still use the old hash to verify integrity against old hash lists and subsequently redo them with new hashes, or whatever fits their use case.

elho (Contributor) commented Aug 25, 2020:

I meanwhile have the functionality I was after, i.e. a list of the actual chunks rather than a hash over them, implemented in shell using borg debug and jq:

borg debug dump-archive REPO::ARCHIVE - \
        | jq '._items[] | if .chunks != null then "\(.path)  \(.chunks | map(.[0] | ltrimstr("\u007f")) | join(","))" else "\(.path)  -" end' \
        | sed --expression='s/^"//;s/"$//'

Not sure yet why borg debug inserts that unicode DELETE char in front of each chunk ID, but the above command produces the same output as my own (hackish) patch adding a chunkids format key to list did. 😃

ThomasWaldmann (Member) commented:

See the comment at https://github.com/borgbackup/borg/blob/1.1-maint/src/borg/helpers.py#L1240 about the special character.

ThomasWaldmann (Member) commented Mar 12, 2023:

Due to @solrize asking about it on IRC, I have thought about such a "full file hash" (computed over the chunk ids list) again.

I guess, as it is now, the code here can only be used to compare files within the same repo, IF the chunker_params are the same.

With borg2, we'll get the concept of "related repositories" that share the same chunker secret and the same id-hash secret. Thus (if the chunker params are also the same), files will get chunked exactly the same way and the chunk ids will also be computed the same way. So we could quickly determine identical files in different related repos if these conditions are met.

conditions_fingerprint = H(chunker_secret, chunker_params, id_hash_secret, id_hash_algo)

chunks_fingerprint = H(chunks_id_list)

fingerprint = conditions_fingerprint + "-" + chunks_fingerprint

Note:

  • If the left part (conditions fingerprint) is already different, there is no point in comparing the fingerprints.
  • If the left part is identical and the right part (chunks fingerprint) is different, the file content is different.
  • If the left part is identical and the right part is also identical, the file content is identical.

H could be one specific cryptographic hash. Considering that sha256 is hardware accelerated on modern CPUs and the data isn't that much anyway, we could just take that.

Also: we could store the fingerprint into the item metadata at borg create time, if we like. Maybe not worth it though, as it should be rather quick to compute at any time.
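
The comparison rules from the note above could then be expressed like this (a sketch; the inputs are assumed to be the combined "conditions-chunks" strings described above):

def compare_fingerprints(fp1, fp2):
    # Returns True/False when the fingerprints are comparable, and None
    # when the conditions parts differ, i.e. no conclusion is possible.
    cond1, _, chunks1 = fp1.partition('-')
    cond2, _, chunks2 = fp2.partition('-')
    if cond1 != cond2:
        return None
    return chunks1 == chunks2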
