Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A test with the files which actually collide #82

Open
ghost opened this issue Oct 2, 2023 · 95 comments
Open

A test with the files which actually collide #82

ghost opened this issue Oct 2, 2023 · 95 comments

Comments

@ghost
Copy link

ghost commented Oct 2, 2023

Attached is a zip file with two files which have the same sha1 but different sha256.

I'm trying to add the folder with them to the zpaqfranz archive, and I always see only one of them stored under the two names, not two of them:

C:\tmp\collision>unzip collision-example.zip
Archive:  collision-example.zip
   creating: baad/
 extracting: baad/messageA
 extracting: baad/messageB

C:\tmp\collision>openssl dgst -sha256 baad\*
SHA256(baad\messageA)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c
SHA256(baad\messageB)= 208feafe1c6a95c73f662514ac48761f25e1f3b74922521a98d9ce287f4a2197

C:\tmp\collision>openssl dgst -sha1 baad\*
SHA1(baad\messageA)= 8ac60ba76f1999a1ab70223f225aefdc78d4ddc0
SHA1(baad\messageB)= 8ac60ba76f1999a1ab70223f225aefdc78d4ddc0

C:\tmp\collision>zpaqfranz.exe a baad.zpaqf baad
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-noconsole
Creating baad.zpaqf at offset 0 + 0
Add 2023-10-02 03:39:47         2              1.280 (   1.25 KB) 8T (1 dirs)
3 +added, 0 -removed.

0 + (1.280 -> 640 -> 1.842) = 1.842 @ 26.60 KB/s

0.047 seconds (000:00:00) (all OK)

C:\tmp\collision>mkdir result

C:\tmp\collision>cd result

C:\tmp\collision\result>..\zpaqfranz.exe x ..\baad.zpaqf
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-noconsole
../baad.zpaqf:
1 versions, 3 files, 1.842 bytes (1.80 KB)
Extract 1.280 bytes (1.25 KB) in 2 files (1 folders) / 8 T


0.031 seconds (000:00:00) (all OK)

C:\tmp\collision\result>openssl dgst -sha256 baad\*
SHA256(baad\messageA)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c
SHA256(baad\messageB)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c

Obviously one content under two names is there.

I've read "additional checks" are default, it's unexpected. Maybe the switch is needed:

C:\tmp\collision\result>cd ..

C:\tmp\collision>zpaqfranz.exe a baad-2.zpaqf baad -sha256
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-sha256 -noconsole
Creating baad-2.zpaqf at offset 0 + 0
Add 2023-10-02 03:41:54         2              1.280 (   1.25 KB) 8T (1 dirs)
3 +added, 0 -removed.

0 + (1.280 -> 640 -> 1.982) = 1.982 @ 26.60 KB/s

0.047 seconds (000:00:00) (all OK)

C:\tmp\collision>mkdir result-256

C:\tmp\collision>cd result-256

C:\tmp\collision\result-256>..\zpaqfranz.exe x ..\baad-2.zpaqf
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-noconsole
../baad-2.zpaqf:
1 versions, 3 files, 1.982 bytes (1.94 KB)
Extract 1.280 bytes (1.25 KB) in 2 files (1 folders) / 8 T


0.032 seconds (000:00:00) (all OK)

C:\tmp\collision\result-256>openssl dgst -sha256 baad\*
SHA256(baad\messageA)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c
SHA256(baad\messageB)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c

collision-example.zip

@fcorbelli
Copy link
Owner

This is expected behavior
There is no way to avoid SHA-1 collisions (other than breaking backwards compatibility of .zpaq archives, which I don't want to do)

zpaq simply doesn't detect anything, it doesn't provide any warnings.
Look at this example

Z:\pippo>zpaqfranz sum message* -sha256
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:sum                                       1 - command
franz:-sha256 -hw
Getting SHA-256 ignoring .zfs and :$DATA

No multithread: Found (1.25 KB) => 1.280 bytes (1.25 KB) / 2 files in 0.000000
|SHA-256: 208FEAFE1C6A95C73F662514AC48761F25E1F3B74922521A98D9CE287F4A2197 [                640]     |messageB
|SHA-256: 3EAD211681CEC93D265C8AC123DD062E105408CEBF82FA6E2B126F4F40BCB88C [                640]     |messageA

SHA-1 collision

Z:\pippo>zpaqfranz sum message* -sha1
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:sum                                       1 - command
franz:-sha1 -hw
Getting SHA-1 ignoring .zfs and :$DATA

No multithread: Found (1.25 KB) => 1.280 bytes (1.25 KB) / 2 files in 0.000000
|SHA-1: 8AC60BA76F1999A1AB70223F225AEFDC78D4DDC0 [                640]     |messageA
|SHA-1: 8AC60BA76F1999A1AB70223F225AEFDC78D4DDC0 [                640] === |messageB
Z:\pippo>zpaq64 a 715.zpaq *
zpaq v7.15 journaling archiver, compiled Aug 17 2016
Creating 715.zpaq at offset 0 + 0
Adding 0.001280 MB in 2 files -method 14 -threads 32 at 2023-10-02 08:28:34.
+ messageA 640
+ messageB 640 -> 0
[1..1] 652 -method 14,20,0
2 +added, 0 -removed.

0.000000 + (0.001280 -> 0.000640 -> 0.001739) = 0.001739 MB
0.016 seconds (all OK)

Z:\pippo>zpaq64 x 715.zpaq -test
zpaq v7.15 journaling archiver, compiled Aug 17 2016
715.zpaq: 1 versions, 2 files, 1 fragments, 0.001739 MB
Extracting 0.001280 MB in 2 files -threads 32
[1..1] -> 652
> messageA
> messageB
0.000 seconds (all OK)

Then

Z:\pippo>zpaq64 x 715.zpaq -to extracted
zpaq v7.15 journaling archiver, compiled Aug 17 2016
715.zpaq: 1 versions, 2 files, 1 fragments, 0.001739 MB
Extracting 0.001280 MB in 2 files -threads 32
[1..1] -> 652
> extracted/messageA
> extracted/messageB
0.016 seconds (all OK)

Finally

Z:\pippo>zpaqfranz sum extracted -sha256
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:sum                                       1 - command
franz:-sha256 -hw
Getting SHA-256 ignoring .zfs and :$DATA

No multithread: Found (1.25 KB) => 1.280 bytes (1.25 KB) / 2 files in 0.016000
|SHA-256: 3EAD211681CEC93D265C8AC123DD062E105408CEBF82FA6E2B126F4F40BCB88C [                640]     |extracted/messageA
|SHA-256: 3EAD211681CEC93D265C8AC123DD062E105408CEBF82FA6E2B126F4F40BCB88C [                640] === |extracted/messageB

0.079 seconds (000:00:00) (all OK)

As you can see, zpaq

zpaq happily extracts the data, without any warning that a collision has occurred.
Attention, we are talking about DETECTION, not AVOIDANCE (impossible to avoid with this archive file format)

@fcorbelli
Copy link
Owner

fcorbelli commented Oct 2, 2023

Now let's see zpaqfranz default

Z:\pippo>zpaqfranz a zpaqfranz.zpaq messag*
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:-hw
Creating zpaqfranz.zpaq at offset 0 + 0
Add 2023-10-02 10:33:04         2              1.280 (   1.25 KB) 32T (0 dirs)
2 +added, 0 -removed.

0 + (1.280 -> 640 -> 1.803) = 1.803 @ 83.33 KB/s

0.031 seconds (000:00:00) (all OK)

This time, if you test the file, you'll get

Z:\pippo>zpaqfranz t zpaqfranz.zpaq
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:-hw
zpaqfranz.zpaq:
1 versions, 2 files, 1.803 bytes (1.76 KB)
To be checked 1.280 in 2 files (32 threads)
7.15 stage time       0.02 no error detected (RAM ~34.07 MB), try CRC-32 (if any)
Checking                 2 blocks with CRC-32 (1.280 not-0 bytes)
ERROR:  STORED CRC-32 92433266 != DECOMPRESSED 072E2B0E (ck 00000001) messageB

CRC-32 time           0.00s
Blocks               1.280 (           2)
Zeros                    0 (           0) 0.000000 s
Total                1.280 speed 1.280.000/sec (1.22 MB/s)
ERRORS        : 00000001 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
-----------------------------------------------------------------------------------------------------
GOOD            : 00000001 of 00000002 (stored=decompressed)
WITH ERRORS

0.032 seconds (000:00:00) (with warnings)

If you put a "-verify" in add

Z:\pippo>zpaqfranz a test2.zpaq mess* -verify
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:-hw -verify
Creating test2.zpaq at offset 0 + 0
Add 2023-10-02 10:35:52         2              1.280 (   1.25 KB) 32T (0 dirs)
29604 SOMETHING WRONG ON messageA

GURU-C: on file messageA
GURU: CRC-32 from fragments 92433266
GURU: CRC-32 from file      072E2B0E
2 +added, 0 -removed.

0 + (1.280 -> 640 -> 1.794) = 1.794 @ 40.32 KB/s

Short version:
no way to DETECT a SHA-1 collision in zpaq
yes, you can DETECT SHA-1 collision in zpaqfranz

EDIT: for performance reasons, the double-check is NOT active during add() but in test(). Older zpaqfranz builds do this way, but it was a modified behavior so as not to slow down in the normal case

@ghost
Copy link
Author

ghost commented Oct 2, 2023

EDIT: for performance reasons, the double-check is NOT active during add() but in test(). Older zpaqfranz builds do this way, but it was a modified behavior so as not to slow down in the normal case

When I pack 3 files, messageA, messageB and some-big.mp4 then, to find the collision, if I invoke "t" the time needed will be proportional to the number of all the bytes in the archive:

Checking                68 blocks with CRC-32 (1.089.413.820 not-0 bytes)
ERROR:  STORED CRC-32 92433266 != DECOMPRESSED 072E2B0E (ck 00000001) messageB

CRC-32 time           0.01s
Blocks       1.089.413.820 (          68)
Total        1.089.413.820 speed 34.044.181.875/sec (31.71 GB/s)
  ...
ERRORS        : 00000001 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
   ...
6.375 seconds (000:00:06) (with warnings)

But if I could just get a list of the filenames, the values of the stored crc32 and the values of the sha1 of each file, I could detect the collisions in the above example in a few milliseconds, i.e. not bound by the number of bytes stored. It's just a comparison of crc32 values for the same sha1 values.

  1. How can I get such a list from the archive using your program? If it's not possible, what do you think about implementing in your program such kind of listing?

  2. What do you think about implementing a possibility to perform such kind of a collision detection in your program (not bound by the number of bytes stored)?

@fcorbelli
Copy link
Owner

When I pack 3 files, messageA, messageB and some-big.mp4 then, to find the collision, if I invoke "t" the time needed will be proportional to the number of all the bytes in the archive:
(...)
But if I could just get a list of the filenames, the values of the stored crc32 and the values of the sha1 of each file, I could detect the collisions in the above example in a few milliseconds, i.e. not bound by the number of bytes stored. It's just a comparison of crc32 values for the same sha1 values.

1. How can I get such a list from the archive using your program? If it's not possible, what do you think about implementing in your program such kind of listing?

with the -checksum switch

zpaqfranz l test.zpaq -checksum
2. What do you think about implementing a possibility to perform such kind of a collision detection in your program (not bound by the number of bytes stored)?

This require to compute and store the SHA-1 of the entire file (aka: overhead)
But yes, you can, if you really want

zpaqfranz a test5 message* -sha1
Z:\pippo>zpaqfranz l test5.zpaq -checksum
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:-checksum -hw
test5.zpaq:
1 versions, 2 files, 1.810 bytes (1.77 KB)


- 2023-10-02 00:53:08                 640 A     SHA-1: 8AC60BA76F1999A1AB70223F225AEFDC78D4DDC0 CRC32: 072E2B0E messageA
- 2023-10-02 00:53:11                 640 A     SHA-1: 8AC60BA76F1999A1AB70223F225AEFDC78D4DDC0 CRC32: 92433266 messageB

                1.280 (1.25 KB) of 1.280 (1.25 KB) in 2 files shown
                1.810 compressed

0.000 seconds (000:00:00) (all OK)

@ghost
Copy link
Author

ghost commented Oct 2, 2023

For my example I see:

 ../zpaqfranz.exe l ../a1.zpaq -checksum
   ...
- 2023-10-02 00:53:08                 640 A     XXHASH64: 5ED04C85E7C36E96 CRC32: 072E2B0E messageA
- 2023-10-02 00:53:11                 640 A     XXHASH64: E9127211F02DB4EE CRC32: 92433266 messageB
- 2022-05-19 15:12:12       1.089.412.540 A     XXHASH64: 27841A60E9F3A8CF CRC32: A55A20FE some-big.mp4

That is, I see different XXHASH64 anyway, so crc32 appears completely unnecessary?

Then, if I'd use the original zpaq (I can't get the same result with your) I can get the list of the files which are "same" even if now I know they have different XXHASH64:

zpaq l ../a1.zpaq -s-1
zpaq v7.15 journaling archiver, compiled Aug 17 2016
../a1.zpaq: 1 versions, 3 files, 15292 fragments, 952.652696 MB

- 2023-10-01 22:53:08          640 A     messageA 1
- 2023-10-01 22:53:11          640 A     messageB 1
- 2022-05-19 13:12:12   1089412540 A     some-big.mp4 2-15292    

Which means that I can recognize the collisions without decompressing all the files, without even knowing crc32, and all that based on the information already stored in the existing archive packed by zpaqfranz

What do you think about all that?

@fcorbelli
Copy link
Owner

For my example I see:

 ../zpaqfranz.exe l ../a1.zpaq -checksum

You missed the -sha1 switch (creating the archive)

That is, I see different XXHASH64 anyway, so crc32 appears completely unnecessary?
No, because CRC-32 is used for SHA-1-collision-detection

Then, if I'd use the original zpag (I can't get the same result with your) I can get the list of the files which are "same" even if now I know they have different XXHASH64:

zpaq l ../a1.zpaq -s-1
zpaq v7.15 journaling archiver, compiled Aug 17 2016
../a1.zpaq: 1 versions, 3 files, 15292 fragments, 952.652696 MB

- 2023-10-01 22:53:08          640 A     messageA 1
- 2023-10-01 22:53:11          640 A     messageB 1
- 2022-05-19 13:12:12   1089412540 A     random.bin 2-15292    

Which means that I can recognize the collisions without decompressing all the files, without even knowing crc32, and all that based on the information already stored in the existing archive packed by zpaqfranz

What do you think about all that?

You can't at all, with zpaq

With zpaqfranz you can (manually) if you use the -sha1 switch (when adding files)

I can make a new (simply) switch that

  • store the SHA1 (just like -sha1) in the a (add)
  • quickly check if some files have the same SHA1 but different CRC-32 in the t (test)

Or a new (more work) switch that

  • store the SHA-1 TOO (with XXHASH64 or whatever)
  • quickly check if some files get the same SHA1 but different CRC-32 DURING the a (add). This seems a bit overkill, even for me

@ghost
Copy link
Author

ghost commented Oct 2, 2023

You missed the -sha1 switch (creating the archive)

I have just given you an example where without that -sha1 all necessary info to report the collision is already stored in the file!

Specifically, it is known that both files share the same fragment (fragment 1):

 - 2023-10-01 22:53:08          640 A     messageA 1
 - 2023-10-01 22:53:11          640 A     messageB 1

which is what zpaq is able to list with

 zpaq l ../a1.zpaq -s-1

as I've already shown. Please try yourself.

@fcorbelli
Copy link
Owner

No, because different files can share the same fragment
That's exactly the "deduplicator"

@fcorbelli
Copy link
Owner

  jDC20231002114900c0000000001 8 jDC� ->  8 OK
    csize = 1061 at 104
  jDC20231002114900d0000000001 652 jDC� ->  652 OK
    1061 -> 652 [1..2) = 640 OK
  jDC20231002114900h0000000001 28 jDC� ->  28 OK
    [1..2) 1061 -> 640 OK
  jDC20231002114900i0000000001 68 jDC� ->  68 OK
    20231001225308 messageA 7720000000 1
    20231001225311 messageB 7720000000 1

Nothing "strange" here

@ghost
Copy link
Author

ghost commented Oct 2, 2023

No, because different files can share the same fragment
That's exactly the "deduplicator"

If two files share all the fragments, but they have different XXHASH64 they were deduplicated due to the same sha1 even if they shouldn't have been (the different fragments should have been stored). I assumed that is exactly what one would like to know?

@fcorbelli
Copy link
Owner

I could extract the list of fragments of each file, and verify that those with the same list have different CRC-32.
Yes, I can, but does it make sense?
This is a function for command t (test) not of a (add)
A corruption of a block, which you cannot check from the fragment list, is much more frequent (and dangerous) than a SHA-1 collision

@ghost
Copy link
Author

ghost commented Oct 2, 2023

does it make sense?

Why would anybody want to not be notified of data loss produced by the execution of "a" ?

If that would be immediately reported, one would be able to immediately take actions to still preserve the data which could otherwise be lost at the moment one does "t" which can take time. I am not suggesting stopping the process, but precisely reporting the problem of "a" which could be known at the very time "a" is perfomed.

@fcorbelli
Copy link
Owner

does it make sense?

Why would anybody want to not be notified of data loss produced by the execution of "a" ?

Because it's veeeery uncommon

If that would be immediately reported, one would be able to immediately take actions to still preserve the data which could otherwise be lost at the moment one does "t" which can take time. I don't suggest stopping the process, but precisely reporting the problem of "a" which could be known at the very time "a" is perfomed.

zpaqfranz can already do that, but with more overhead
Some users want faster, not safer.

I need to do some tests on fragment packaging times, for a quantitative evaluation

@ghost
Copy link
Author

ghost commented Oct 2, 2023

If that would be immediately reported, one would be able to immediately take actions to still preserve the data which could otherwise be lost at the moment one does "t" which can take time.

zpaqfranz can already do that, but with more overhead

Again, I still miss this: how can one recognize the error of "a" without running "t" in zpaqfranz ?

@fcorbelli
Copy link
Owner

I'll try to explain better
During the add phase, for each file, I should keep the list of fragments that compose it (somehow compact) and its CRC-32, and then search for each file already in the archive (and there can be millions) , if the list of fragments (which can be hundreds of thousands per file) matches.
It's doable, with a little effort, but the overhead could be significant, i.e. slowing down every single add, whether the collision occurs or not

@fcorbelli
Copy link
Owner

If that would be immediately reported, one would be able to immediately take actions to still preserve the data which could otherwise be lost at the moment one does "t" which can take time.

zpaqfranz can already do that, but with more overhead

Again, I still miss this: how can one recognize the error of "a" without running "t" in zpaqfranz ?

With the -verify switch

@ghost
Copy link
Author

ghost commented Oct 2, 2023

Again, I still miss this: how can one recognize the error of "a" without running "t" in zpaqfranz ?

With the -verify switch

Ah, thanks, I still don't understand what is the cause of performance hit you claim in:

[EDIT: for performance reasons, the double-check is NOT active during add() but in test().

I still believe the overhead of the verification if there was data loss due to the run of "a" could be practically unnoticeable, compared to the current cost of "t"

Without the existence of the list of fragments the decompression is impossible, and that list could be generated, if it already doesn't exist, much faster than decompressing all the files.

And if the "list" of fragments in the archive is already a stored as a tree then comparing if two files are stored as the same "list" of fragments is equivalent to comparison if two files have the identical root node?

@ghost
Copy link
Author

ghost commented Oct 2, 2023

I fully agree with you that "a corruption of a block, which you cannot check from the fragment list, is much more frequent (and dangerous) than a SHA-1 collision".

@fcorbelli
Copy link
Owner

Right now, just for fun, I'm writing a hash "packager" of the fragment verctos.
The real question to ask is whether I also want to keep the file name or not
That is, whether to have a "SHA-1 collision found" warning, period, or "collision with file X".
In the second case more memory and more time, but perhaps it is better

...work in progress...

@ghost
Copy link
Author

ghost commented Oct 2, 2023

IMO it should be not about "sha1 collision" as such but about the "data loss" due to the deduplication algorithm used in "a", even if it were something else and I think such "data loss" can be detected as soon as the algorithm used for deduplication is not the same as some checksum which is also available in the process.

Regarding the file names, I would expect they are also in some way deduplicated in RAM and only references to the "roots" of them used in the code?

@ghost
Copy link
Author

ghost commented Oct 2, 2023

IMO it should be not about "sha1 collision" as such but about the "data loss" due to the deduplication algorithm used in "a", even if it were something else and I think such "data loss" can be detected as soon as the algorithm used for deduplication is not the same as some checksum which is also available in the process.

And I still don't know if some additional checksum is anyway computed, could it be used to even automatically avoid data loss during "a" due to otherwise "too simple" deduplication algorithm -- to automatically store what would be otherwise "lost".

@fcorbelli
Copy link
Owner

And I still don't know if some additional checksum is anyway computed, could it be used to even automatically avoid data loss during "a" due to otherwise "too simple" deduplication algorithm -- to automatically store what would be otherwise "lost".

The archive format does not support anything different.
Standard zpaq does not store any kind of checksum (no CRC-32) at file level.
It is not possible to store more data for single fragments

@ghost
Copy link
Author

ghost commented Oct 2, 2023

Standard zpaq does not store any kind of checksum (no CRC-32) at file level.
It is not possible to store more data for single fragments

That is about what is stored in the archive. But your program calculates more by default, and maybe that information could be used to force storing "colliding" fragments in the archive, even if the original zpaq would not store them, and still keep the archive readable by zpaq. E.g. if zpaq identifies the stored as the numbers (that's how zpaq reports it) then why not storing messageA as fragment 1, messageB as fragment 2 in that example, if it could be known that messageA and messageB are not the same?

@fcorbelli
Copy link
Owner

Standard zpaq does not store any kind of checksum (no CRC-32) at file level.
It is not possible to store more data for single fragments

That is about what is stored in the archive. But your program calculates more by default, and maybe that information could be used to force storing "colliding" fragments in the archive, even if the original zpaq would not store them, and still keep the archive readable by zpaq. E.g. if zpaq identifies the stored as the numbers (that's how zpaq reports it) then why not storing messageA as fragment 1, messageB as fragment 2 in that example, if it could be known that messageA and messageB are not the same?

zpaq does not identifies files by numbers, but by full filename
<"messageA",[list of fragments]>
<"messageB",[list of fragments]>

@fcorbelli
Copy link
Owner

Please check the attached pre-release
58_11a.zip

@ghost
Copy link
Author

ghost commented Oct 2, 2023

Can you please explain what can be seen in that 58_11a.zip? I some fail to see anything on my example of these 3 files.

@fcorbelli
Copy link
Owner

If you do anything, for example a l (list), you should see a collision detected

@fcorbelli
Copy link
Owner

fcorbelli commented Oct 2, 2023

Z:\pippo>zpaqfranz a test1.zpaq message*
zpaqfranz v58.11a-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-02)
franz:-hw
Creating test1.zpaq at offset 0 + 0
Add 2023-10-02 15:43:25         3              1.920 (   1.88 KB) 32T (0 dirs)
3 +added, 0 -removed.

0 + (1.920 -> 640 -> 1.811) = 1.811 @ 117.19 KB/s

0.016 seconds (000:00:00) (all OK)

Z:\pippo>zpaqfranz l test1.zpaq
zpaqfranz v58.11a-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-02)
franz:-hw
test1.zpaq:
1 versions, 3 files, 1.811 bytes (1.77 KB)

#####################################################################################################
87487: **** SUSPECTED SHA-1 COLLISION ****
File   <<messageC>>
map to <<messageA>>
#####################################################################################################



- 2023-10-02 00:53:08                 640 A     messageA
- 2023-10-02 00:53:11                 640 A     messageB
- 2023-10-02 00:53:08                 640 A     messageC

                1.920 (1.88 KB) of 1.920 (1.88 KB) in 3 files shown
                1.811 compressed

0.016 seconds (000:00:00) (all OK)

Z:\pippo>

@fcorbelli
Copy link
Owner

Z:\pippo>zpaqfranz t test1.zpaq
zpaqfranz v58.11a-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-02)
franz:-hw
test1.zpaq:
1 versions, 3 files, 1.811 bytes (1.77 KB)

#####################################################################################################
87487: **** SUSPECTED SHA-1 COLLISION ****
File   <<messageC>>
map to <<messageA>>
#####################################################################################################

To be checked 1.920 in 3 files (32 threads)
7.15 stage time       0.01 no error detected (RAM ~34.07 MB), try CRC-32 (if any)
Checking                 3 blocks with CRC-32 (1.920 not-0 bytes)
ERROR:  STORED CRC-32 92433266 != DECOMPRESSED 072E2B0E (ck 00000001) messageB

CRC-32 time           0.00s
Blocks               1.920 (           3)
Zeros                    0 (           0) 0.000000 s
Total                1.920 speed 1.920.000/sec (1.83 MB/s)
ERRORS          : 00000001 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
-----------------------------------------------------------------------------------------------------
GOOD            : 00000002 of 00000003 (stored=decompressed)
WITH ERRORS

0.063 seconds (000:00:00) (with warnings)

@fcorbelli
Copy link
Owner

OK, my fault, I am running with 3 different files 😄

@ghost
Copy link
Author

ghost commented Oct 4, 2023

OK. After I've checked the zpaq format description, to avoid saying something inconsistent, I hope I'm summarizing the whole discussion correctly then with:

For the collision detection only the crc32 hashes of the whole files are compared, and it is impossible to do more than only detect the problems on the level of whole files. The problem with the zpaq format is that one would need to store e.g. crc32 hashes for every fragment and not only once per file, to be able to reasonably efficiently detect the problem on the level of every fragment and then force storing the fragment which in the original zpaq would surely not be stored because of the sha1 collision. That additional info has to be stored so that it would not make problems to the original zpaq, and it doesn't seem there's "right" space for that in the format. So what you implemented (the detection on the level of the whole files), unless some solution which we aren't aware of exists, is the "best it can be done" as long as the format should remain compatible, which is anyway the goal of zpaqfranz (otherwise it should not have the name starting with zpaq).

So only detection is realistic, only on the level of the whole files, only if the additional checksums exist (which are, at least, default when zpaqfranz produces the archives).

I hope I haven't missed anything. Thanks.

@fcorbelli
Copy link
Owner

The reconstruction is quite accurate.
I will add a couple of details
It is possible to create dummy data fragments, that is, linked to dummy files
By dummy file I mean a file that is related to a Windows ADS, what I call in zpaqfranz a "virtual file"
Translation
zpaq, by default, does not show ADS files
I use this behavior to store, within "dummy" files, additional information, which the zpaq format cannot maintain (e.g., a list of files)


So far so good BUT
to maintain backward compatibility, it is NOT enough to invent a method for storing data that zpaq does not "see," but rather (in this case) you need one that zpaq DOES see.
That is, a file stored according to classical rules, i.e., those of zpaq, not the virtual files of zpaqfranz, that can later be extracted by zpaq
Otherwise, to go concrete, only zpaqfranz would be able to extract a SHA-1 collision file (assuming I develop the whole system to have it stored "covertly", an hard issue indeed)

Does it sound complex? Yes, it is, very much so,otherwise I would have already been doing it for years 😄

@ghost
Copy link
Author

ghost commented Oct 5, 2023

to go concrete, only zpaqfranz would be able to extract a SHA-1 collision file

If I'd want to solve a problem manually: if I'd want to store the file in any system which uses sha1, and if I knew that that file would collide, I'd just append in front of the file's content 32 bytes. 8 bytes would be a string "sha1coll\0" 16 bytes would be a fixed uuid which I would generate only once for eternity, and the last 8 bytes would be a current timestamp. Then the content of the file would follow. As soon as I'd store such modified file, it would have a different sha1, it would be surely stored and expanded back. After the extraction, I could easily recognize that it has these 32-bytes extra and remove it. I wouldn't even bother to implement automatic removal on extraction, as the whole scenario would anyway happen extremely rarely, and I would know that the archive still contains all the original bytes (but prefixed with 32 bytes more). In that way, storing the content of the file, to prevent complete loss of the data, is not too big problem. I would also not care that the zpaq would extract 32 bytes in front of the content of the file. I could always strip them if I need the original.

And I would definitely not want the action of storage of such modified file to be a new "version" of the archive made without my control. Imagine if the archive is already big -- one more version could be too much. So thinking about the subject more, leaving the handling of such a file independently of the archiving program is from my perspective "good enough", it's only the detection of the data loss that is nice to have. And once detected, manual intervention on the file, as described, or any other way, should also be good enough because we anyway don't expect the collisions to happen by chance, only as a product of human intervention.

I now think the detection is good enough. Thanks.

@fcorbelli
Copy link
Owner

If I'd want to solve a problem manually: if I'd want to store the file in any system which uses sha1, and if I knew that that file would collide, I'd just append in front of the file's content 32 bytes. 8 bytes would be a string "sha1coll\0" 16 bytes would be a fixed uuid which I would generate only once for eternity, and the last 8 bytes would be a current timestamp.

Well, no 😄
It's much more complex than that
The zpaq deduplicator does not operate on files, but rather on parts of files, on fragments, by means of an appropriate function that (I'll spare you the explanation) "breaks" individual files into pieces. So your method would not work.
You have to operate with much more "brutal" systems, such as applying an XOR on each byte of the source file [to be sure not collide anymore], than XOR back in extraction (but we cannot, zpaq will not XOR-back the file)

Then the content of the file would follow. As soon as I'd store such modified file, it would have a different sha1, it would be surely stored and expanded back. After the extraction, I could easily recognize that it has these 32-bytes extra and remove it.

Well no
No, because standard zpaq don't know something about this new "tricky file"
The user will get two different file, say messageB (wrong) and fixed_messageB after extraction

I wouldn't even bother to implement automatic removal on extraction, as the whole scenario would anyway happen extremely rarely, and I would know that the archive still contains all the original bytes (but prefixed with 32 bytes more). In that way, storing the content of the file, to prevent complete loss of the data, is not too big problem. I would also not care that the zpaq would extract 32 bytes in front of the content of the file. I could always strip them if I need the original.

My method, the one I described above, is the "right" version of yours.
With additional transactions you can mark as deleted (the wrong file, in n+1 version) and therefore the "right" file (in n+2 version) will be extracted from zpaq, without having to change its name
At least, in theory 😃

And I would definitely not want the action of storage of such modified file to be a new "version" of the archive made without my control. Imagine if the archive is already big -- one more version could be too much.

No, at all.
Because a version is something very common with zpaq. In fact, it is THE reason to use it. The order of magnitude is a few KB more. It's very common to get 1.000 or 5.000 versions inside a zpaq archive

So thinking about the subject more, leaving the handling of such a file independently of the archiving program is from my perspective "good enough", it's only the detection of the data loss that is nice to have. And once detected, manual intervention on the file, as described, or any other way, should also be good enough because we anyway don't expect the collisions to happen by chance, only as a product of human intervention.

I now think the detection is good enough. Thanks.

I'm glad I've explained the "why" of the choices behind it
Just the method of detecting SHA-1 alone is at least complex and took me a lot of work.
Finally, there is THE PROBLEM, namely the performance
Reading the list of files in a zpaq archive (the whole list, i.e., with the -all switch) can be long, very long. Taking even minutes, for very large files.
So checking that the foo.txt file, from 3 years ago, had the same SHA-1 fragments as the pluto.txt file, from today, is not at all easy, if you want to get it in seconds.
In the first implementation of the detector, in fact, it operated only on the latest version, and not on all
The collision command, on the other hand, reads everything, but clearly slower

@fcorbelli
Copy link
Owner

Short version: try this one

zpaqfranz a test.zpaq messageA messageB

(for now manually get messageB is wrong)
Re-add to the archive, ordered (-stdout), imposed (-force), not fragmented (-nodedup)

zpaqfranz a test.zpaq messageB -stdout -force -nodedup

Now extract messageB to messageB_ok (zpaqfranz) and _715 (standard zpaq) and check all SHA-256

zpaqfranz x test.zpaq messageB -to messageB_ok -space
zpaq64 x test.zpaq messageB -to messageB_715
zpaqfranz sum message* -sha256

I think I will try to implement over the weekend
However, there remains the problem, which I don't know how to overcome, of the at least theoretical need to re-read all dt

@ghost
Copy link
Author

ghost commented Oct 5, 2023

So your method would not work.

You are right, it would not if an "attacker" would produce new collision specially to confuse zpaq in a way to make sure that even once the file has these 32 bytes added the zpaq produces the same later fragments and the collision happens only there. It seems to me that then planing the "default protection" from such an attacker is waste of time then, and is one more argument to just keep detection, but not do any automatic store.

@fcorbelli
Copy link
Owner

Please try my previous post

@ruptotus
Copy link

ruptotus commented Oct 5, 2023

Hmm... is Your discussion means that I should stop using zpaq to backup my data? Or collisions are really rare in real world?

@fcorbelli
Copy link
Owner

Hmm... is Your discussion means that I should stop using zpaq to backup my data? Or collisions are really rare in real world?

They are much more than rare, for normal files
They are purpose-built.
Recall that zpaqfranz, with the t (test) command, already performs the SHA-1 collision test
So, if you want to be on the safe side, you can just do it

@ghost
Copy link
Author

ghost commented Oct 5, 2023

So I tried a simpler example than yours and I see that

 zpaqfranz a test baad/messageA baad/messageB  -nodedup

packed both files, in a way that now

 del /q baad\*
 zpaq x test

Unpacks the correct messageA and messageB. So in this case your -nodedup was enough, and it was compatible with zpaq? Can you explain how?

@fcorbelli
Copy link
Owner

Because you really want the deduplicator on
And you will not know that messageA and messageB will collide, until you do a full run (add)

Please check the attached (very rough) pre-release
58_11p.zip

zpaqfranz a test.zpaq messageA messageB
zpaqfranz x test.zpaq -to good\
zpaq64 x test.zpaq -to goodtoo\

Versus

zpaq64 a wrong.zpaq messageA messageB
zpaq64 x wrong.zpaq -to notgood\

@fcorbelli
Copy link
Owner

In you want other test files...
collisions.zip

@ghost
Copy link
Author

ghost commented Oct 5, 2023

I think that as long as you don't do collision detection on the level of a single fragment but on the level of the file, it would be still possible to confuse your detection that you do now (over the whole file) with these "attacks through the fragments" that you also mentioned? Imagine:

fileX content:   longXonlyFragment - ThePointWhereFragmentsAreSplit - theCollisionFragmentA
fileY content:   longYonlyFragment - ThePointWhereFragmentsAreSplit - theCollisionFragmentB

Now both the files have from the start all file level checksums different anyway. From the file level comparisons nothing suspicions can be detected. The collision and the data loss happens only on the fragment level. That's why I don't think it has sense to plan automatic storage as the response to the collisions when anyway not all collisions of all the fragments are detected.

@fcorbelli
Copy link
Owner

I think that as long as you don't do collision detection on the level of a single fragment but on the level of the file, it would be still possible to confuse your detection that you do now (over the whole file) with these "attacks through the fragments" that you also mentioned? Imagine:

fileX content:   longXonlyFragment - ThePointWhereFragmentsAreSplit - theCollisionFragmentA
fileY content:   longYonlyFragment - ThePointWhereFragmentsAreSplit - theCollisionFragmentB

Now both the files have from the start all file level checksums different anyway. From the file level comparisons nothing suspicions can be detected. The collision and the data loss happens only on the fragment level. That's why I don't think it has sense to plan automatic storage as the response to the collisions when anyway not all collisions of all the fragments are detected.

I'm not very worried about this
If I have a SUSPECTED SHA-1 collision, via CRC-32, I store the file again, in one piece
If the file is very large, I will have a waste of space.
Since I actually expect collisions to exist only on specially prepared files, I don't think it will happen in practice
Therefore the resulting archive will be larger than it would be if there was collision management at fragment level, but certainly every type of "trick" will be bypassed

The problem would arise in the case of a double collision: same SHA1 on the fragments, same CRC-32 in the entire file
In this case I could evolve with the check of the true hash, i.e. the XXHASH (or SHA-256 or whatever you want)
But it really seems like an atomic bomb to kill a mosquito

@ghost
Copy link
Author

ghost commented Oct 5, 2023

If I have a SUSPECTED SHA-1 collision, via CRC-32, I store the file again, in one piece

My argument is that as soon as you assume that the attacker is attacking zpaq or zpaqfranz as a program you can't then argue that your file-level check will work. You have to do a cryptographic correct solution simply because of your attack model.

On another side, the checks you already implemented already reduce the chance of accidental data loss (that's a different attack model: of the kind "I have everything in the archive -- ups I don't and there was no message about that") for which we know, by the very presence of these already constructed files, that is clearly possible.

@fcorbelli
Copy link
Owner

zpaqfranz is not some kind of security software
"Attacking" zpaqfranz... why?

It is a program that is used locally, like 7z
What's the point of corrupting an archive?
Could I upload multiple files with SHA-1 collisions to a server that uses zpaq to backup?
Yes, and then?
Nothing special happens
The backup is done, AT MAX the "rogue files" cannot be restored. The "regular one" will not get "infected" (better, affected)
Cannot stop zpaq's restore via SHA-1 collisions
That's all

I am therefore more concerned about a possible "real" collision, i.e. relating to real world data, rather than an attack.
This is why I have implemented SHA-1 DETECTION for a long time
Do not really care on "attacks"

As mentioned it is not possible to correct the collision issue, except by losing backwards compatibility
If this is tolerated, as I demonstrated above with EXP01 I can do it in a few minutes
But it's not worth it

@ghost
Copy link
Author

ghost commented Oct 5, 2023

I think we can agree: my guess is that the file level checks are enough for to allow one to not lose the file even if it's intentionally constructed to have the same sha1. We have already such files.

My argument was that as soon one brings to the discussion "second or later fragment" which would by the zpaq rules of fragments have the same sha1, then it's about a different goal, is not about handling what we already have -- files with that property (in collisions.zip) -- but about an imagined special attack that depends on the zpaq rule of fragments specifically. And based on your description of how you do the checks, such an attack would still succeed. But I also say it's not worth spending time on it, like you say: ""Attacking" zpaqfranz... why?"

I also don't think it should be about more than files "only" constructed to have same sha1. For such scenarios, my "fix" with a "prefix" would do solve the storage. And I also suggested that any attempt to automatically (without user doing anything) store the second file, in that context, is probably also already too much.

I think we are almost agreeing.

@fcorbelli
Copy link
Owner

58_11q.zip
The attached pre-release get

  • new command collision (aka: zpaqfranz collision 1.zpaq)
  • new switch -collision for t (aka: zpaqfranz t 1.zpaq -collision zpaqfranz t 1.zpaq -collision -all)
  • new switch -collision for a (zpaqfranz a 1.zpaq messagea messageb -collision)
  • new switch -collision for l (zpaqfranz l z:\1.zpaq -collision zpaqfranz l z:\1.zpaq -collision -all)

the -all will extend detection to all versions (slower, but sure)

Adding -collision to the add command allows files to be extracted, correctly, even with zpaq and not just zpaqfranz (at least in theory, I haven't done very extensive testing)

@ghost
Copy link
Author

ghost commented Oct 8, 2023

new command collision (aka: zpaqfranz collision 1.zpaq)

I assume the new command just tests for collisions without having to list all the files? Is it equivalent to l -collision -all ?

Is collision test also off by default in "t"? Ithought it's "cheap enough" compared to all other "test" operations to be always on there (and doing "-all")?

Thanks.

@fcorbelli
Copy link
Owner

  1. yes. It is just a bit faster (not very much)
  2. yes, because it is already done. It is a duplicate

@ghost ghost closed this as completed Oct 9, 2023
@tansy
Copy link
Contributor

tansy commented Jan 4, 2024

If it's almost broken, Wouldn't be better to use sha2 (256)? It's more unique as it's longer and potentially faster as Intel, and AMD, included it in the hardware, and potentially other processor platforms.
Hash itself is bigger, true, but test would be equivalent to 4 64-bit integers, instead of 3 in sha1, and even hashing isn't that slow as the compression itself. From my test it's as fast as gzip decompression.

I know it would be new format but maybe that's the way to go - to make a new format - better designed and possibly simpler where it can be simpler, and with better features and so on.

@fcorbelli
Copy link
Owner

zpaqfranz already use HW accelerated SHA-256 (if any)
The hash is longer, therefore torning to pieces all zpaq's file format
So no, I'll keep SHA-1 with CRC-32 detection, more then enough

@tansy
Copy link
Contributor

tansy commented Jan 4, 2024

So you use two different (sha1 and sha2) hashes?

@fcorbelli
Copy link
Owner

For data deduplication SHA1 is always used
For SHA-1 collision CRC-32 is used too (on by default in zpaqfranz)
You can store the full HW-accelerated (if any) full-file SHA256 (or even SHA3)

zpaqfranz a z:\pippero *.exe -sha256
zpaqfranz l z:\pippero -checksum

but this will not "save" against SHA-1 collision
In the next release I am implementing the ability to store arbitrary data, so essentially also implement a global collision checking system. But, frankly, it seems unnecessary to me, and so I think I will abandon it. Too much effort, too slow, too much complexity for a problem that I don't have

@fcorbelli fcorbelli reopened this Jan 4, 2024
@tansy
Copy link
Contributor

tansy commented Jan 4, 2024

Well, 160 bits of sha1 plus 32 bits of crc32 is 192 bits.
Using sha256 can only make sense if it replaces sha1 as it's not faster (even accelerated is just as fast) and formally* bigger than sha1.
It could even be better and faster as it would eliminate additional crc32 computation.

  • formally, as 160 bits is 3-64 it integers, plus 32 bits of crc32 gives 4 comparisons, while sha256 is 4 64-bit integers, which also gives 4 comparisons.

@fcorbelli
Copy link
Owner

It is impossible, because the 20-bytes-long SHA-1 are stored inside a specific block type (the h) for every fragment, CRC-32 and XXHASH/SHA-256/BLAKE3 or whatever are stored together with the file name (that's full-file) inside i-block type

You can disable CRC-32 computation (and hashing too) with the -nochecksum switch (or -715). The speed difference is almost zero, and no SHA-1 detect-collision is possible

The real limitation is the monothreaded deduplicator: a possible multithreaded development would make it faster, but less efficient

You can see here how hard is "hack" the file format https://encode.su/threads/4178-hacking-zpaq-file-format-(!)
And here 20-bytes SHA-256 https://encode.su/threads/3605-Are-the-first-20-bytes-of-a-SHA256-safer-cryptographically-than-SHA1

@ghost
Copy link
Author

ghost commented Jan 30, 2024

a problem with holes, aka sequence of zeros not stored by zpaq, that rely on fseek to zero-fill output file

I've just recently seen why the mentioned feature exists, extracting the attached zpaq on Windows NTFS and seeing in Explorer the "Properties / Size on disk" of the resulting file. Really nice to know.

the mentioned zpaq file is inside of this zip here, it seems that github rejects "unsupported" formats

@fcorbelli
Copy link
Owner

There are other "hidden gems" as well.
For example, files that are difficult to compress (* according to the estimator implemented in zpaq) are not compressed at all, for the first 4 methods. For method 5, however, it still attempts to compress them
This makes it possible to raise the compression level even for a mix of already compressed/uncompressed files (ex. JPGs mixed with text source code) etc.
There is a (initial) study in zpaqfranz on grouping by type with different compression levels. But it is really complex, as there is a block "chopper" => suspended for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants