A test with the files which actually collide #82
This is expected behavior: zpaq simply doesn't detect anything, and it doesn't provide any warnings.
SHA-1 collision
Then
Finally
As you can see, zpaq happily extracts the data, without any warning that a collision has occurred.
Now let's see zpaqfranz's default behavior
This time, if you test the file, you'll get
If you put a "-verify" in add
Short version:
When I pack 3 files (messageA, messageB and some-big.mp4), then to find the collision by invoking "t", the time needed will be proportional to the total number of bytes in the archive:
But if I could just get a list of the filenames, the stored CRC-32 values and the SHA-1 values of each file, I could detect the collisions in the above example in a few milliseconds, i.e. not bound by the number of bytes stored. It's just a comparison of CRC-32 values for files with the same SHA-1.
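The metadata-only comparison proposed here can be sketched as follows (a toy model with a hypothetical `find_collisions` helper; the entries are assumed to come from an archive listing that exposes per-file SHA-1 and CRC-32, which is exactly what is being asked for):

```python
def find_collisions(entries):
    """entries: list of (filename, sha1_hex, crc32) tuples, e.g. as read
    from an archive listing. Returns groups of files whose stored SHA-1
    matches but whose CRC-32 differs, i.e. SHA-1 collisions."""
    by_sha1 = {}
    for name, sha1, crc in entries:
        by_sha1.setdefault(sha1, []).append((name, crc))
    collisions = []
    for sha1, group in by_sha1.items():
        if len({crc for _, crc in group}) > 1:
            collisions.append((sha1, group))
    return collisions

# Simulated listing: messageA and messageB share a SHA-1 but not a CRC-32.
listing = [
    ("messageA", "deadbeef" * 5, 0x11111111),
    ("messageB", "deadbeef" * 5, 0x22222222),
    ("some-big.mp4", "cafebabe" * 5, 0x33333333),
]
print(find_collisions(listing))
```

The cost is proportional to the number of files, not the number of stored bytes, which is the point of the argument.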
with the -checksum switch
This requires computing and storing the SHA-1 of the entire file (i.e. overhead)
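For reference, a full-file SHA-1 is one extra streaming pass over each input file; a minimal sketch of what that overhead amounts to (the helper name is hypothetical):

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """One full read of the file: this extra pass is the overhead that
    a full-file checksum switch pays for."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```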
For my example I see:
That is, I see different XXHASH64 values anyway, so the CRC-32 appears completely unnecessary? Then, if I use the original zpaq (I can't get the same result with yours), I can get the list of files that are "the same", even though I now know they have different XXHASH64 values:
Which means that I can recognize the collisions without decompressing all the files, without even knowing the CRC-32, and all that based on information already stored in the existing archive packed by zpaqfranz. What do you think about all that?
You missed the -sha1 switch (creating the archive)
You can't at all with zpaq. With zpaqfranz you can (manually), if you use the -sha1 switch (when adding files). I can make a new (simple) switch that
Or a new (more work) switch that
I have just given you an example where, without that -sha1, all the info necessary to report the collision is already stored in the archive! Specifically, it is known that both files share the same fragment (fragment 1):
which is what zpaq is able to list with
as I've already shown. Please try it yourself.
No, because different files can share the same fragment
Nothing "strange" here
If two files share all their fragments but have different XXHASH64 values, they were deduplicated due to the same SHA-1 even though they shouldn't have been (the differing fragments should have been stored). I assumed that is exactly what one would like to know?
I could extract the list of fragments of each file, and check whether files with the same list have different CRC-32 values.
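That check (same fragment list, different CRC-32) could be sketched like this; the file-to-fragment mapping is assumed to be available from the archive's metadata, and the function name is invented for illustration:

```python
def fragment_level_collisions(files):
    """files: dict mapping filename -> (fragment_id_tuple, crc32).
    Files that reference the identical fragment sequence must be
    byte-identical; a differing CRC-32 within such a group means a
    fragment was deduplicated against a SHA-1-colliding twin."""
    by_fragments = {}
    for name, (frags, crc) in files.items():
        by_fragments.setdefault(frags, []).append((name, crc))
    return [group for group in by_fragments.values()
            if len({crc for _, crc in group}) > 1]

files = {
    "messageA": ((1,), 0xAAAAAAAA),
    "messageB": ((1,), 0xBBBBBBBB),    # same fragment 1, different CRC-32
    "some-big.mp4": ((2, 3, 4), 0xCCCCCCCC),
}
```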
Why would anybody want not to be notified of data loss produced by the execution of "a"? If it were immediately reported, one would be able to immediately take action to preserve the data that could otherwise be lost, instead of waiting for "t", which can take time. I am not suggesting stopping the process, but precisely reporting the problem of "a", which could be known at the very time "a" is performed.
Because it's veeeery uncommon
zpaqfranz can already do that, but with more overhead. I need to do some tests on fragment packing times, for a quantitative evaluation
Again, I still miss this: how can one recognize the error of "a" without running "t" in zpaqfranz?
I'll try to explain better
With the -verify switch
Ah, thanks. I still don't understand the cause of the performance hit you claim in:
I still believe the overhead of verifying whether there was data loss during "a" could be practically unnoticeable compared to the current cost of "t". Without the list of fragments, decompression is impossible, and that list could be generated, if it doesn't already exist, much faster than by decompressing all the files. And if the "list" of fragments in the archive is already stored as a tree, then checking whether two files are stored as the same "list" of fragments is equivalent to checking whether the two files have the identical root node?
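The "compare the root node" idea amounts to interning fragment lists, so that "same sequence of fragments" becomes a constant-time identity check; a toy sketch:

```python
_interned = {}

def intern_fragments(frags):
    """Return one canonical object per distinct fragment sequence, so
    'same list of fragments' becomes a single reference comparison
    instead of an element-by-element one."""
    frags = tuple(frags)
    return _interned.setdefault(frags, frags)

a = intern_fragments([1, 5, 9])
b = intern_fragments([1, 5, 9])
c = intern_fragments([1, 5, 10])
print(a is b, a is c)  # True False
```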
I fully agree with you that "a corruption of a block, which you cannot check from the fragment list, is much more frequent (and dangerous) than a SHA-1 collision".
Right now, just for fun, I'm writing a hash "packager" for the fragment vectors. ...work in progress...
IMO it should not be about the "SHA-1 collision" as such, but about the "data loss" caused by the deduplication algorithm used in "a", even if it were something else; and I think such "data loss" can be detected as long as the algorithm used for deduplication is not the same as some checksum that is also available in the process. Regarding the file names, I would expect they are also deduplicated in RAM in some way, with only references to their "roots" used in the code?
And I still don't know: if some additional checksum is computed anyway, could it be used to automatically avoid data loss during "a" caused by an otherwise "too simple" deduplication algorithm, i.e. to automatically store what would otherwise be "lost"?
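The idea could look roughly like this: key the deduplicator on SHA-1 plus CRC-32 instead of SHA-1 alone, so a colliding fragment is stored rather than silently dropped. A toy sketch only, not zpaq's actual data structures, and independent of whether the archive format could store it compatibly:

```python
import hashlib
import zlib

class DedupStore:
    """Toy deduplicator: fragments are keyed by (SHA-1, CRC-32), so two
    different fragments whose SHA-1 collides get different keys and are
    both stored, instead of the second being silently dropped."""
    def __init__(self):
        self.fragments = {}

    def add(self, data):
        key = (hashlib.sha1(data).digest(), zlib.crc32(data))
        self.fragments.setdefault(key, data)
        return key

    def get(self, key):
        return self.fragments[key]

store = DedupStore()
k1 = store.add(b"fragment one")
k2 = store.add(b"fragment two")
```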
The archive format does not support anything different. |
That is about what is stored in the archive. But your program calculates more by default, and maybe that information could be used to force storing "colliding" fragments in the archive, even if the original zpaq would not store them, while still keeping the archive readable by zpaq. E.g. if zpaq identifies the stored fragments by numbers (that's how zpaq reports it), then why not store messageA as fragment 1 and messageB as fragment 2 in that example, if it could be known that messageA and messageB are not the same?
zpaq does not identify files by numbers, but by full filename
Please check the attached pre-release |
Can you please explain what can be seen in that 58_11a.zip? I somehow fail to see anything with my example of these 3 files.
If you do anything, for example an l (list), you should see a collision detected
OK, my fault, I was running it with 3 different files 😄
OK. After checking the zpaq format description, to avoid saying something inconsistent, I hope I'm summarizing the whole discussion correctly:

For collision detection, only the CRC-32 hashes of the whole files are compared, and it is impossible to do more than detect problems at the level of whole files. The problem with the zpaq format is that one would need to store e.g. CRC-32 hashes for every fragment, not just once per file, to be able to reasonably efficiently detect the problem at the level of individual fragments and then force storing a fragment that the original zpaq would not store because of the SHA-1 collision. That additional info would have to be stored so that it causes no problems for the original zpaq, and there doesn't seem to be a "right" place for it in the format.

So what you implemented (detection at the level of whole files), unless some solution we aren't aware of exists, is the best that can be done as long as the format should remain compatible, which is anyway the goal of zpaqfranz (otherwise its name should not start with zpaq). So only detection is realistic, only at the level of whole files, and only if the additional checksums exist (which are, at least, the default when zpaqfranz produces the archives). I hope I haven't missed anything. Thanks.
The reconstruction is quite accurate. So far so good. BUT... Does it sound complex? Yes it is, very much so, otherwise I would have already done it years ago 😄
If I wanted to solve the problem manually: if I wanted to store the file in any system that uses SHA-1, and I knew that the file would collide, I'd just prepend 32 bytes to the file's content: 8 bytes would be the string "sha1coll", 16 bytes would be a fixed UUID which I would generate only once for eternity, and the last 8 bytes would be the current timestamp. Then the content of the file would follow. As soon as I stored such a modified file, it would have a different SHA-1, so it would surely be stored and extracted back. After extraction, I could easily recognize that it has these 32 extra bytes and remove them.

I wouldn't even bother to implement automatic removal on extraction, as the whole scenario would happen extremely rarely anyway, and I would know that the archive still contains all the original bytes (just prefixed with 32 more). In that way, storing the content of the file, to prevent complete loss of the data, is not too big a problem. I would also not care that zpaq would extract the 32 extra bytes in front of the file's content; I could always strip them if I need the original. And I would definitely not want the storage of such a modified file to become a new "version" of the archive made without my control. Imagine if the archive is already big: one more version could be too much.

So, thinking about the subject more, leaving the handling of such a file independent of the archiving program is, from my perspective, "good enough"; it's only the detection of the data loss that is nice to have. And once detected, manual intervention on the file, as described or in any other way, should also be good enough, because we don't expect the collisions to happen by chance anyway, only as a product of human intervention. I now think the detection is good enough. Thanks.
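The 32-byte prefix workaround described above, sketched in Python (the fixed UUID here is a placeholder; the scheme calls for generating one once and reusing it forever; the 8-byte magic is written without a trailing NUL so that 8 + 16 + 8 adds up to exactly 32 bytes):

```python
import struct
import time
import uuid

MAGIC = b"sha1coll"  # exactly 8 bytes
# Placeholder: the scheme calls for a UUID generated once "for eternity".
SALT = uuid.UUID("00000000-0000-4000-8000-000000000000").bytes  # 16 bytes

def add_prefix(data):
    """Prepend the 32-byte header so the file's SHA-1 changes."""
    return MAGIC + SALT + struct.pack("<Q", int(time.time())) + data

def strip_prefix(data):
    """Recognize the header after extraction and remove it."""
    if data[:8] == MAGIC and data[8:24] == SALT:
        return data[32:]
    return data
```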
Well, no 😄
Well no
My method, the one I described above, is the "right" version of yours.
No, not at all.
I'm glad I've explained the "why" of the choices behind it
Short version: try this one
(for now, the manually extracted messageB is wrong)
Now extract messageB to messageB_ok (zpaqfranz) and _715 (standard zpaq) and check all SHA-256
I think I will try to implement it over the weekend
You are right; it would not, if an "attacker" produced a new collision specifically to confuse zpaq, making sure that even once the file has these 32 bytes added, zpaq produces the same later fragments and the collision happens only there. It seems to me that planning "default protection" against such an attacker is then a waste of time, and is one more argument to just keep detection, but not do any automatic store.
Please try my previous post |
Hmm... does your discussion mean that I should stop using zpaq to back up my data? Or are collisions really rare in the real world?
They are much more than rare, for normal files |
So I tried a simpler example than yours and I see that
packed both files, in a way that now
unpacks the correct messageA and messageB. So in this case your -nodedup was enough, and it was compatible with zpaq? Can you explain how?
Because you really want the deduplicator on. Please check the attached (very rough) pre-release
Versus
If you want other test files...
I think that as long as you do collision detection not at the level of a single fragment but at the level of the file, it would still be possible to confuse the detection that you do now (over the whole file) with these "attacks through the fragments" that you also mentioned? Imagine:
Now both files have all their file-level checksums different from the start anyway. From the file-level comparisons, nothing suspicious can be detected. The collision and the data loss happen only at the fragment level. That's why I don't think it makes sense to plan automatic storage as the response to collisions when not all collisions of all fragments are detected anyway.
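This scenario can be simulated with a deliberately broken hash standing in for a real SHA-1 collision, to show why file-level checksums alone can't catch a fragment-level collision:

```python
import zlib

# Stand-in for SHA-1 with a known collision: two distinct fragments
# map to the same digest.
def broken_hash(data):
    if data in (b"fragment-X", b"fragment-Y"):
        return "COLLIDING-DIGEST"
    return "H:" + data.hex()

store = {}

def dedup_add(fragments):
    ids = []
    for frag in fragments:
        d = broken_hash(frag)
        store.setdefault(d, frag)   # second colliding fragment is dropped
        ids.append(d)
    return ids

# Two files with different heads, so every whole-file checksum differs,
# but their second fragments collide under the (broken) dedup hash.
file1 = [b"unique-head-1", b"fragment-X"]
file2 = [b"unique-head-2", b"fragment-Y"]
ids1, ids2 = dedup_add(file1), dedup_add(file2)

rebuilt2 = b"".join(store[i] for i in ids2)
# File-level checksums of the originals differ, so whole-file
# comparison reports nothing suspicious...
assert zlib.crc32(b"".join(file1)) != zlib.crc32(b"".join(file2))
# ...yet file2 was silently corrupted by the fragment collision.
assert rebuilt2 != b"".join(file2)
```

Note also that the two files have different fragment-ID lists (the heads differ), so a same-fragment-list check would not flag them either, which is the point being argued.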
I'm not very worried about this. The problem would arise in the case of a double collision: same SHA-1 on the fragments, same CRC-32 on the entire file
My argument is that as soon as you assume that the attacker is attacking zpaq or zpaqfranz as a program, you can't then argue that your file-level check will work; you have to do a cryptographically correct solution, simply because of your attack model. On the other hand, the checks you already implemented reduce the chance of accidental data loss (that's a different attack model, of the kind "I have everything in the archive... oops, I don't, and there was no message about that"), which we know, by the very existence of these already constructed files, is clearly possible.
zpaqfranz is not some kind of security software. It is a program that is used locally, like 7z. I am therefore more concerned about a possible "real" collision, i.e. one relating to real-world data, rather than an attack. As mentioned, it is not possible to correct the collision issue except by losing backwards compatibility
I think we can agree: my guess is that the file-level checks are enough to allow one not to lose a file even if it's intentionally constructed to have the same SHA-1; we already have such files. My argument was that as soon as one brings a "second or later fragment" into the discussion, one which by zpaq's fragment rules would have the same SHA-1, then it's a different goal: not handling what we already have (files with that property, in collisions.zip), but an imagined special attack that depends specifically on zpaq's fragment rules. And based on your description of how you do the checks, such an attack would still succeed. But I also say it's not worth spending time on it, as you say: ""Attacking" zpaqfranz... why?" I also don't think it should be about more than files constructed "only" to have the same SHA-1. For such scenarios, my "fix" with a "prefix" would solve the storage. And I also suggested that any attempt to automatically (without the user doing anything) store the second file is, in that context, probably already too much. I think we are almost agreeing.
58_11q.zip
the -all switch will extend detection to all versions (slower, but sure). Adding -collision to the add command allows files to be extracted correctly even with zpaq, not just zpaqfranz (at least in theory; I haven't done very extensive testing)
I assume the new command just tests for collisions without having to list all the files? Is it equivalent to l -collision -all? Is the collision test also off by default in "t"? I thought it's "cheap enough" compared to all other "test" operations to always be on there (and do "-all")? Thanks.
If it's almost broken, wouldn't it be better to use SHA-2 (256)? It's more collision-resistant, as it's longer, and potentially faster, as Intel and AMD have included it in hardware, as have potentially other processor platforms. I know it would mean a new format, but maybe that's the way to go: make a new format, better designed and possibly simpler where it can be simpler, with better features and so on.
zpaqfranz already uses HW-accelerated SHA-256 (if any)
So you use two different (sha1 and sha2) hashes? |
For data deduplication SHA1 is always used
but this will not "save" you from a SHA-1 collision
Well, 160 bits of sha1 plus 32 bits of crc32 is 192 bits.
It is impossible, because the 20-byte-long SHA-1s are stored inside a specific block type (the h-block) for every fragment, while CRC-32 and XXHASH/SHA-256/BLAKE3 or whatever are stored together with the file name (that's full-file) inside the i-block type. You can disable CRC-32 computation (and hashing too) with the -nochecksum switch (or -715); the speed difference is almost zero, but then no SHA-1 collision detection is possible. The real limitation is the monothreaded deduplicator: a possible multithreaded development would make it faster, but less efficient. You can see here how hard it is to "hack" the file format: https://encode.su/threads/4178-hacking-zpaq-file-format-(!)
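A rough schematic of the layout described above (class and field names are invented for illustration; this is not the actual on-disk encoding, only the split of where each hash lives):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HBlock:
    """h-block: carries the 20-byte SHA-1 of every fragment."""
    fragment_sha1s: list  # list[bytes], each exactly 20 bytes

@dataclass
class IBlock:
    """i-block: per-file metadata, the name plus whole-file checksums."""
    filename: str
    crc32: Optional[int] = None         # absent with -nochecksum / -715
    extra_hash: Optional[bytes] = None  # XXHASH / SHA-256 / BLAKE3 / ...
```

This is why per-fragment collision detection can't be retrofitted: the only per-fragment hash the format carries is the SHA-1 itself, while every other checksum is per-file.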
I've just recently seen why the mentioned feature exists, after extracting the attached zpaq on Windows NTFS and seeing the "Properties / Size on disk" of the resulting file in Explorer. Really nice to know.
There are other "hidden gems" as well.
Attached is a zip file with two files which have the same sha1 but different sha256.
I'm trying to add the folder containing them to a zpaqfranz archive, and I always see only one of them stored under the two names, not both:
Obviously, only one content under two names is there.
I've read that the "additional checks" are on by default, so this is unexpected. Maybe a switch is needed:
collision-example.zip