Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent output with different nbWorkers values #31

Open
johnsanc314 opened this issue Feb 5, 2024 · 7 comments
Open

Inconsistent output with different nbWorkers values #31

johnsanc314 opened this issue Feb 5, 2024 · 7 comments

Comments

@johnsanc314
Copy link

I noticed that different nbWorkers parameter values can result in slightly different output for the compressed data.

I cannot find a pattern to when this happens, but when it does, it appears to be consistent for that number of nbWorkers for a particular input. It looks like an additional 3 bytes of "01 00 00" is added at the very end of the compressed stream. This is always the only difference. In my testing thus far, this extra 3 bytes appears to only occur with higher nbWorker values (22+), but some other higher values also omit these 3 bytes.

Any idea why this might be happening?

Example: 12 vs 32 nbWorkers for compressing an ISO file that is originally 3,469,070,336 bytes, packed into a zip archive:
image

@johnsanc314
Copy link
Author

johnsanc314 commented Feb 5, 2024

I just found this which is the exact same thing I observed:
klauspost/compress#277

This seems like a minor thing, but for my particular use case, consistently reproducible archives are important. This appears to be the only issue with consistency I have found. (The multi-threading works wonderful by the way!)

@oleg-st
Copy link
Owner

oleg-st commented Feb 5, 2024

Could you provide code and input samples?

@gjefferyes
Copy link

gjefferyes commented Feb 6, 2024

compressTest.txt

Attached is a sample program that demonstrates the original problem as described above.
It turn out that if you have multiple works, and then you call Flush(), before you Close() the compression stream, then sometimes these 3 extra bytes are added to the output.

Quick solution was to remove the Flush(), and just keep the Close(). (Flush is not needed anyway, as far as I can tell.)

   using (MemoryStream sOut = new MemoryStream())
   {
       Compressor c = new Compressor(1);
       c.SetParameter(ZSTD_cParameter.ZSTD_c_nbWorkers, numberOfWorkers);
       using (var zOut = new ZstdSharp.CompressionStream(sOut, c))
       {
           zOut.Write(mIn, 0, mIn.Length);

           // Remove this Flush and the problem goes away.
           zOut.Flush();
           zOut.Close();
       }
       byte[] outArray=sOut.ToArray();
}

So may well be a non-issue, just need to know not to call Flush before close.

(I can provide a link to the test file we are using, if you want to look at this any further.)

Thanks

@oleg-st
Copy link
Owner

oleg-st commented Feb 6, 2024

01 00 00 is an empty last block, if the compressor has no data at the end it outputs an empty last block, if there is data from which a block can be formed, it becomes the last block and the empty last block becomes unnecessary.
Flush forces a block to form, which increases the chance of being left with no data at the end.

Changed Flush so that it outputs all the data.

@oleg-st
Copy link
Owner

oleg-st commented Feb 9, 2024

@johnsanc314 @gjefferyes
Could you check it with master? (with Flush before Dispose)

@gjefferyes
Copy link

gjefferyes commented Mar 4, 2024

So I can confirm that with the latest changes if I call flush() before close() then I always get the 01 00 00 on the end of the stream.

If I call zOut.Flush() followed by zOut.Close() I always now get 01 00 00
If I just call zOut.Close() I get the exact same stream without the 01 00 00 on the end.

I guess my next question would be which is 'correct' should a zstd stream have the 01 00 00 on the end of it, or is this not needed?
I know the length of the file I am decompressing, so I don't run into any decompression issues with the 01 00 00 not being there, so I don't feel like it is needed that I actually call Flush at all?

(It would appear that the 7z zstd fork does not have 01 00 00 on the end of its streams)

@oleg-st
Copy link
Owner

oleg-st commented Mar 4, 2024

Both situations are correct, if at Close compressor has no data for the last block, then there will be an empty last block 01 00 00 (which says that compression stream is finished). To always have this situation you should make Flush before Close.
In .NET Streams, there is no way to specify that a block is last in the Write method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants