
Support extraction of 'large' (>= ~2 GiB) files #38

Merged Mar 25, 2023 (1 commit)

Conversation

mstock (Contributor) commented Mar 15, 2023

When extracting Tar archives with 'large' files (i.e. files larger than ~2 GiB), I noticed that this either didn't work (macOS) or resulted in corrupt/too-small files (Linux). While debugging, I noticed that there are limits to the amount of data that syswrite writes per call: on Linux, it's (2**31 - 4096) bytes, while on macOS, it appears to be (2**31 - 1) bytes. In addition, syswrite on Linux doesn't return an error in that case but just the amount of data actually written (which is well within the specified behavior of write(2)), while on macOS, it returns an error - which explains the behavior I observed. There's also an older bug report on rt.cpan.org which seems to describe the same issue on Windows.

This PR changes the code that writes extracted files to (1) write smaller chunks (at most 1 GiB) and (2) keep writing until all data has actually been written. This fixed the problem in my tests with the file where extraction failed before. I'm not sure whether this is the best solution, but at least on Linux the current implementation seems problematic, since it may produce incomplete files without any warning or error.
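For illustration, here's a minimal sketch of that approach - a chunked write loop that honours syswrite's return value. The write_all helper and its parameters are hypothetical, not the actual patch:

    use strict;
    use warnings;

    # Write $data to $fh in chunks of at most $chunk_size bytes,
    # retrying until everything has been written. syswrite may write
    # fewer bytes than requested, so the loop advances by its return
    # value instead of assuming a full write.
    sub write_all {
        my ($fh, $data, $chunk_size) = @_;
        $chunk_size ||= 2**30;    # 1 GiB, safely below the ~2 GiB limits
        my $offset = 0;
        my $total  = length $data;
        while ($offset < $total) {
            my $len = $total - $offset;
            $len = $chunk_size if $len > $chunk_size;
            my $written = syswrite($fh, $data, $len, $offset);
            return undef unless defined $written;    # a real I/O error
            $offset += $written;
        }
        return $total;
    }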

The behaviour of `syswrite` depends on the platform and seems to
change when writing about 2 GiB or more at once. On Linux, it will
write at most (2**31 - 4096) bytes [1,2] and, when passed more data,
will not return an error but just the amount of data that was
actually written - so the original implementation produced
incomplete/corrupt files when extracting files larger than
(2**31 - 4096) bytes. On macOS, the limit appears to be
(2**31 - 1) bytes; beyond that, an error is returned.

So in order to correctly extract files close to or larger than 2 GiB,
it's necessary to write less than about 2 GiB at once and to repeat
write operations until all data has actually been written.

[1] https://www.man7.org/linux/man-pages/man2/write.2.html#NOTES
[2] https://stackoverflow.com/questions/70368651/why-cant-linux-write-more-than-2147479552-bytes
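
The silent short write is easy to observe in isolation. A quick demonstration for Linux (assumes a 64-bit Perl and enough free memory to hold a 2 GiB string; the path is arbitrary):

    use strict;
    use warnings;

    open my $fh, '>', '/tmp/bigfile' or die "open: $!";
    my $buf = "\0" x (2**31);
    # On Linux, a single syswrite of 2**31 bytes writes only
    # (2**31 - 4096) of them and reports no error.
    my $written = syswrite($fh, $buf);
    printf "requested %d, wrote %d\n", length($buf), $written // -1;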
@bingos bingos merged commit 242a65d into jib:master Mar 25, 2023
mstock (Contributor, Author) commented Mar 27, 2023

Thanks for merging and releasing this! I did notice, though, that the tests I added seem to cause some CPAN Testers failures on 32-bit Perls, since the largest block sizes (2**31 and 2**32) from

    for my $block_size ((1, BLOCK, 1024 * 1024, 2**31 - 4096, 2**31, 2**32)) {
        local $Archive::Tar::EXTRACT_BLOCK_SIZE = $block_size;

seem to result in a wraparound and thus a negative length. So on 32-bit Perls, the maximum value that can be used in this test is 2**31 - 1. If you like, I can create a follow-up PR with such a fix.
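
For what it's worth, one way such a follow-up could look is to gate the large sizes on the native integer width - a sketch only, using $Config{ivsize} (the actual follow-up may differ):

    use Config;
    use Archive::Tar::Constant qw(BLOCK);

    # On 32-bit Perls (ivsize == 4), 2**31 and 2**32 don't fit in a
    # native signed integer and wrap around, so only test them on
    # 64-bit Perls.
    my @block_sizes = (1, BLOCK, 1024 * 1024, 2**31 - 4096);
    push @block_sizes, 2**31, 2**32 if $Config{ivsize} > 4;

    for my $block_size (@block_sizes) {
        local $Archive::Tar::EXTRACT_BLOCK_SIZE = $block_size;
        # ... run the extraction checks as in the existing test ...
    }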

bingos (Collaborator) commented Mar 27, 2023

That would be awesome, many thanks.

I had seen the test failures but lacked tuits to start to dig into what was going on.
