
Support extraction of 'large' (>= ~2 GiB) files #38

Merged Mar 25, 2023 (1 commit)

Conversation

mstock (Contributor) commented Mar 15, 2023

When extracting Tar archives with 'large' files (i.e. files larger than ~2 GiB), I noticed that this either didn't work (macOS) or resulted in corrupt/too-small files (Linux). While debugging, I noticed that there are limits to the amount of data that syswrite writes per call: on Linux, it's (2**31 - 4096) bytes, while on macOS, it appears to be (2**31 - 1) bytes. In addition, syswrite on Linux doesn't return an error in that case but just the amount of data actually written (which is well within the specified behavior of write(2)), while on macOS, it returns an error - which explains the behavior I observed. There's also an older bug report on rt.cpan.org which seems to describe the same issue on Windows.

This PR changes the code that writes extracted files to (1) write smaller chunks (at most 1 GiB) and (2) keep writing until all data has actually been written. This fixed the problem in my tests with the file where extraction failed before. I'm not sure whether this is the best solution, but at least on Linux the current implementation seems problematic, since it may produce incomplete files without any warning or error.
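For illustration, here's a minimal sketch of that approach - a chunked write loop that honours syswrite's return value. The write_all helper and its parameters are hypothetical, not the actual patch:

    use strict;
    use warnings;

    # Write $data to $fh in chunks of at most $chunk_size bytes,
    # retrying until everything has been written. syswrite may write
    # fewer bytes than requested, so the loop advances by its return
    # value instead of assuming a full write.
    sub write_all {
        my ($fh, $data, $chunk_size) = @_;
        $chunk_size ||= 2**30;    # 1 GiB, safely below the ~2 GiB limits
        my $offset = 0;
        my $total  = length $data;
        while ($offset < $total) {
            my $len = $total - $offset;
            $len = $chunk_size if $len > $chunk_size;
            my $written = syswrite($fh, $data, $len, $offset);
            return undef unless defined $written;    # a real I/O error
            $offset += $written;
        }
        return $total;
    }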

The behaviour of `syswrite` depends on the platform and seems to
change when writing about 2 GiB or more at once. On Linux, it will
write at most (2**31 - 4096) bytes [1,2] and, when passed more data,
will not return an error but just the amount of data that was
actually written - so the original implementation produced
incomplete/corrupt files when extracting files larger than
(2**31 - 4096) bytes. On macOS, the limit appears to be
(2**31 - 1) bytes; beyond that, an error is returned.

So in order to correctly extract files close to or larger than 2 GiB,
it's necessary to write less than about 2 GiB at once and to repeat
write operations until all data has actually been written.

[1] https://www.man7.org/linux/man-pages/man2/write.2.html#NOTES
[2] https://stackoverflow.com/questions/70368651/why-cant-linux-write-more-than-2147479552-bytes
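
The silent short write is easy to observe in isolation. A quick demonstration for Linux (assumes a 64-bit Perl and enough free memory to hold a 2 GiB string; the path is arbitrary):

    use strict;
    use warnings;

    open my $fh, '>', '/tmp/bigfile' or die "open: $!";
    my $buf = "\0" x (2**31);
    # On Linux, a single syswrite of 2**31 bytes writes only
    # (2**31 - 4096) of them and reports no error.
    my $written = syswrite($fh, $buf);
    printf "requested %d, wrote %d\n", length($buf), $written // -1;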
@bingos bingos merged commit 242a65d into jib:master Mar 25, 2023
mstock (Contributor, Author) commented Mar 27, 2023

Thanks for merging and releasing this! I did notice, though, that the tests I added seem to cause some CPAN Testers failures on 32-bit Perls, since the largest block sizes (2**31 and 2**32) from

    for my $block_size ((1, BLOCK, 1024 * 1024, 2**31 - 4096, 2**31, 2**32)) {
        local $Archive::Tar::EXTRACT_BLOCK_SIZE = $block_size;

seem to result in a wraparound and thus a negative length. So on 32-bit Perls, the maximum value that can be used in this test is 2**31 - 1. If you like, I can create a follow-up PR with such a fix.
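
For what it's worth, one way such a follow-up could look is to gate the large sizes on the native integer width - a sketch only, using $Config{ivsize} (the actual follow-up may differ):

    use Config;
    use Archive::Tar::Constant qw(BLOCK);

    # On 32-bit Perls (ivsize == 4), 2**31 and 2**32 don't fit in a
    # native signed integer and wrap around, so only test them on
    # 64-bit Perls.
    my @block_sizes = (1, BLOCK, 1024 * 1024, 2**31 - 4096);
    push @block_sizes, 2**31, 2**32 if $Config{ivsize} > 4;

    for my $block_size (@block_sizes) {
        local $Archive::Tar::EXTRACT_BLOCK_SIZE = $block_size;
        # ... run the extraction checks as in the existing test ...
    }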

bingos (Collaborator) commented Mar 27, 2023

That would be awesome, many thanks.

I had seen the test failures but lacked tuits to start to dig into what was going on.
