Support extraction of large files
The behaviour of `syswrite` depends on the platform and differs once a
single call gets close to writing 2 GiB or more. On Linux, it writes at
most (2**31 - 4096) bytes per call [1,2]; when more data is passed in,
it does not report an error but simply returns the number of bytes that
were actually written. The original implementation therefore produced
incomplete/corrupt files when extracting entries larger than
(2**31 - 4096) bytes. On macOS, the limit appears to be (2**31 - 1)
bytes; passing more than that makes the call fail with an error.

So in order to correctly extract files close to or larger than 2 GiB,
it is necessary to write less than about 2 GiB per call and to retry
the write until all data has actually been written.

[1] https://www.man7.org/linux/man-pages/man2/write.2.html#NOTES
[2] https://stackoverflow.com/questions/70368651/why-cant-linux-write-more-than-2147479552-bytes
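
For illustration only (this helper is not part of the commit; the name `write_all` and its arguments are made up for the example), the chunked, retrying write described above looks roughly like this in Perl:

    use strict;
    use warnings;

    # Write the whole buffer in chunks no larger than $block_size, retrying
    # until everything has been written. syswrite may perform a short write
    # (Linux) or fail outright (macOS) for very large single calls, so the
    # loop advances by however many bytes were actually written.
    sub write_all {
        my ($fh, $data_ref, $block_size) = @_;
        my $offset = 0;
        my $total  = length $$data_ref;
        while ($offset < $total) {
            my $written = syswrite $fh, $$data_ref, $block_size, $offset;
            return unless defined $written;    # a real I/O error, $! is set
            $offset += $written;               # short write: continue from here
        }
        return $offset;
    }

The change to `_extract_file` in the diff below follows the same pattern, using `$EXTRACT_BLOCK_SIZE` as the chunk size and reporting `$!` via `$self->_error` when a write fails.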
mstock authored and bingos committed Mar 25, 2023
1 parent 196b082 commit 242a65d
Showing 2 changed files with 50 additions and 4 deletions.
26 changes: 22 additions & 4 deletions lib/Archive/Tar.pm
@@ -24,6 +24,7 @@ use strict;
use vars qw[$DEBUG $error $VERSION $WARN $FOLLOW_SYMLINK $CHOWN $CHMOD
$DO_NOT_USE_PREFIX $HAS_PERLIO $HAS_IO_STRING $SAME_PERMISSIONS
$INSECURE_EXTRACT_MODE $ZERO_PAD_NUMBERS @ISA @EXPORT $RESOLVE_SYMLINK
$EXTRACT_BLOCK_SIZE
];

@ISA = qw[Exporter];
@@ -39,6 +40,7 @@ $DO_NOT_USE_PREFIX = 0;
$INSECURE_EXTRACT_MODE = 0;
$ZERO_PAD_NUMBERS = 0;
$RESOLVE_SYMLINK = $ENV{'PERL5_AT_RESOLVE_SYMLINK'} || 'speed';
$EXTRACT_BLOCK_SIZE = 1024 * 1024 * 1024;

BEGIN {
use Config;
@@ -894,10 +896,18 @@ sub _extract_file {

    if( $entry->size ) {
        binmode $fh;
        syswrite $fh, $entry->data or (
            $self->_error( qq[Could not write data to '$full'] ),
            return
        );
        my $offset = 0;
        my $content = $entry->get_content_by_ref();
        while ($offset < $entry->size) {
            my $written
                = syswrite $fh, $$content, $EXTRACT_BLOCK_SIZE, $offset;
            if (defined $written) {
                $offset += $written;
            } else {
                $self->_error( qq[Could not write data to '$full': $!] );
                return;
            }
        }
    }

    close $fh or (
@@ -2163,6 +2173,14 @@ numbers. Added for compatibility with C<busybox> implementations.
It won't work for terminal, pipe or sockets or every non seekable source.

=head2 $Archive::Tar::EXTRACT_BLOCK_SIZE

This variable holds an integer with the block size that should be used when
writing files during extraction. It defaults to 1 GiB. Please note that this
cannot be arbitrarily large since some operating systems limit the number of
bytes that can be written in one call to C<write(2)>, so if this is too large,
extraction may fail with an error.

=cut

=head1 FAQ
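The new variable documented above is an ordinary package global, so callers can override it before extraction if the 1 GiB default is unsuitable. A minimal usage sketch (the archive name and the 256 MiB value are arbitrary examples, not taken from this commit):

    use strict;
    use warnings;
    use Archive::Tar;

    # Example only: cap each write(2) call at 256 MiB instead of the 1 GiB default.
    $Archive::Tar::EXTRACT_BLOCK_SIZE = 256 * 1024 * 1024;

    # 'backup.tar' is a placeholder archive name.
    my $tar = Archive::Tar->new('backup.tar')
        or die Archive::Tar->error;
    $tar->extract
        or die $tar->error;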
28 changes: 28 additions & 0 deletions t/02_methods.t
@@ -570,6 +570,34 @@ SKIP: { ### pesky warnings
}


### extract tests with different $EXTRACT_BLOCK_SIZE values ###
SKIP: { ### pesky warnings
    skip $ebcdic_skip_msg, 517 if ord "A" != 65;

    skip('no IO::String', 517) if !$Archive::Tar::HAS_PERLIO &&
                                  !$Archive::Tar::HAS_PERLIO &&
                                  !$Archive::Tar::HAS_IO_STRING &&
                                  !$Archive::Tar::HAS_IO_STRING;

    my $tar = $Class->new;
    ok( $tar->read( $TAR_FILE ), "Read in '$TAR_FILE'" );

    for my $aref ( [$tar, \@EXPECT_NORMAL],
                   [$TARBIN, \@EXPECTBIN],
                   [$TARX, \@EXPECTX]
    ) {
        my($obj, $struct) = @$aref;

        for my $block_size ((1, BLOCK, 1024 * 1024, 2**31 - 4096, 2**31, 2**32)) {
            local $Archive::Tar::EXTRACT_BLOCK_SIZE = $block_size;

            ok( $obj->extract, " Extracted with 'extract'" );
            check_tar_extract( $obj, $struct );
        }
    }
}


### clear tests ###
SKIP: { ### pesky warnings
skip $ebcdic_skip_msg, 3 if ord "A" != 65;
