Parse tar headers incrementally #2127

Open
kientzle wants to merge 7 commits into master
Conversation

@kientzle (Contributor) commented Apr 14, 2024

This rebuilds the tar reader to parse all header data incrementally as it appears in the stream.

This definitively fixes a longstanding issue with unsupported pax attributes. Libarchive must limit the amount of data that it reads into memory, and this has caused problems with large unknown attributes. By scanning iteratively, we can instead identify an attribute by name and then decide whether to read it into memory or skip it entirely.
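To make that concrete, here is a minimal C sketch of the read-or-skip decision for a single pax extended-header record. This is not the PR's actual parser; the seekable-`FILE*` I/O, the `process_pax_record` name, and the choice of `"path"` as the one buffered keyword are all illustrative assumptions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One pax extended-header record has the form "len keyword=value\n",
 * where len counts the whole record.  The keyword is therefore
 * visible before the value is consumed, which is what lets us
 * choose between buffering the value and skipping it. */
static int
process_pax_record(FILE *f)
{
	unsigned long len;
	char keyword[64];
	int prefix = 0;

	/* %n records how many bytes the "len keyword=" prefix used. */
	if (fscanf(f, "%lu %63[^=]=%n", &len, keyword, &prefix) < 2)
		return (-1);
	if ((unsigned long)prefix >= len)
		return (-1);
	long value_len = (long)(len - prefix); /* includes trailing '\n' */

	if (strcmp(keyword, "path") == 0) {
		/* A keyword we want: read the value into memory. */
		char *value = malloc(value_len + 1);
		if (value == NULL ||
		    fread(value, 1, value_len, f) != (size_t)value_len) {
			free(value);
			return (-1);
		}
		value[value_len] = '\0';
		/* ...store the attribute on the entry... */
		free(value);
		return (0);
	}
	/* Unknown or unwanted keyword: advance past the value
	 * without allocating space for it. */
	return (fseek(f, value_len, SEEK_CUR) == 0 ? 0 : -1);
}
```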

This design also allows us to vary our sanity limits for different pax attributes (e.g., an attribute that is a single number can be limited to a few dozen bytes, while an attribute holding an ACL is allowed to be a few hundred kilobytes). This makes us a little more resistant to malicious archives that might try to force allocation of very large amounts of memory, though there is still work to be done here.
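One way to picture those per-attribute limits is a small lookup table. The keywords below are standard pax/star names, but the byte limits themselves are made up for illustration and are not the values this PR uses:

```c
#include <stddef.h>
#include <string.h>

/* Per-keyword sanity limits.  A limit of 0 means
 * "skip without reading". */
struct pax_limit {
	const char *keyword;
	size_t max_value_len;
};

static const struct pax_limit pax_limits[] = {
	{ "size",              64 },          /* a single decimal number */
	{ "mtime",             64 },
	{ "path",              1024 * 1024 },
	{ "SCHILY.acl.access", 512 * 1024 },  /* ACLs can be large */
	{ NULL, 0 }
};

static size_t
pax_limit_for(const char *keyword)
{
	const struct pax_limit *p;

	for (p = pax_limits; p->keyword != NULL; p++)
		if (strcmp(p->keyword, keyword) == 0)
			return (p->max_value_len);
	return (0);   /* unknown keyword: don't buffer it at all */
}
```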

This also includes a number of changes to archive_entry processing so that we consistently keep the first appearance of any given value, replacing the original architecture, which recursively cached data in memory in order to effectively process all the data from back to front.
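A minimal sketch of the keep-first rule, assuming a hypothetical string field with a was-set flag. Because a pax extended header precedes the ustar header it modifies, a front-to-back parse that keeps the first value seen gives the extended header the same priority the old back-to-front approach achieved:

```c
#include <stdlib.h>
#include <string.h>

struct first_string {
	char *value;
	int   set;   /* nonzero once a value has been accepted */
};

static void
first_string_set(struct first_string *s, const char *v)
{
	if (s->set)
		return;   /* a later appearance never overrides */
	s->value = strdup(v);
	s->set = (s->value != NULL);
}
```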

Resolves #1855
Resolves #1939

@kientzle requested a review from @mmatuska on April 14, 2024 18:37
@kientzle (Author) commented:

CC: @jvreelanda

@kientzle (Author) commented:

Hmmm.... Looks like I need to look closely at the symlink handling to see why this test is failing on mingw-gcc. All other platforms are testing clean:

173/744 Test #173: libarchive_test_pax_filename_encoding ....................................***Failed    0.02 sec

172: test_pax_filename_encoding
D:\a\libarchive\libarchive\libarchive\test\test_pax_filename_encoding.c:167: filename != archive_entry_hardlink(entry)
                           filename = "abc\xCC\x8Cmno\xFCxyz" (length 12)
      archive_entry_hardlink(entry) = "abc\xE2\x95\xA0\xC3\xAEmno\xE2\x81\xBFxyz" (length 17)
D:\a\libarchive\libarchive\libarchive\test\test_pax_filename_encoding.c:173: filename != archive_entry_symlink(entry)
                          filename = "abc\xCC\x8Cmno\xFCxyz" (length 12)
      archive_entry_symlink(entry) = "abc\xE2\x95\xA0\xC3\xAEmno\xE2\x81\xBFxyz" (length 17)

@kientzle (Author) commented:

I think the mingw-gcc test failure is pointing out a subtle issue with the new code: pax archives store information with differing charset conventions. In particular, filenames and link names are stored in an unspecified character encoding in the legacy ustar header but in UTF-8 in the pax header. So I've managed to inadvertently break link-name conversions, since I now accumulate the raw version and only do a single conversion at the end, by which point I've lost track of which header (and hence which character encoding) the name came from.

The solution is probably to accumulate the converted version so I can ensure the correct conversion for each header. (This will be a problem someday when I tackle redesigning the API to eliminate most character-set conversions. Oh, well.)
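As a sketch of that fix, a link name could be converted at the point where its header is parsed, while the source charset is still known. Everything named below (struct entry, dup_bytes, convert_local_to_utf8, entry_set_linkname_first) is hypothetical, not libarchive's internal API:

```c
#include <stddef.h>

struct entry;
char *dup_bytes(const char *p, size_t len);             /* copy as-is */
char *convert_local_to_utf8(const char *p, size_t len); /* charset conversion */
int   entry_set_linkname_first(struct entry *e, char *utf8);

static int
store_linkname(struct entry *e, const char *raw, size_t len, int from_pax)
{
	char *utf8;

	if (from_pax) {
		/* pax record values are defined to be UTF-8 already. */
		utf8 = dup_bytes(raw, len);
	} else {
		/* ustar names use an unspecified local encoding; convert
		 * now, while we still know which header they came from. */
		utf8 = convert_local_to_utf8(raw, len);
	}
	if (utf8 == NULL)
		return (-1);
	return (entry_set_linkname_first(e, utf8));  /* keep-first rule */
}
```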

Instead of keeping a local copy of the linkname, push
it into the entry directly when we see it.  This requires
modifying archive_entry to store a single "linkname" value
and separately mark whether that name is a symlink or hardlink.
This should not break any existing use, since no entry should
ever have both a symlink and a hardlink value.

For tar, this lets us correctly handle pax link names, which
are stored early in the archive, before we find out whether
the name is a symlink or hardlink.  (We previously did this
by storing the link name locally and putting it into the entry
only after we knew which kind of link it was.)

Nice side effect:  This slightly reduces the size of the
archive_entry struct.

Still TODO:  I need to correctly handle character set
for link names.
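A rough sketch of the single-linkname representation described in the commit message above; the field and type names are illustrative, not archive_entry's real internals:

```c
#include <stddef.h>

enum linkname_type { LN_NONE, LN_SYMLINK, LN_HARDLINK };

struct entry {
	char *linkname;            /* one buffer serves both link kinds */
	enum linkname_type ltype;
};

/* A pax "linkpath" record can arrive before the link type is known,
 * so the name and the type are recorded independently. */
static void
entry_set_linkname(struct entry *e, char *name)
{
	if (e->linkname == NULL)   /* keep the first appearance */
		e->linkname = name;
}

/* The ustar typeflag ('1' = hardlink, '2' = symlink) later settles
 * which accessor this name belongs to. */
static void
entry_set_linktype(struct entry *e, char typeflag)
{
	e->ltype = (typeflag == '1') ? LN_HARDLINK : LN_SYMLINK;
}
```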
@kientzle (Author) commented:

Reworking the linkname character-set handling seems to have addressed the mingw-gcc regression.
