Parse tar headers incrementally #2127

Open
kientzle wants to merge 7 commits into master
Conversation

@kientzle (Contributor) commented Apr 14, 2024

This rebuilds the tar reader to parse all header data incrementally as it appears in the stream.

This definitively fixes a longstanding issue with unsupported pax attributes. Libarchive must limit the amount of data that it reads into memory, and this has caused problems with large unknown attributes. By scanning iteratively, we can instead identify an attribute by name and then decide whether to read it into memory or skip it entirely.
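To make that concrete, here is a minimal C sketch of the read-or-skip decision for a single pax extended-header record. This is not the PR's actual parser; the seekable-`FILE*` I/O, the `process_pax_record` name, and the choice of `"path"` as the one buffered keyword are all illustrative assumptions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One pax extended-header record has the form "len keyword=value\n",
 * where len counts the whole record.  The keyword is therefore
 * visible before the value is consumed, which is what lets us
 * choose between buffering the value and skipping it. */
static int
process_pax_record(FILE *f)
{
	unsigned long len;
	char keyword[64];
	int prefix = 0;

	/* %n records how many bytes the "len keyword=" prefix used. */
	if (fscanf(f, "%lu %63[^=]=%n", &len, keyword, &prefix) < 2)
		return (-1);
	if ((unsigned long)prefix >= len)
		return (-1);
	long value_len = (long)(len - prefix); /* includes trailing '\n' */

	if (strcmp(keyword, "path") == 0) {
		/* A keyword we want: read the value into memory. */
		char *value = malloc(value_len + 1);
		if (value == NULL ||
		    fread(value, 1, value_len, f) != (size_t)value_len) {
			free(value);
			return (-1);
		}
		value[value_len] = '\0';
		/* ...store the attribute on the entry... */
		free(value);
		return (0);
	}
	/* Unknown or unwanted keyword: advance past the value
	 * without allocating space for it. */
	return (fseek(f, value_len, SEEK_CUR) == 0 ? 0 : -1);
}
```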

This design also allows us to vary our sanity limits for different pax attributes (e.g., an attribute that is a single number can be limited to a few dozen bytes, while an attribute holding an ACL is allowed to be a few hundred kilobytes). This makes us a little more resistant to malicious archives that might try to force allocation of very large amounts of memory, though there is still work to be done here.
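One way to picture those per-attribute limits is a small lookup table. The keywords below are standard pax/star names, but the byte limits themselves are made up for illustration and are not the values this PR uses:

```c
#include <stddef.h>
#include <string.h>

/* Per-keyword sanity limits.  A limit of 0 means
 * "skip without reading". */
struct pax_limit {
	const char *keyword;
	size_t max_value_len;
};

static const struct pax_limit pax_limits[] = {
	{ "size",              64 },          /* a single decimal number */
	{ "mtime",             64 },
	{ "path",              1024 * 1024 },
	{ "SCHILY.acl.access", 512 * 1024 },  /* ACLs can be large */
	{ NULL, 0 }
};

static size_t
pax_limit_for(const char *keyword)
{
	const struct pax_limit *p;

	for (p = pax_limits; p->keyword != NULL; p++)
		if (strcmp(p->keyword, keyword) == 0)
			return (p->max_value_len);
	return (0);   /* unknown keyword: don't buffer it at all */
}
```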

This also includes a number of changes to archive_entry processing so that we consistently keep the first appearance of any given value, replacing the original architecture, which recursively cached data in memory in order to effectively process all the data from back to front.
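A minimal sketch of the keep-first rule, assuming a hypothetical string field with a was-set flag. Because a pax extended header precedes the ustar header it modifies, a front-to-back parse that keeps the first value seen gives the extended header the same priority the old back-to-front approach achieved:

```c
#include <stdlib.h>
#include <string.h>

struct first_string {
	char *value;
	int   set;   /* nonzero once a value has been accepted */
};

static void
first_string_set(struct first_string *s, const char *v)
{
	if (s->set)
		return;   /* a later appearance never overrides */
	s->value = strdup(v);
	s->set = (s->value != NULL);
}
```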

Resolves #1855
Resolves #1939

@kientzle requested a review from @mmatuska on April 14, 2024 18:37
@kientzle (Author) commented:

CC: @jvreelanda

@kientzle (Author) commented:

Hmmm.... Looks like I need to look closely at the symlink handling to see why this test is failing on mingw-gcc. All other platforms are testing clean:

173/744 Test #173: libarchive_test_pax_filename_encoding ....................................***Failed    0.02 sec

172: test_pax_filename_encoding
D:\a\libarchive\libarchive\libarchive\test\test_pax_filename_encoding.c:167: filename != archive_entry_hardlink(entry)
                           filename = "abc\xCC\x8Cmno\xFCxyz" (length 12)
      archive_entry_hardlink(entry) = "abc\xE2\x95\xA0\xC3\xAEmno\xE2\x81\xBFxyz" (length 17)
D:\a\libarchive\libarchive\libarchive\test\test_pax_filename_encoding.c:173: filename != archive_entry_symlink(entry)
                          filename = "abc\xCC\x8Cmno\xFCxyz" (length 12)
      archive_entry_symlink(entry) = "abc\xE2\x95\xA0\xC3\xAEmno\xE2\x81\xBFxyz" (length 17)

@kientzle (Author) commented:

I think the mingw-gcc test failure is pointing out a subtle issue with the new code: pax archives store information with differing charset conventions. In particular, filenames and link names are stored in an unspecified character encoding in the legacy ustar header but in UTF-8 in the pax header. So I've managed to inadvertently break link-name conversions, since I now accumulate the raw version and only do a single conversion at the end, by which point I've lost track of which header (and hence which character encoding) the name came from.

The solution is probably to accumulate the converted version so I can ensure the correct conversion for each header. (This will be a problem someday when I tackle redesigning the API to eliminate most character-set conversions. Oh, well.)
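As a sketch of that fix, a link name could be converted at the point where its header is parsed, while the source charset is still known. Everything named below (struct entry, dup_bytes, convert_local_to_utf8, entry_set_linkname_first) is hypothetical, not libarchive's internal API:

```c
#include <stddef.h>

struct entry;
char *dup_bytes(const char *p, size_t len);             /* copy as-is */
char *convert_local_to_utf8(const char *p, size_t len); /* charset conversion */
int   entry_set_linkname_first(struct entry *e, char *utf8);

static int
store_linkname(struct entry *e, const char *raw, size_t len, int from_pax)
{
	char *utf8;

	if (from_pax) {
		/* pax record values are defined to be UTF-8 already. */
		utf8 = dup_bytes(raw, len);
	} else {
		/* ustar names use an unspecified local encoding; convert
		 * now, while we still know which header they came from. */
		utf8 = convert_local_to_utf8(raw, len);
	}
	if (utf8 == NULL)
		return (-1);
	return (entry_set_linkname_first(e, utf8));  /* keep-first rule */
}
```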

Instead of keeping a local copy of the linkname, push
it into the entry directly when we see it.  This requires
modifying archive_entry to store a single "linkname" value
and separately mark whether that name is a symlink or hardlink.
This should not break any existing use, since no entry should
ever have both a symlink and a hardlink value.

For tar, this lets us correctly handle pax link names, which
are stored early in the archive, before we find out whether
the name is a symlink or hardlink.  (We previously did this
by storing the link name locally and putting it into the entry
only after we knew which kind of link it was.)

Nice side effect:  This slightly reduces the size of the
archive_entry struct.

Still TODO:  I need to correctly handle character set
for link names.
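A rough sketch of the single-linkname representation described in the commit message above; the field and type names are illustrative, not archive_entry's real internals:

```c
#include <stddef.h>

enum linkname_type { LN_NONE, LN_SYMLINK, LN_HARDLINK };

struct entry {
	char *linkname;            /* one buffer serves both link kinds */
	enum linkname_type ltype;
};

/* A pax "linkpath" record can arrive before the link type is known,
 * so the name and the type are recorded independently. */
static void
entry_set_linkname(struct entry *e, char *name)
{
	if (e->linkname == NULL)   /* keep the first appearance */
		e->linkname = name;
}

/* The ustar typeflag ('1' = hardlink, '2' = symlink) later settles
 * which accessor this name belongs to. */
static void
entry_set_linktype(struct entry *e, char typeflag)
{
	e->ltype = (typeflag == '1') ? LN_HARDLINK : LN_SYMLINK;
}
```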
@kientzle (Author) commented:

Reworking the linkname character-set handling seems to have addressed the mingw-gcc regression.
