Parse tar headers incrementally #2127
base: master
Conversation
This rebuilds the tar reader to parse all header data incrementally as it appears in the stream.

This definitively fixes a longstanding issue with unsupported pax attributes. Libarchive must limit the amount of data that it reads into memory, and this has caused problems with large unknown attributes. By scanning iteratively, we can instead identify an attribute and decide whether to read it into memory or to skip it without reading.

This design also allows us to vary our sanity limits for different pax attributes (e.g., an attribute that is a single number can be limited to a few dozen bytes, while an attribute holding an ACL is allowed to be a few megabytes). This makes us a little more resistant to malicious archives that might try to force allocation of very large amounts of memory, though there is still work to be done here.

This builds on earlier work to make tar header parsing purely iterative by keeping the _first_ appearance of any given value, instead of the original architecture, which recursively cached data in memory in order to effectively process all the data from back to front.
CC: @jvreelanda
Hmmm... Looks like I need to look closely at the symlink handling to see why this test is failing on mingw-gcc. All other platforms are testing clean.
I think the mingw-gcc test failure is pointing out a subtle issue with the new code: pax archives store information with differing charset conventions. In particular, filenames and link names are stored in an unspecified character encoding in the legacy ustar header but in UTF-8 in the pax header. So I've managed to inadvertently break link-name conversions, since I now accumulate the raw version and only do a single conversion at the end, at which point I've lost track of which header (and hence which character encoding) the name came from. The solution is probably to accumulate the converted version so I can ensure the correct conversion. (This will be a problem someday when I tackle redesigning the API to eliminate most character-set conversions. Oh, well.)
Instead of keeping a local copy of the linkname, push it into the entry directly when we see it. This requires modifying archive_entry to store a single "linkname" value and separately mark whether that name is a symlink or hardlink. This should not break any existing use, since no entry should ever have both a symlink and a hardlink value.

For tar, this lets us correctly handle pax link names, which are stored early in the archive; only later do we find out whether that name is a symlink or hardlink. (We did this before by storing the link name and putting it into the entry only after we knew whether this was a hard or sym link.)

Nice side effect: this slightly reduces the size of the archive_entry struct.

Still TODO: I need to correctly handle the character set for link names.
Reworking the linkname character-set handling seems to have addressed the mingw-gcc regression.
Resolves #1855
Resolves #1939