Fix occasional bug in iterating over gzipped WARCs with missing headers #1097

Open
dolsysmith opened this issue Dec 2, 2021 · 0 comments

dolsysmith commented Dec 2, 2021

For at least one collection (0287d41512b3492b801db3256112c103), the Twitter REST exporter throws a UnicodeDecodeError. In this case, the Content-Encoding header, which should be set to gzip, was either missing or duplicated with a different value for some lines in the warc.gz files. In these cases the warcio.WARCIterator class, which warc_iter.py uses to read the WARCs, falls back to a type of reader that does not decode the content properly. In every case tested, that content appears to be an empty bytestring.

Solution: in warc_iter.py, wrap the statement `line = stream.readline().decode('utf-8')` in a try/except block that catches UnicodeDecodeError and skips the line if decoding fails.
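
A minimal sketch of that change, written as a standalone generator so it can be run outside the codebase. The function name `iter_decoded_lines` and the logging call are assumptions for illustration only; the actual loop in warc_iter.py may be structured differently.

```python
import io
import logging

log = logging.getLogger(__name__)


def iter_decoded_lines(stream):
    """Yield UTF-8 decoded lines from a binary stream.

    Hypothetical helper: lines that cannot be decoded are logged and
    skipped instead of raising UnicodeDecodeError.
    """
    while True:
        raw = stream.readline()
        if not raw:
            # An empty bytestring from readline() means end of stream.
            break
        try:
            line = raw.decode('utf-8')
        except UnicodeDecodeError:
            log.warning("Skipping undecodable line (%d bytes)", len(raw))
            continue
        yield line


# Usage: the middle line holds bytes that are not valid UTF-8
# (here, the start of a gzip header) and is skipped.
sample = io.BytesIO(b'{"id": 1}\n\x1f\x8b\x08\x00\n{"id": 2}\n')
decoded = list(iter_decoded_lines(sample))
```

Catching only UnicodeDecodeError, rather than a bare except, keeps unrelated I/O errors visible. Logging each skipped line makes it possible to check later how many lines were dropped.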
