Suspicion of incorrect handling of content length in WARC records #121

Open
ThomasA opened this issue Dec 3, 2020 · 1 comment

ThomasA commented Dec 3, 2020

I am currently processing a fairly large batch (on the order of 600 GB) of WARC files containing dumped homepages.
I am sifting through these files for image content, which I then extract and process further. Once in a while, I come across records in the WARC files that cause PIL to emit warnings, all similar to this:

WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 14392491347
Remainder: b'l>\r\n'

Looking at the remainder above, it seems to me that PIL expects the content to end slightly too early, since the content does in fact seem to end with a newline a few characters further on.
I cannot quite tell where this problem comes from. It could be an error in the software that originally encoded the image, an error in the software that originally dumped the web page into the WARC file, or perhaps an error in how warcio determines the size of the payload in the WARC record.
I would appreciate any help in determining the cause of this problem, and I am happy to help debug it. I am also going to refer to this issue in another issue that I will post shortly and that I suspect may be related.

Member

ikreymer commented Dec 5, 2020

Hm, perhaps the error message can be improved. I think this is usually a sign that the Content-Length is too short, e.g. in the above case, if it were 2 larger, the record would parse correctly.
Do you have an example that you could share? It would be good to confirm that this is the case, and not warcio messing up the parsing.
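
One way to check the "Content-Length is too short" hypothesis on an extracted, uncompressed record is sketched below. It is only an illustration, not warcio's own logic: the function name `content_length_shortfall` and the `max_slack` parameter are made up here, and the sketch assumes the record's raw bytes start at the `WARC/` version line and that a well-formed record is followed by two CRLFs.

```python
RECORD_END = b"\r\n\r\n"  # blank-line terminator expected after a record block


def content_length_shortfall(raw: bytes, max_slack: int = 64):
    """Return how many bytes the declared Content-Length is short by.

    Returns 0 if the declared length already lines up with the record
    terminator, and None if no plausible end is found within max_slack bytes.
    `raw` is assumed to hold one uncompressed record starting at "WARC/".
    """
    header_end = raw.index(RECORD_END)  # end of the WARC header block
    declared = None
    for line in raw[:header_end].split(b"\r\n")[1:]:  # skip the version line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    if declared is None:
        raise ValueError("no Content-Length header found")

    body_start = header_end + len(RECORD_END)
    for slack in range(max_slack + 1):
        end = body_start + declared + slack
        tail = raw[end:end + len(RECORD_END)]
        following = raw[end + len(RECORD_END):end + len(RECORD_END) + 5]
        # Accept only if the terminator is followed by EOF or a new record.
        if tail == RECORD_END and following in (b"", b"WARC/"):
            return slack
    return None


# Synthetic record whose Content-Length (13) is 2 less than the real body (15).
body = b"<html>ok</html>"
record = (b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 13\r\n\r\n"
          + body + RECORD_END)
print(content_length_shortfall(record))  # 2 for this synthetic record
```

In this synthetic case the four bytes right after the declared end are `b'l>\r\n'`, the same shape as the remainder in the warning above. A result of 0 on a real record would instead point away from the Content-Length and towards the parsing.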

To make this easier, the warning should probably print the offset of the valid record, so that the record can be extracted for testing.
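
Once the start offset of a record is known, pulling that one record out for testing could look roughly like the sketch below. It uses only the standard library and rests on assumptions that are not confirmed in this thread: that the WARC file is compressed record by record (one gzip member per record), and that `offset` (a hypothetical parameter) is the byte position of that member in the compressed file, which is not necessarily the offset printed in the warning today.

```python
import zlib


def read_gzip_member(path, offset, chunk_size=64 * 1024):
    """Decompress the single gzip member starting at `offset` in `path`.

    Assumes per-record gzip compression, so one member holds one record.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = bytearray()
    with open(path, "rb") as fh:
        fh.seek(offset)
        # decomp.eof becomes True once the end of this member is reached;
        # bytes of the next member are left in decomp.unused_data.
        while not decomp.eof:
            chunk = fh.read(chunk_size)
            if not chunk:
                break  # file ended before the member was complete
            out += decomp.decompress(chunk)
    return bytes(out)
```

The returned bytes can be written to a small standalone file and shared, or inspected directly to compare the declared Content-Length with the actual payload size.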
