URLs are sometimes not retried correctly #507

Open
JustAnotherArchivist opened this issue Apr 22, 2021 · 0 comments

I've noticed that URLs are sometimes not retried properly. The most recent example is job 172fw8g4egszevx4i56uu06cm. Here is one of about 1700 such URLs on that job:

$ zstdgrep -F 'https://usc.gov.mm/?q=node/66' usc.gov.mm-inf-20210314-042931-172fw-meta.warc.gz
2021-03-14 04:33:31,023 - wpull.processor.web - INFO - Fetching ‘https://usc.gov.mm/?q=node/66’.
2021-03-14 04:33:51,038 - wpull.processor.base - ERROR - Fetching ‘https://usc.gov.mm/?q=node/66’ encountered an error: Connect timed out.

This URL was attempted only once and was obviously not retrieved successfully. Further, no ignores matching this URL were present, so it should have been retried, yet it wasn't. I've seen another example of this in the past couple of months but can't find it anymore.
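
For anyone who wants to look for more URLs like this, here is a rough Python sketch of the check. It assumes the log has already been decompressed into plain text lines, that every line follows the two formats quoted above, and that a retry would show up as a second "Fetching" line for the same URL. The function name `find_unretried_errors` is made up for illustration and is not part of wpull or ArchiveBot.

```python
import re
from collections import Counter

# Patterns modelled on the two log lines quoted in this issue.
# Other wpull log formats are not handled (assumption).
FETCH_RE = re.compile(
    r"wpull\.processor\.web - INFO - Fetching ‘(?P<url>.+)’\.$"
)
ERROR_RE = re.compile(
    r"wpull\.processor\.base - ERROR - Fetching ‘(?P<url>.+)’ "
    r"encountered an error: (?P<reason>.*)$"
)


def find_unretried_errors(lines):
    """Return URLs that were fetched exactly once and errored on that attempt.

    Hypothetical helper: it treats each "Fetching" INFO line as one attempt.
    """
    fetches = Counter()
    errors = Counter()
    for line in lines:
        line = line.rstrip("\n")
        match = FETCH_RE.search(line)
        if match:
            fetches[match.group("url")] += 1
            continue
        match = ERROR_RE.search(line)
        if match:
            errors[match.group("url")] += 1
    # A URL is suspicious if it has a single attempt and at least one error.
    return sorted(
        url for url, count in fetches.items()
        if count == 1 and errors[url] >= 1
    )
```

The function takes any iterable of log lines, so it can be fed an open text file or standard input. It only flags candidates; whether an ignore pattern explains a missing retry still has to be checked by hand.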

I haven't looked into this in detail yet. One thing I noticed (which may be entirely irrelevant) is that all affected URLs on that job, per a crude check of a couple of samples from wpull2-log-extract-errors in my little-things, are on https://usc.gov.mm/. Note that the job was started on HTTP, and the HTTPS server on that domain is actually broken.
