Improve blacklisting and repeat fetching #5

Open
iancoleman opened this issue Jul 7, 2017 · 1 comment

Comments

@iancoleman
Owner

Some pages require blacklisting due to their content (or lack of content).

Investigate these cases and ensure the blacklisting functionality of the scraper (i.e. fetch.py) is working correctly.

Sometimes archive.org returns an error status (e.g. 500) or an error page (e.g. one containing the text "Connection Failure").

Detect these events, pause, then repeat the fetch until it succeeds. Apply some sort of backoff between repeats so archive.org doesn't get hit too frequently.
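
A rough sketch of the detect, pause and repeat idea, assuming Python 3 and the standard library only. The names `fetch_once`, `looks_failed`, `fetch_with_backoff` and `ERROR_MARKERS` are hypothetical and are not taken from fetch.py; the delays and the attempt limit are arbitrary starting values, and the sketch gives up after `max_attempts` rather than retrying forever.

```python
import time
import urllib.error
import urllib.request

# Text that marks an error page even when the status is 200.
# "Connection Failure" comes from this issue; extend as other cases turn up.
ERROR_MARKERS = ("Connection Failure",)


def fetch_once(url, timeout=30):
    """Do a single GET and return (status, body). Hypothetical helper."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as e:
        # Error statuses such as 500 arrive as exceptions.
        return e.code, ""
    except OSError:
        # URLError and socket timeouts: no usable response at all.
        return None, ""


def looks_failed(status, body):
    """True for a non-200 status or a body containing a known error marker."""
    return status != 200 or any(marker in body for marker in ERROR_MARKERS)


def fetch_with_backoff(url, fetch=fetch_once, max_attempts=6,
                       base_delay=2.0, max_delay=300.0, sleep=time.sleep):
    """Repeat the fetch, doubling the pause after each failed attempt."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if not looks_failed(status, body):
            return body
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait
    raise RuntimeError("giving up on %s after %d attempts" % (url, max_attempts))
```

The `fetch` and `sleep` parameters are there so the retry logic can be exercised without touching the network or actually waiting.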

@iancoleman
Owner Author

Another error to catch when parsing yearlySummary, as per this traceback:

Removing outdated yearly summary: /home/user/cia_data/country_html/yearly_summaries/https%3A%2F%2Fweb.archive.org%2F__wb%2Fcalendarcaptures%3Furl%3Dhttps%253A%252F%252Fwww.cia.gov%252Flibrary%252Fpublications%252Fthe-world-factbook%252Fgeos%252Fau.html%26selected_year%3D2017
Fetching https://web.archive.org/__wb/calendarcaptures?url=https%3A%2F%2Fwww.cia.gov%2Flibrary%2Fpublications%2Fthe-world-factbook%2Fgeos%2Fau.html&selected_year=2017
Traceback (most recent call last):
  File "fetch.py", line 266, in <module>
    getPage(countryPage)
  File "fetch.py", line 59, in getPage
    data = json.loads(yearlySummaryContent)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
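
"Expecting value: line 1 column 1 (char 0)" means the response body did not start with a JSON value, which is what an empty body or an HTML error page would produce. One possible guard is sketched below; `parse_yearly_summary` is a hypothetical name, not a function in fetch.py, and the caller would decide whether a `None` result means retry or skip.

```python
import json


def parse_yearly_summary(content):
    """Return the parsed JSON, or None when content is not valid JSON."""
    try:
        return json.loads(content)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return None
```

Note that a body consisting of the JSON literal `null` also yields `None` here; if that distinction matters, re-raise or return a dedicated sentinel instead.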
