crawl_from_commoncrawl crashes when attempting to parse the date from warc.paths.gz #224

Open · Loumstar opened this issue Apr 5, 2022 · 1 comment
Loumstar commented Apr 5, 2022

Mandatory

  • I read the documentation (readme and wiki).
  • I searched other issues (including closed issues) and could not find any to be related. If you find related issues post them below or directly add your issue to the most related one.
  • I confirm that this bug report does not report on a specific news site where news-please does not work. Please keep in mind that news-please is a generic crawler so it is expected that it will not work for all sites well or even at all.

Describe the bug
crawl_from_commoncrawl crashes when attempting to filter WARC files by date using the timestamp at the end of the filename (inside __extract_date_from_warc_filename). This appears to happen because the crawler lists every key under each monthly prefix using, e.g.:

aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2022/04/ --no-sign-request

This will return something like:

2022-04-01 03:05:03 1072709550 crawl-data/CC-NEWS/2022/04/CC-NEWS-20220401000546-00192.warc.gz
2022-04-01 05:05:03 1072707459 crawl-data/CC-NEWS/2022/04/CC-NEWS-20220401012435-00193.warc.gz
....
2022-04-05 11:05:03 1072711882 crawl-data/CC-NEWS/2022/04/CC-NEWS-20220405090411-00275.warc.gz
2022-04-05 11:05:12        698 crawl-data/CC-NEWS/2022/04/warc.paths.gz

which includes the index file "warc.paths.gz"; when the crawler attempts to parse that filename as a date, it crashes.
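
For reference, here is a minimal sketch of the failing parse. The way the 14-digit timestamp is sliced out of the filename is an approximation of what __extract_date_from_warc_filename does, not the exact implementation; only the strptime call and format string are taken from the traceback:

import datetime

filenames = [
    "crawl-data/CC-NEWS/2022/04/CC-NEWS-20220401000546-00192.warc.gz",
    "crawl-data/CC-NEWS/2022/04/warc.paths.gz",  # index file that triggers the crash
]

def extract_date(path):
    # Approximation: strip the CC-NEWS- prefix and take the 14-digit timestamp.
    name = path.split("/")[-1]
    dt = name.replace("CC-NEWS-", "").split("-")[0]
    return datetime.datetime.strptime(dt, "%Y%m%d%H%M%S")

for p in filenames:
    extract_date(p)  # raises ValueError on 'warc.paths.gz'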

Traceback:

Traceback (most recent call last):
  File "bug_script.py", line 29, in <module>
    log_level=logging.INFO)
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 330, in crawl_from_commoncrawl
    cc_news_crawl_names = __get_remote_index(warc_files_start_date, warc_files_end_date)
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 181, in __get_remote_index
    p for p in lines if __date_within_period(
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 182, in <listcomp>
    __extract_date_from_warc_filename(p),
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 118, in __extract_date_from_warc_filename
    return datetime.datetime.strptime(dt, '%Y%m%d%H%M%S')
  File "/usr/local/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/local/lib/python3.6/_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data 'warc.paths.gz' does not match format '%Y%m%d%H%M%S'

To Reproduce
Run this script:

import os
import logging

from datetime import datetime, date, timedelta
from newsplease.crawler import commoncrawl_crawler as cc

def empty_callback(*args):
    pass

today = date.today()
today_date = datetime(today.year, today.month, today.day)

end_date = today_date - timedelta(days=1)
start_date = end_date - timedelta(days=1)

os.makedirs("./warcs/", exist_ok=True)

cc.crawl_from_commoncrawl(
    valid_hosts=["bbc.co.uk"],
    warc_files_start_date=start_date,
    warc_files_end_date=end_date,
    callback_on_article_extracted=empty_callback,
    callback_on_warc_completed=empty_callback,
    continue_after_error=True,
    local_download_dir_warc="./warcs/",
    number_of_extraction_processes=1,
    log_level=logging.INFO)

Expected behavior
crawl_from_commoncrawl should automatically ignore "warc.paths.gz" and only parse the dated WARC filenames.
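
One possible workaround (a hypothetical sketch, not the actual fix shipped in news-please) would be to filter the S3 listing against the CC-NEWS filename pattern before any date parsing takes place:

import re

# Hypothetical guard: keep only keys that look like dated CC-NEWS WARC files,
# so index files such as warc.paths.gz never reach the date parser.
CC_NEWS_WARC = re.compile(r"CC-NEWS-\d{14}-\d+\.warc\.gz$")

lines = [
    "crawl-data/CC-NEWS/2022/04/CC-NEWS-20220405090411-00275.warc.gz",
    "crawl-data/CC-NEWS/2022/04/warc.paths.gz",
]
warc_paths = [p for p in lines if CC_NEWS_WARC.search(p)]
# warc_paths now contains only the dated WARC file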

Log

INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2022/04/ --no-sign-request >> /tmp/tmppydqzsib && awk '{ print $4 }' /tmp/tmppydqzsib 
(followed by the same traceback shown above)

Versions

  • OS: macOS 12.2, but the script runs in a Docker container with amazonlinux:latest as the base image.
  • Python Version: 3.6.15
  • news-please Version: 1.5.22

Intent

  • personal
  • academic
  • business
  • other
  • Some information on your project:
    We're processing articles using CC-NEWS to build a brand sentiment tracking system.
sebastian-nagel (Contributor) commented
This issue is addressed in #226.
