crawl_from_commoncrawl crashes when attempting to parse the date from warc.paths.gz #224

Open · Loumstar opened this issue Apr 5, 2022 · 1 comment
Loumstar commented Apr 5, 2022

Mandatory

  • I read the documentation (readme and wiki).
  • I searched other issues (including closed issues) and could not find any to be related. If you find related issues post them below or directly add your issue to the most related one.
  • I confirm that this bug report does not report on a specific news site where news-please does not work. Please keep in mind that news-please is a generic crawler so it is expected that it will not work for all sites well or even at all.

Describe the bug
crawl_from_commoncrawl crashes when attempting to filter WARC files by date using the timestamp at the end of the filename (inside __extract_date_from_warc_filename). This appears to happen because the crawler lists every key under each monthly prefix using, e.g.:

aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2022/04/ --no-sign-request

This will return something like:

2022-04-01 03:05:03 1072709550 crawl-data/CC-NEWS/2022/04/CC-NEWS-20220401000546-00192.warc.gz
2022-04-01 05:05:03 1072707459 crawl-data/CC-NEWS/2022/04/CC-NEWS-20220401012435-00193.warc.gz
....
2022-04-05 11:05:03 1072711882 crawl-data/CC-NEWS/2022/04/CC-NEWS-20220405090411-00275.warc.gz
2022-04-05 11:05:12        698 crawl-data/CC-NEWS/2022/04/warc.paths.gz

which includes the index file "warc.paths.gz"; when the crawler attempts to parse that filename as a date, it crashes.
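
For reference, here is a minimal sketch of the failing parse. The way the 14-digit timestamp is sliced out of the filename is an approximation of what __extract_date_from_warc_filename does, not the exact implementation; only the strptime call and format string are taken from the traceback:

import datetime

filenames = [
    "crawl-data/CC-NEWS/2022/04/CC-NEWS-20220401000546-00192.warc.gz",
    "crawl-data/CC-NEWS/2022/04/warc.paths.gz",  # index file that triggers the crash
]

def extract_date(path):
    # Approximation: strip the CC-NEWS- prefix and take the 14-digit timestamp.
    name = path.split("/")[-1]
    dt = name.replace("CC-NEWS-", "").split("-")[0]
    return datetime.datetime.strptime(dt, "%Y%m%d%H%M%S")

for p in filenames:
    extract_date(p)  # raises ValueError on 'warc.paths.gz'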

Traceback:

Traceback (most recent call last):
  File "bug_script.py", line 29, in <module>
    log_level=logging.INFO)
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 330, in crawl_from_commoncrawl
    cc_news_crawl_names = __get_remote_index(warc_files_start_date, warc_files_end_date)
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 181, in __get_remote_index
    p for p in lines if __date_within_period(
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 182, in <listcomp>
    __extract_date_from_warc_filename(p),
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 118, in __extract_date_from_warc_filename
    return datetime.datetime.strptime(dt, '%Y%m%d%H%M%S')
  File "/usr/local/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/local/lib/python3.6/_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data 'warc.paths.gz' does not match format '%Y%m%d%H%M%S'

To Reproduce
Run this script:

import os
import logging

from datetime import datetime, date, timedelta
from newsplease.crawler import commoncrawl_crawler as cc

def empty_callback(*args):
    pass

today = date.today()
today_date = datetime(today.year, today.month, today.day)

end_date = today_date - timedelta(days=1)
start_date = end_date - timedelta(days=1)

os.makedirs("./warcs/", exist_ok=True)

cc.crawl_from_commoncrawl(
    valid_hosts=["bbc.co.uk"],
    warc_files_start_date=start_date,
    warc_files_end_date=end_date,
    callback_on_article_extracted=empty_callback,
    callback_on_warc_completed=empty_callback,
    continue_after_error=True,
    local_download_dir_warc="./warcs/",
    number_of_extraction_processes=1,
    log_level=logging.INFO)

Expected behavior
crawl_from_commoncrawl should automatically ignore "warc.paths.gz" and only parse the dated WARC filenames.
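
One possible workaround (a hypothetical sketch, not the actual fix shipped in news-please) would be to filter the S3 listing against the CC-NEWS filename pattern before any date parsing takes place:

import re

# Hypothetical guard: keep only keys that look like dated CC-NEWS WARC files,
# so index files such as warc.paths.gz never reach the date parser.
CC_NEWS_WARC = re.compile(r"CC-NEWS-\d{14}-\d+\.warc\.gz$")

lines = [
    "crawl-data/CC-NEWS/2022/04/CC-NEWS-20220405090411-00275.warc.gz",
    "crawl-data/CC-NEWS/2022/04/warc.paths.gz",
]
warc_paths = [p for p in lines if CC_NEWS_WARC.search(p)]
# warc_paths now contains only the dated WARC file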

Log

INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2022/04/ --no-sign-request >> /tmp/tmppydqzsib && awk '{ print $4 }' /tmp/tmppydqzsib 
(followed by the same traceback shown above)

Versions

  • OS: macOS 12.2, but the script runs in a Docker container with amazonlinux:latest as the base image.
  • Python Version: 3.6.15
  • news-please Version: 1.5.22

Intent

  • personal
  • academic
  • business
  • other
  • Some information on your project:
    We're processing articles using CC-NEWS to build a brand sentiment tracking system.
sebastian-nagel (Contributor) commented
This issue is addressed in #226.
