I searched other issues (including closed issues) and could not find any to be related. If you find related issues post them below or directly add your issue to the most related one.
I confirm that this bug report does not report on a specific news site where news-please does not work. Please keep in mind that news-please is a generic crawler so it is expected that it will not work for all sites well or even at all.
Describe the bug
crawl_from_commoncrawl crashes when it filters WARC files by date using the timestamp at the end of each filename (inside __extract_date_from_warc_filename). This appears to happen because it lists every file for each month using, e.g.:
aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2022/04/ --no-sign-request
The listing includes the filename "warc.paths.gz", and the crawler crashes when it attempts to parse that name as a date.
Traceback:
Traceback (most recent call last):
  File "bug_script.py", line 29, in <module>
    log_level=logging.INFO)
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 330, in crawl_from_commoncrawl
    cc_news_crawl_names = __get_remote_index(warc_files_start_date, warc_files_end_date)
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 181, in __get_remote_index
    p for p in lines if __date_within_period(
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 182, in <listcomp>
    __extract_date_from_warc_filename(p),
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 118, in __extract_date_from_warc_filename
    return datetime.datetime.strptime(dt, '%Y%m%d%H%M%S')
  File "/usr/local/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/local/lib/python3.6/_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data 'warc.paths.gz' does not match format '%Y%m%d%H%M%S'
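The final error can be reproduced without news-please. The snippet below is only a sketch: `parse_warc_timestamp` is a hypothetical stand-in for the last line of `__extract_date_from_warc_filename`, using the same format string that appears in the traceback.

```python
import datetime


def parse_warc_timestamp(text):
    # Hypothetical helper: mirrors the strptime call shown in the traceback.
    return datetime.datetime.strptime(text, "%Y%m%d%H%M%S")


# A 14-digit timestamp parses without error.
print(parse_warc_timestamp("20220401000546"))

# The index file name carries no timestamp, so strptime raises ValueError.
try:
    parse_warc_timestamp("warc.paths.gz")
except ValueError as exc:
    print("ValueError:", exc)
```

Any listing entry that is not a timestamped WARC file would fail the same way.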
Expected behavior
crawl_from_commoncrawl should automatically ignore "warc.paths.gz" and parse only the timestamped WARC filenames.
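One possible way to get this behavior is to match filenames against the expected pattern before parsing them. This is a rough sketch, not the library's code: the helper names `extract_date_or_none` and `filter_by_period` are hypothetical, and the pattern assumes CC-NEWS files are named like `CC-NEWS-20220401000546-00192.warc.gz`.

```python
import datetime
import os
import re

# Assumed CC-NEWS naming scheme: CC-NEWS-<14-digit timestamp>-<serial>.warc.gz
_WARC_NAME = re.compile(r"^CC-NEWS-(\d{14})-\d+\.warc\.gz$")


def extract_date_or_none(path):
    """Return the timestamp embedded in a WARC filename, or None if absent."""
    match = _WARC_NAME.match(os.path.basename(path))
    if match is None:
        return None  # e.g. warc.paths.gz, which has no timestamp
    return datetime.datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def filter_by_period(paths, start, end):
    """Keep only paths whose embedded timestamp lies within [start, end]."""
    kept = []
    for path in paths:
        dt = extract_date_or_none(path)
        if dt is not None and start <= dt <= end:
            kept.append(path)
    return kept
```

With this approach, entries that do not match the pattern are skipped silently instead of raising, so the monthly listing can contain index files without breaking the date filter.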
Log
INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2022/04/ --no-sign-request >> /tmp/tmppydqzsib && awk '{ print $4 }' /tmp/tmppydqzsib
Traceback (most recent call last):
  File "bug_script.py", line 29, in <module>
    log_level=logging.INFO)
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 330, in crawl_from_commoncrawl
    cc_news_crawl_names = __get_remote_index(warc_files_start_date, warc_files_end_date)
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 181, in __get_remote_index
    p for p in lines if __date_within_period(
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 182, in <listcomp>
    __extract_date_from_warc_filename(p),
  File "/usr/local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 118, in __extract_date_from_warc_filename
    return datetime.datetime.strptime(dt, '%Y%m%d%H%M%S')
  File "/usr/local/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/local/lib/python3.6/_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data 'warc.paths.gz' does not match format '%Y%m%d%H%M%S'
Versions
OS: macOS 12.2 (host); the script runs in a Docker container with amazonlinux:latest as the base image.
Python Version: 3.6.15
news-please Version: 1.5.22
Intent
personal
academic
business
other
Some information on your project:
We're processing articles from CC-NEWS to create a brand sentiment tracking system.