Can't download: 403 error on some CC segments. #279

Open
enn-nafnlaus opened this issue Feb 14, 2024 · 1 comment

2024-02-14 21:01 INFO 2048692:root - Downloaded https://dl.fbaipublicfiles.com/laser/CCMatrix/v1.0.0/2020-10_0278.tsv.gz [200] took 8s (5766.4kB/s)
2024-02-14 21:01 INFO 2048692:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-10/segments/1581875145708.59/wet/CC-MAIN-20200222150029-20200222180029-00542.warc.wet.gz
2024-02-14 21:01 INFO 2048693:root - Downloaded https://dl.fbaipublicfiles.com/laser/CCMatrix/v1.0.0/2018-05_0044.tsv.gz [200] took 9s (5267.6kB/s)
2024-02-14 21:01 INFO 2048693:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886639.11/wet/CC-MAIN-20180116184540-20180116204540-00601.warc.wet.gz
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/data/LLM_training/translation/cc_net/dl_cc_matrix.py", line 138, in dl_file
    raw_documents = get_documents(segment)
  File "/data/LLM_training/translation/cc_net/dl_cc_matrix.py", line 107, in get_documents
    return {d["digest"]: d["raw_content"] for d in CCSegmentsReader([segment])}
  File "/data/LLM_training/translation/cc_net/dl_cc_matrix.py", line 107, in <dictcomp>
    return {d["digest"]: d["raw_content"] for d in CCSegmentsReader([segment])}
  File "/data/LLM_training/translation/cc_net/cc_net/process_wet_file.py", line 199, in __iter__
    for doc in parse_warc_file(self.open_segment(segment), self.min_len):
  File "/data/LLM_training/translation/cc_net/cc_net/process_wet_file.py", line 192, in open_segment
    return jsonql.open_remote_file(url, cache=file)
  File "/data/LLM_training/translation/cc_net/cc_net/jsonql.py", line 1124, in open_remote_file
    raw_bytes = request_get_content(url)
  File "/data/LLM_training/translation/cc_net/cc_net/jsonql.py", line 1101, in request_get_content
    raise e
  File "/data/LLM_training/translation/cc_net/cc_net/jsonql.py", line 1095, in request_get_content
    r.raise_for_status()
  File "/home/user/.local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00001.warc.wet.gz
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/LLM_training/translation/cc_net/dl_cc_matrix.py", line 338, in <module>
    func_argparse.main(dl, finalize)
  File "/home/user/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 29, in main
    return make_main(*fns, module=module, description=description)(sys.argv[1:])
  File "/home/user/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/data/LLM_training/translation/cc_net/dl_cc_matrix.py", line 103, in dl
    pool.map(dlf, file_list)
  File "/usr/lib64/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00001.warc.wet.gz
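
One thing that may be worth checking, offered as an assumption rather than something confirmed in this thread: the failing URL uses the `commoncrawl.s3.amazonaws.com` host, and Common Crawl's public HTTP downloads may now need to go through `https://data.commoncrawl.org/` instead. The sketch below only rewrites the host of a segment URL so the same path can be retried against the other endpoint. `rewrite_cc_url`, `OLD_HOST` and `NEW_HOST` are hypothetical names, not part of cc_net.

```python
from urllib.parse import urlsplit, urlunsplit

# Host seen in the traceback above.
OLD_HOST = "commoncrawl.s3.amazonaws.com"
# Assumption: alternative public HTTP endpoint for the same object paths.
NEW_HOST = "data.commoncrawl.org"


def rewrite_cc_url(url: str) -> str:
    """Return `url` with OLD_HOST swapped for NEW_HOST; other URLs pass through unchanged."""
    parts = urlsplit(url)
    if parts.netloc != OLD_HOST:
        return url
    return urlunsplit(parts._replace(netloc=NEW_HOST))


if __name__ == "__main__":
    failing = (
        "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/"
        "segments/1550247479101.30/wet/"
        "CC-MAIN-20190215183319-20190215205319-00001.warc.wet.gz"
    )
    print(rewrite_cc_url(failing))
```

If the rewritten URL also returns 403 or 404, the segment is probably unavailable and the host was not the cause.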

Contributor

avidale commented Feb 15, 2024

Hi! If some Common Crawl files no longer exist, I am not sure they would be easy to find.
Have you considered downloading CCMatrix from another source, such as https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix?
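
If a few segments really are gone, one option is to skip them and record which ones failed, so that a single 403 does not abort the whole `pool.map` run as it does in the traceback above. The following is a minimal sketch of that idea, not how `dl_cc_matrix.py` works: it uses the standard library's `urllib` rather than `requests`, and `fetch`, `fetch_or_skip`, `download_all` and `SKIPPABLE` are all hypothetical names.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Assumption: these status codes mean "segment unavailable", so retrying will not help.
SKIPPABLE = {403, 404}


def fetch(url: str, timeout: float = 60.0) -> bytes:
    """Download `url` and return the raw bytes; raises HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_or_skip(url: str, fetcher: Callable[[str], bytes] = fetch) -> Optional[bytes]:
    """Return the body, or None if the server answered with a skippable status."""
    try:
        return fetcher(url)
    except urllib.error.HTTPError as err:
        if err.code in SKIPPABLE:
            return None
        raise  # anything else (e.g. 500) is still treated as a real failure


def download_all(
    urls: Iterable[str], fetcher: Callable[[str], bytes] = fetch
) -> Tuple[Dict[str, bytes], List[str]]:
    """Fetch every URL, returning (downloaded bodies, list of skipped URLs)."""
    downloaded: Dict[str, bytes] = {}
    skipped: List[str] = []
    for url in urls:
        body = fetch_or_skip(url, fetcher)
        if body is None:
            skipped.append(url)
        else:
            downloaded[url] = body
    return downloaded, skipped
```

The `skipped` list can be written to a file at the end of the run, which makes it easy to see how much of the corpus is affected before deciding whether to switch to another source.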
