make_wikipedia.py fails on linux #58

Open
peterbjorgensen opened this issue Oct 17, 2023 · 10 comments
Comments

@peterbjorgensen
Contributor

Traceback (most recent call last):
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 283, in _multiprocessing_run_all
    multiprocessing.set_start_method("spawn")
  File "/usr/lib/python3.11/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/peter/kode/dolma/scripts/make_wikipedia.py", line 289, in <module>
    main()
  File "/home/peter/kode/dolma/scripts/make_wikipedia.py", line 285, in main
    processor(date=args.date, lang=args.lang)
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 390, in __call__
    fn(
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 285, in _multiprocessing_run_all
    assert multiprocessing.get_start_method() == "spawn", "Multiprocessing start method must be spawn"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Multiprocessing start method must be spawn

The bug can be fixed by setting
multiprocessing.set_start_method("spawn")
in the __main__ environment.

Perhaps the dolma core/parallel.py should use multiprocessing.get_context("spawn") to avoid this.
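A minimal sketch of that suggestion, not dolma's actual code: `multiprocessing.get_context("spawn")` returns a context object whose pools use spawn, without changing the process-wide start method, so it should not matter what the caller has already set. The names `square`, `spawn_context`, and `run_all` are hypothetical and only stand in for the real worker and driver.

```python
import multiprocessing


def square(x: int) -> int:
    """Hypothetical worker, standing in for the real per-file processing."""
    return x * x


def spawn_context():
    # get_context("spawn") returns a context bound to "spawn" and leaves the
    # process-wide start method untouched, so it does not raise
    # "context has already been set".
    return multiprocessing.get_context("spawn")


def run_all(values, num_processes: int = 2):
    """Hypothetical driver: map the worker over values in a spawn-based pool."""
    ctx = spawn_context()
    with ctx.Pool(processes=num_processes) as pool:
        return pool.map(square, values)
```

Because spawn re-imports the main module in every worker, a call such as `run_all([1, 2, 3])` should be made from a script file under an `if __name__ == "__main__":` guard.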

@peterbjorgensen
Contributor Author

Once this is fixed I also get the following error:

files: 0.00f [03:30, ?f/s]        2023-10-18 09:34:09,483 WARNING dolma.WikiExtractorParallel Failed to process wikipedia_simple/wiki_20231001_simple/AA/wiki_00.gz: Error -3 while decompressing data: invalid stored block lengths

This is the command I use for running it:
python scripts/make_wikipedia.py --output wikipedia_simple --lang simple --processes 4

@soldni
Member

soldni commented Oct 25, 2023

Hi @peterbjorgensen! Thank you for this bug report. I've made a PR (#64) with these fixes in.

I can't seem to reproduce the gzip error... could you tell me a bit more about your setup (platform, Python version, etc.)?

@peterbjorgensen
Contributor Author

I am on Python 3.11.5 on fully updated Arch Linux, wikiextractor-3.0.7.
It seems like it makes an incomplete wiki_00.gz archive of 70 MB.
I can't gunzip wiki_00.gz either - I get gzip: wiki_00.gz: invalid compressed data--format violated
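For anyone who wants to check an archive from Python rather than with gunzip, here is a rough sketch. It assumes that reading the file to the end is enough to surface the problem; `gzip_error` is a hypothetical helper name, not part of dolma or wikiextractor. The `Error -3 while decompressing data` messages in this thread are `zlib.error`, a truncated file typically raises `EOFError`, and a bad header raises `gzip.BadGzipFile` (a subclass of `OSError`).

```python
import gzip
import zlib
from typing import Optional


def gzip_error(path: str, chunk_size: int = 1 << 20) -> Optional[str]:
    """Return None if the whole file decompresses, else a short error description."""
    try:
        with gzip.open(path, "rb") as f:
            # Read in chunks so a large archive is never held in memory at once.
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None
```

Calling `gzip_error("wiki_00.gz")` on the file produced by the script would then return either `None` or the name and message of the exception that was raised.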

@huangwei2913

Even with Python 3.11.8, I get the same error:
Found 1 files to process
files: 0.00f [04:25, ?f/s][2024-02-29 16:53:56 SpawnPoolWorker-32.dolma.WikiExtractorParallel WARNING] Failed to process wikipedia/wiki_20231001_simple/AA/wiki_00.gz: Error -3 while decompressing data: invalid block type
documents: 239kd [04:26, 897d/s]
files: 0.00f [04:26, ?f/s]
Because gunzip fails on wiki_00.gz, I cannot proceed to the taggers step.

@Awyshw

Awyshw commented Mar 28, 2024

Even with Python 3.11.8, I get the same error:
Found 1 files to process
files: 0.00f [04:25, ?f/s][2024-02-29 16:53:56 SpawnPoolWorker-32.dolma.WikiExtractorParallel WARNING] Failed to process wikipedia/wiki_20231001_simple/AA/wiki_00.gz: Error -3 while decompressing data: invalid block type
documents: 239kd [04:26, 897d/s]
files: 0.00f [04:26, ?f/s]
Because gunzip fails on wiki_00.gz, I cannot proceed to the taggers step.

@soldni I think this needs to be fixed, please check it.

@soldni
Member

soldni commented Apr 5, 2024

I remain unable to reproduce this issue on my side; I would need more info.

@yeshouxiaobai

@soldni
I also get this bug:
python scripts/make_wikipedia.py --output ./wikipedia_zh --date 20240401 --lang zh --process 1
Found 1 files to process
files: 0.00f [1:00:20, ?f/s] [2024-04-06 22:45:55 SpawnPoolWorker-3.dolma.WikiExtractorParallel WARNING] Failed to process wikipedia_zh/wiki_20240401_zh/AA/wiki_00.gz: Error -3 while decompressing data: invalid block type
documents: 1.40Md [1:00:20, 387d/s]

wikiextractor: 3.0.6

@yeshouxiaobai

I updated wikiextractor from 3.0.6 to 3.0.7, which resolved the error "Error -3 while decompressing data: invalid block type". But now I get: Error -3 while decompressing data: invalid stored block lengths

@huangwei2913

huangwei2913 commented Apr 8, 2024 via email

@RogersSteve

I updated wikiextractor from 3.0.6 to 3.0.7, which resolved the error "Error -3 while decompressing data: invalid block type". But now I get: Error -3 while decompressing data: invalid stored block lengths

Have you solved this problem? I ran into it too, and it prevents me from getting to the tagger step.
