fails on the first file #292
Comments
I had the same issue and @vsraptor's modification fixed it for me. Thank you for posting. While you are in that file, you could also replace the Bzip2 output compression with Gzip (adding an
def open(self, filename):
    if self.compress:
        # return bz2.BZ2File(filename + '.bz2', 'w')
        return gzip.GzipFile(filename + '.gz', mode='w')
    else:
        return open(filename, 'w')
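For illustration, here is a minimal sketch of how a Gzip-backed `open` method like the one above might be exercised on its own. The class name `CompressedWriter` and the helper `roundtrip` are hypothetical and not part of WikiExtractor. The sketch also assumes the pages are written as `str`, so it opens the file in text mode (`'wt'`) rather than the binary mode shown above.

```python
import gzip
import os
import tempfile


class CompressedWriter:
    """Hypothetical stand-in for the writer class being patched."""

    def __init__(self, compress=True):
        self.compress = compress

    def open(self, filename):
        if self.compress:
            # Text mode ('wt') lets callers write str; gzip encodes it on write.
            return gzip.open(filename + '.gz', mode='wt', encoding='utf-8')
        # Inside a method, the bare name `open` still refers to the builtin.
        return open(filename, 'w', encoding='utf-8')


def roundtrip(text):
    """Write text through CompressedWriter and read it back (illustration only)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'wiki_00')
        with CompressedWriter().open(path) as f:
            f.write(text)
        with gzip.open(path + '.gz', mode='rt', encoding='utf-8') as f:
            return f.read()
```

Opening in text mode is only one option. If the file stays in binary mode, as in the snippet above, the caller has to encode each string to bytes before writing.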
I noticed that the last compressed file created (as given by I instrumented the It also works when using
INFO: Preprocessed 22100000 pages
INFO: Preprocessed 22200000 pages
INFO: Loaded 738901 templates in 4795.6s
INFO: Starting page extraction from enwiki-latest-pages-articles.xml.bz2.
INFO: Using 7 extract processes.
Process ForkProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/my/py38/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 494, in reduce_process
    output.write(ordering_buffer.pop(next_ordinal))
  File "/my/py38/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 173, in write
    self.file.write(data)
  File "/usr/lib/python3.8/bz2.py", line 245, in write
    compressed = self._compressor.compress(data)
TypeError: a bytes-like object is required, not 'str'
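The traceback points at a `str` being passed to `bz2.BZ2File.write`, which in binary mode only accepts bytes. The modification referred to in the comments is not reproduced in this thread, so the following is only a sketch of one common way to avoid the error: encode the text before writing. The helper name `write_page` is hypothetical, and an in-memory buffer stands in for the output file.

```python
import bz2
import io


def write_page(stream, text):
    # A binary-mode BZ2File only accepts bytes, so encode the str first.
    stream.write(text.encode('utf-8'))


buffer = io.BytesIO()
with bz2.BZ2File(buffer, 'w') as f:
    try:
        # Reproduces the failure from the traceback: str passed to a binary stream.
        f.write('some page text')
    except TypeError as exc:
        error = str(exc)
    # The same text goes through once it is encoded.
    write_page(f, 'some page text')

# Decompress the buffer to confirm the text survived the round trip.
restored = bz2.decompress(buffer.getvalue()).decode('utf-8')
```

An alternative under the same assumption is to open the compressed file in text mode, for example `bz2.open(filename, 'wt', encoding='utf-8')`, so that the stream does the encoding itself.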
should be :