
fails on the first file #292

Open
vsraptor opened this issue Aug 9, 2022 · 2 comments
vsraptor commented Aug 9, 2022

INFO: Preprocessed 22100000 pages
INFO: Preprocessed 22200000 pages
INFO: Loaded 738901 templates in 4795.6s
INFO: Starting page extraction from enwiki-latest-pages-articles.xml.bz2.
INFO: Using 7 extract processes.
Process ForkProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/my/py38/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 494, in reduce_process
output.write(ordering_buffer.pop(next_ordinal))
File "/my/py38/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 173, in write
self.file.write(data)
File "/usr/lib/python3.8/bz2.py", line 245, in write
compressed = self._compressor.compress(data)
TypeError: a bytes-like object is required, not 'str'


    def write(self, data):
        self.reserve(len(data))
        if self.compress:
            self.file.write(data)
        else:
            self.file.write(data)

should be:

    def write(self, data):
        self.reserve(len(data))
        if self.compress:
            self.file.write(data.encode('utf8'))
        else:
            self.file.write(data)
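The fix follows from how the bz2 module behaves: a BZ2File opened with mode 'w' is a binary stream and accepts only bytes. A minimal standalone reproduction (file path is just a temp file for illustration):

```python
import bz2
import os
import tempfile

# Minimal reproduction of the traceback: BZ2File opened with mode 'w'
# is a binary stream, so it accepts only bytes -- passing str raises
# exactly the TypeError shown above.
path = os.path.join(tempfile.mkdtemp(), "out.bz2")
stream = bz2.BZ2File(path, 'w')
try:
    stream.write("some text")                # str -> TypeError
except TypeError as err:
    print("rejected str:", err)
stream.write("some text".encode('utf8'))     # bytes are accepted
stream.close()

# Round-trip: decompress and decode back to str
with bz2.open(path, 'rb') as f:
    print(f.read().decode('utf8'))           # -> some text
```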

rxm commented Oct 14, 2022

I had the same issue and @vsraptor's modification fixed it for me. Thank you for posting.

While you are in that file, you could also replace the Bzip2 output compression with Gzip (adding an import gzip under the import bz2 line). I work with compressed files downstream, and Gzip files are significantly faster to deal with, at a small cost in size.

    def open(self, filename):
        if self.compress:
            # return bz2.BZ2File(filename + '.bz2', 'w')
            return gzip.GzipFile(filename + '.gz', mode='w')
        else:
            return open(filename, 'w')
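One caveat with this swap: gzip.GzipFile with mode='w' is also a binary stream, just like BZ2File, so the .encode('utf8') fix above is still required. An alternative that sidesteps the encoding entirely is gzip.open in text mode, sketched here with a throwaway temp file:

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.gz")

# gzip.open in text mode ('wt') wraps the binary stream and handles
# the str -> bytes encoding itself, so plain str writes work.
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write("some text")

with gzip.open(path, 'rt', encoding='utf-8') as f:
    print(f.read())                          # -> some text
```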


rxm commented Oct 14, 2022

I noticed that the last compressed file created (as given by NextFile) when using the --compressed flag is incomplete. I have tried flushes, closes, and scattered sleeps, but I have not yet found where the problem is (this is using bz2 compression). Any ideas?

I instrumented the OutputSplitter class and found that OutputSplitter.close() is not called for the last file. There are also a few extra writes to the last file. Wikiextractor is a multiprocess script that has several processes reading the dump and one reduce_process writing the results. When it runs out of things to write, it terminates and leaves it to the calling process to close the OutputSplitter object, but by that point the two processes hold different copies of it. Adding an output.close() at the bottom of reduce_process closes the currently open file.
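The shape of that fix can be sketched as follows. This is a hypothetical simplification, not the actual wikiextractor source: the queue protocol, the Sink class, and the exact loop structure are stand-ins; only the trailing output.close() is the suggested change.

```python
import queue

# Hypothetical sketch of a reduce_process-style loop: it reorders
# results coming off a queue by ordinal, then closes the output once
# the sentinel arrives, so the last compressed file is finalized.
def reduce_process(output_queue, output):
    next_ordinal = 0
    ordering_buffer = {}
    while True:
        item = output_queue.get()
        if item is None:                  # sentinel: extractors are done
            break
        ordinal, text = item
        ordering_buffer[ordinal] = text
        while next_ordinal in ordering_buffer:
            output.write(ordering_buffer.pop(next_ordinal))
            next_ordinal += 1
    output.close()                        # the suggested fix

# Toy output object standing in for OutputSplitter
class Sink:
    def __init__(self):
        self.chunks, self.closed = [], False
    def write(self, data):
        self.chunks.append(data)
    def close(self):
        self.closed = True

q = queue.Queue()
for item in [(1, "b"), (0, "a"), (2, "c")]:
    q.put(item)
q.put(None)
sink = Sink()
reduce_process(q, sink)
print("".join(sink.chunks), sink.closed)  # -> abc True
```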

It also works when using gzip.GzipFile.
