
how do I parse wikipedia dump file? #294

Open
sl2902 opened this issue Oct 17, 2022 · 4 comments

@sl2902

sl2902 commented Oct 17, 2022

Thanks for the library!

I have the latest XML dump file and would like to use your library to parse the infoboxes from it. However, I don't see any function to stream the file. Could you share an example of how to pass the content of a page to mwparserfromhell.parse(text) to extract its infoboxes?

In case it helps, this is what I have so far:

for _, elem in iter_lines():
    print(strip_tag_name(elem.tag))
    if strip_tag_name(elem.tag) == 'text':
        print(elem.text)

iter_lines() is a generator function that uses ET.iterparse() to parse the XML incrementally.
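For readers following along, here is a minimal sketch of what helpers like the ones described could look like, using only the standard library. The bodies of `iter_lines` and `strip_tag_name` are assumptions (the post does not show them), the `source` parameter is added for illustration, and the sample document is made up.

```python
import io
import xml.etree.ElementTree as ET


def strip_tag_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_lines(source):
    """Yield (event, element) pairs as each element is completed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        yield event, elem
        # Assumption: once a <page> has ended, the caller is done with it,
        # so its children can be released to keep memory use low.
        if strip_tag_name(elem.tag) == "page":
            elem.clear()


# Hypothetical miniature dump standing in for the real file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>{{Infobox person|name=Ada}}</text>
    </revision>
  </page>
</mediawiki>"""

texts = [
    elem.text
    for _, elem in iter_lines(io.BytesIO(SAMPLE))
    if strip_tag_name(elem.tag) == "text"
]
print(texts)
```

Each string collected this way is the wikitext of one revision and could be passed to mwparserfromhell.parse(). Note that the root element still keeps a reference to every (emptied) page, so a very long run may need additional cleanup.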

@lahwaacz
Contributor

This library parses the wikitext only. You need to use another library to parse the XML file to get the wikitext. See e.g. https://stackoverflow.com/questions/16533153/parse-xml-dump-of-a-mediawiki-wiki

@sl2902
Author

sl2902 commented Oct 19, 2022

The answer at that link loads the entire file into memory, which will not be possible with the dump.

@lahwaacz
Contributor

Then you need to find a different parser.

@EvanGranthamBrown

EvanGranthamBrown commented Oct 26, 2022

Check out mwxml, a library designed for this specific task (parsing Wikipedia XML dumps):

import mwparserfromhell
import mwxml

file_location = "/path/to/wikipedia/dump.xml"

# Dump.from_file takes an open file object and reads it lazily
dump = mwxml.Dump.from_file(open(file_location))

for page in dump:
    for revision in page:
        parsed = mwparserfromhell.parse(revision.text)
        # do stuff with parsed

The mwxml Dump class is an iterator which reads pages one at a time, so you can avoid loading the whole file at once.
