how do I parse wikipedia dump file? #294

sl2902 · 2022-10-17T18:44:01Z

Thanks for the library!

I have the latest XML dump file and would like to use your library to parse the infoboxes from it. However, I don't see any function for streaming the file. Could you share an example of how to pass the content of a page to mwparserfromhell.parse(text) to extract its infoboxes?

If it helps, this is what I have so far:

# iter_lines() and strip_tag_name() are my own helpers (not shown here)
for _, elem in iter_lines():
    tag = strip_tag_name(elem.tag)
    print(tag)
    if tag == 'text':
        print(elem.text)  # the raw wikitext of the revision

iter_lines() is a function that uses ET.iterparse() to parse the XML incrementally; it returns a generator.
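For reference, a generator of this kind might look like the sketch below. It is not the code from this post: `iter_pages` and `local_name` are hypothetical names, the in-memory sample stands in for a real dump file, and it assumes a dump with one revision per page (with full-history dumps only the last revision's text of each page would be yielded). It uses only the standard library; each yielded wikitext string could then be handed to mwparserfromhell.parse().

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page>, reading incrementally."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            root.clear()  # drop finished pages so memory use stays flat


# Tiny in-memory stand-in for a dump file (assumption, for illustration only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>A</title><revision><text>{{Infobox person|name=A}}</text></revision></page>
<page><title>B</title><revision><text>plain text</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, wikitext in pages:
    # In real use: parsed = mwparserfromhell.parse(wikitext)
    print(page_title, len(wikitext))
```

With a real dump, the same generator would be called on a file opened in binary mode, for example `iter_pages(open(path, "rb"))`, where `path` is a placeholder.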

lahwaacz · 2022-10-19T13:09:32Z

This library parses the wikitext only. You need to use another library to parse the XML file to get the wikitext. See e.g. https://stackoverflow.com/questions/16533153/parse-xml-dump-of-a-mediawiki-wiki

sl2902 · 2022-10-19T16:25:59Z

The answer at that link loads the entire file into memory, which is not feasible with the full dump.

lahwaacz · 2022-10-19T19:39:22Z

Then you need to find a different parser.

EvanGranthamBrown · 2022-10-26T15:43:02Z

Check out mwxml, a library designed for this specific task (parsing Wikipedia XML dumps):

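import mwparserfromhell
import mwxml

file_location = "/path/to/wikipedia/dump.xml"

# Dump.from_file takes an open file object and reads pages lazily
dump = mwxml.Dump.from_file(open(file_location))

for page in dump:
    for revision in page:
        # revision.text is the raw wikitext of this revision
        parsed = mwparserfromhell.parse(revision.text)
        # do stuff with parsed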

The mwxml Dump class is an iterator which reads pages one at a time, so you can avoid loading the whole file at once.

dpriskorn mentioned this issue Jan 13, 2023

as a developer I want to download a file with articles from the latest enwiki dump and play with it to understand the format internetarchive/iari#462

Closed
