
wiki2qid

A program for efficiently generating mappings between Wikipedia's titles, Wikipedia's page IDs, and Wikidata's QIDs.


This is effectively a reimplementation of the wikimapper library in Rust. The major difference is that wiki2qid generates an Apache Avro file instead of a SQLite database, which lets you efficiently load the mapping data into any data structure you want. Incidentally, this also means that wiki2qid does not support querying the mappings.

Note: Some Wikipedia pages are redirects and therefore map to the same QID (i.e., it is a many-to-one relationship).

Note: Some Wikipedia pages don't have corresponding Wikidata items (i.e., their QIDs are null).

Usage

You can install wiki2qid by running the following command:

cargo install wiki2qid

Of course, you can also build it from source.

wiki2qid requires three input files: the page, page_props, and redirect SQL table dumps. You can download them with the following commands:

wget https://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-page.sql.gz
wget https://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-page_props.sql.gz
wget https://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-redirect.sql.gz

Replace ${LANGUAGE} with a two-letter language code (e.g., "en", "hr").

After decompressing the SQL table dumps (e.g., with gunzip), you can extract the mapping data with the following command:

wiki2qid --input-page "${LANGUAGE}wiki-latest-page.sql" \
         --input-page_props "${LANGUAGE}wiki-latest-page_props.sql" \
         --input-redirect "${LANGUAGE}wiki-latest-redirect.sql" \
         --output wiki2qid.avro

The schema of the output is defined by the following JSON:

{
    "type": "record",
    "name": "wiki2qid",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "pageid", "type": "int"},
        {"name": "qid", "type": ["null", "int"]}
    ]
}
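
Since the output is meant to be loaded into whatever data structure you need, here is a minimal sketch of reading it in Rust. It assumes the apache-avro and serde crates (neither is a dependency of wiki2qid itself, and the file name is just the example output from above). It builds both a title-to-QID map and a reverse QID-to-titles map, illustrating the many-to-one relationship and the null QIDs noted earlier:

use std::collections::HashMap;
use std::fs::File;

use apache_avro::{from_value, Reader};
use serde::Deserialize;

// Mirrors the Avro schema above.
#[derive(Debug, Deserialize)]
struct Mapping {
    title: String,
    pageid: i32,
    qid: Option<i32>, // null when the page has no Wikidata item
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The schema is embedded in the Avro file, so the reader needs no extra setup.
    let reader = Reader::new(File::open("wiki2qid.avro")?)?;

    // Forward map: title -> QID. Redirects make this many-to-one.
    let mut title_to_qid: HashMap<String, i32> = HashMap::new();
    // Reverse map: QID -> all titles (canonical page plus redirects).
    let mut qid_to_titles: HashMap<i32, Vec<String>> = HashMap::new();

    for value in reader {
        let mapping: Mapping = from_value(&value?)?;
        // Skip pages without a corresponding Wikidata item.
        if let Some(qid) = mapping.qid {
            title_to_qid.insert(mapping.title.clone(), qid);
            qid_to_titles.entry(qid).or_default().push(mapping.title);
        }
    }

    println!("{} titles map to {} QIDs", title_to_qid.len(), qid_to_titles.len());
    Ok(())
}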

Helper Scripts

To help with this, there are two helper scripts in the helpers/ directory.

You can use them by first downloading and decompressing the data with the following command:

./download --download-dir ${DOWNLOAD_DIR} --language ${LANGUAGE_1} --language ${LANGUAGE_2}

You can pass in any number of languages.

After you've done that, you can generate the mappings with the following command:

./generate --download-dir ${DOWNLOAD_DIR} --output-dir ${OUTPUT_DIR} --output-filename ${OUTPUT_FILENAME} --language ${LANGUAGE_1} --language ${LANGUAGE_2}

The argument --output-filename is optional and defaults to wiki2qid.avro.

You can find the mapping data here: ${OUTPUT_DIR}/${LANGUAGE_i}/${OUTPUT_FILENAME}.

Performance

wiki2qid uses a single thread. On the English dump from March 2023, containing ~6,600,000 articles, it takes ~1.5 minutes to complete with peak memory usage of ~11GB on an AMD Ryzen Threadripper 3970X CPU and an SSD.
