A program for efficiently generating mappings between Wikipedia page titles, Wikipedia page IDs, and Wikidata QIDs.
This is effectively a reimplementation of the wikimapper library in Rust. A major difference is that `wiki2qid` generates an Apache Avro file instead of a SQLite database, which lets you efficiently load the mapping data into any data structure you want. The flip side is that `wiki2qid` itself does not support querying the mappings.
Note: Some Wikipedia pages are redirects and therefore map to the same QID as their redirect targets (i.e., the mapping is many-to-one).
Note: Some Wikipedia pages don't have corresponding Wikidata items (i.e., their QIDs are null).
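These two properties can be illustrated with a short sketch (plain Python; the records below are hypothetical stand-ins shaped like the output described in this README, with made-up titles and IDs):

```python
# Hypothetical records shaped like wiki2qid's output
# (title, pageid, qid); all values are made up for illustration.
records = [
    {"title": "UK", "pageid": 100, "qid": 145},              # redirect page
    {"title": "United_Kingdom", "pageid": 200, "qid": 145},  # redirect target
    {"title": "Untagged_page", "pageid": 300, "qid": None},  # no Wikidata item
]

# Load the mapping into a plain dict keyed by title.
title_to_qid = {r["title"]: r["qid"] for r in records}

# Many-to-one: the redirect and its target share a QID.
assert title_to_qid["UK"] == title_to_qid["United_Kingdom"]

# Null QID: a page without a Wikidata item maps to None.
assert title_to_qid["Untagged_page"] is None
```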
You can install `wiki2qid` by running the following command:

```shell
cargo install wiki2qid
```
Of course, you can also build it from source.
`wiki2qid` requires three input files: the page, page_props, and redirect SQL table dumps. You can download them with the following commands:

```shell
wget https://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-page.sql.gz
wget https://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-page_props.sql.gz
wget https://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-redirect.sql.gz
```
Replace `${LANGUAGE}` with a two-letter language code (e.g., "en", "hr").
After decompressing the SQL table dumps, you can extract the mapping data with the following command:

```shell
wiki2qid --input-page "${LANGUAGE}wiki-latest-page.sql" \
         --input-page_props "${LANGUAGE}wiki-latest-page_props.sql" \
         --input-redirect "${LANGUAGE}wiki-latest-redirect.sql" \
         --output wiki2qid.avro
```
The schema of the output is defined by the following JSON:

```json
{
    "type": "record",
    "name": "wiki2qid",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "pageid", "type": "int"},
        {"name": "qid", "type": ["null", "int"]}
    ]
}
```
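Since `qid` is a union of `null` and `int`, any consumer of the output has to handle the missing case explicitly. A minimal sketch (plain Python, standard library only; not part of wiki2qid) that inspects this schema:

```python
import json

# The output schema from above, embedded as a JSON string.
SCHEMA = """
{
    "type": "record",
    "name": "wiki2qid",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "pageid", "type": "int"},
        {"name": "qid", "type": ["null", "int"]}
    ]
}
"""

schema = json.loads(SCHEMA)

# Each record has exactly these three fields, in this order.
field_names = [field["name"] for field in schema["fields"]]
assert field_names == ["title", "pageid", "qid"]

# "qid" is a union type: it is either null or an int, so code
# reading the file must treat it as optional.
qid_type = schema["fields"][2]["type"]
assert qid_type == ["null", "int"]
```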
To help with this, there are two helper scripts in the `helpers/` directory.
You can use them by first downloading and decompressing the data with the following command:

```shell
./download --download-dir ${DOWNLOAD_DIR} --language ${LANGUAGE_1} --language ${LANGUAGE_2}
```
You can pass in any number of languages.
After you've done that, you can generate the mappings with the following command:

```shell
./generate --download-dir ${DOWNLOAD_DIR} --output-dir ${OUTPUT_DIR} --output-filename ${OUTPUT_FILENAME} --language ${LANGUAGE_1} --language ${LANGUAGE_2}
```
The `--output-filename` argument is optional and defaults to `wiki2qid.avro`. You can find the mapping data at `${OUTPUT_DIR}/${LANGUAGE_i}/${OUTPUT_FILENAME}`.
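For example, the per-language output paths can be reconstructed like this (plain Python; the directory name and language list are hypothetical stand-ins for the variables above):

```python
from pathlib import Path

output_dir = Path("mappings")   # hypothetical ${OUTPUT_DIR}
filename = "wiki2qid.avro"      # the default ${OUTPUT_FILENAME}
languages = ["en", "hr"]        # the languages passed to ./generate

# One subdirectory per language, each holding its own Avro file.
paths = [output_dir / language / filename for language in languages]

assert [p.as_posix() for p in paths] == [
    "mappings/en/wiki2qid.avro",
    "mappings/hr/wiki2qid.avro",
]
```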
`wiki2qid` uses a single thread. On the English dump from March 2023, containing ~6,600,000 articles, it takes ~1.5 minutes to complete, with peak memory usage of ~11 GB, on an AMD Ryzen Threadripper 3970X CPU and an SSD.