X-WikiRE

This tool provides a semi-automated way to create the WikiReading dataset, as described in the work of Hewlett et al. Some already-built datasets are available in their repository.

Requirements

  1. MongoDB
  2. Python

Procedure

Required files

  1. Download the Wikidata JSON dump from here
  2. Download the Wikipedia XML dump from here
  3. Download the language-specific page_props.sql dump from the Wikipedia dumps here
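
These dumps are several gigabytes, so stream them to disk. As an optional convenience, here is a minimal download sketch; the URL and output file name are placeholders for whichever dump you actually need.

    import shutil
    import urllib.request

    # Placeholder URL: substitute the dump you need from dumps.wikimedia.org
    DUMP_URL = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"
    OUT_FILE = "wikidata_dump.json.bz2"

    # Stream to disk so the multi-gigabyte file never has to fit in memory
    with urllib.request.urlopen(DUMP_URL) as response, open(OUT_FILE, "wb") as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)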

Data Processing

  1. Build the mapping dict between Wikipedia IDs and Wikidata IDs using wiki_prop.py (the underlying idea is sketched after this list)
  2. Transform the XML dump to JSON using segment_wiki.py (a custom version of Gensim's script, described here)
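
wiki_prop.py is the script that builds this mapping in the repository; the sketch below only illustrates the idea behind it and is not the script's actual interface. It assumes the standard page_props.sql layout, where rows with the property name wikibase_item link a Wikipedia page ID to a Wikidata Q-id; the file names are placeholders.

    import gzip
    import json
    import re

    # Rows in page_props.sql look like: (12,'wikibase_item','Q6199',NULL)
    # Only the 'wikibase_item' property links a page ID to a Wikidata Q-id.
    ROW_RE = re.compile(r"\((\d+),'wikibase_item','(Q\d+)'")


    def build_mapping(sql_dump_path):
        """Return a dict {wikipedia_page_id: wikidata_id} from a page_props.sql(.gz) dump."""
        mapping = {}
        opener = gzip.open if sql_dump_path.endswith(".gz") else open
        with opener(sql_dump_path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if not line.startswith("INSERT INTO"):
                    continue
                for page_id, wikidata_id in ROW_RE.findall(line):
                    mapping[int(page_id)] = wikidata_id
        return mapping


    if __name__ == "__main__":
        mapping = build_mapping("enwiki-latest-page_props.sql.gz")
        with open("wikipedia_to_wikidata.json", "w", encoding="utf-8") as out:
            json.dump(mapping, out)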

Data import

  1. Import the Wikidata dump into MongoDB in its own collection using:
    mongoimport --db WikiReading --collection wikidata --file wikidata_dump.json --jsonArray
  2. Create an index on the "id" field:
    db.wikidata.createIndex({"id": 1})
    
  3. Import the JSON Wikipedia dump into MongoDB in its own collection (a minimal pymongo sketch follows this list)
  4. Create an index on the "wikidata_id" field of the Wikipedia collection (adjust the collection name to match your import):
    db.wikipedia.createIndex({"wikidata_id": 1})
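
There is no single command for step 3 because it depends on the shape of the segmented dump. Below is a minimal pymongo sketch, assuming the output of segment_wiki.py is a gzipped JSON Lines file (one article object per line) and that the target collection is named wikipedia; adjust both to your setup.

    import gzip
    import json

    from pymongo import MongoClient

    BATCH_SIZE = 1000  # insert in batches to keep memory bounded

    client = MongoClient("mongodb://localhost:27017")
    collection = client["WikiReading"]["wikipedia"]  # collection name is an assumption

    batch = []
    with gzip.open("wiki_dump.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) >= BATCH_SIZE:
                collection.insert_many(batch)
                batch = []

    if batch:  # flush the final partial batch
        collection.insert_many(batch)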
    

POS Tagger training

  1. Train a POS tagger for the desired language using this and the training data from Universal Dependencies (a generic example is sketched below)
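
The linked tool and its exact training procedure are not reproduced here. Purely as an illustration of this step, the sketch below trains NLTK's averaged-perceptron tagger on a Universal Dependencies CoNLL-U training file; the file names are placeholders, and this tagger is only a stand-in for whichever one you choose to train.

    from nltk.tag.perceptron import PerceptronTagger


    def read_conllu(path):
        """Yield sentences as lists of (token, UPOS) pairs from a CoNLL-U file."""
        sentence = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    if sentence:
                        yield sentence
                        sentence = []
                    continue
                if line.startswith("#"):
                    continue
                cols = line.split("\t")
                if "-" in cols[0] or "." in cols[0]:
                    continue  # skip multi-word tokens and empty nodes
                sentence.append((cols[1], cols[3]))  # FORM and UPOS columns
        if sentence:
            yield sentence


    # load=False avoids loading NLTK's pre-trained English model
    tagger = PerceptronTagger(load=False)
    train_sents = list(read_conllu("xx_ud-train.conllu"))  # placeholder UD file name
    tagger.train(train_sents, save_loc="pos_tagger.pickle", nr_iter=5)

    # Quick sanity check on a few tokens
    print(tagger.tag(["a", "short", "test"]))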

Cite

@inproceedings{abdou-etal-2019-x,
    title = "X-{W}iki{RE}: A Large, Multilingual Resource for Relation Extraction as Machine Comprehension",
    author = "Abdou, Mostafa  and
      Sas, Cezar  and
      Aralikatte, Rahul  and
      Augenstein, Isabelle  and
      S{\o}gaard, Anders",
    booktitle = "Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-6130",
    doi = "10.18653/v1/D19-6130",
    pages = "265--274",
    abstract = "Although the vast majority of knowledge bases (KBs) are heavily biased towards English, Wikipedias do cover very different topics in different languages. Exploiting this, we introduce a new multilingual dataset (X-WikiRE), framing relation extraction as a multilingual machine reading problem. We show that by leveraging this resource it is possible to robustly transfer models cross-lingually and that multilingual support significantly improves (zero-shot) relation extraction, enabling the population of low-resourced KBs from their well-populated counterparts.",
}
