WikiNLI

The dataset and code for the paper Mining Knowledge for Natural Language Inference from Wikipedia Categories.

Dataset

The training and development datasets are under data/WikiNLI.

In the paper we sampled 100k instances for our experiments; the 100k version of the training set can be found under data/WikiNLI/100k.
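
To inspect the data programmatically, a minimal sketch is shown below. The file name and the tab-separated three-column layout are assumptions for illustration only; check the actual files under data/WikiNLI for the real format.

import csv

# Hypothetical loader for a WikiNLI split. The file name and the
# three-column tab-separated layout are assumptions, not the documented
# format; inspect the files under data/WikiNLI before relying on this.
def load_wikinli(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 3:
                examples.append({"premise": row[0],
                                 "hypothesis": row[1],
                                 "label": row[2]})
    return examples

examples = load_wikinli("data/WikiNLI/100k/train.tsv")  # hypothetical path
print(len(examples), examples[:2])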

Other languages

  • WikiNLI constructed from the Wikipedias of other languages is under data/mWikiNLI. We provide four versions: Chinese (zh), French (fr), Arabic (ar), and Urdu (ur).
  • WikiNLI constructed by translating English WikiNLI into other languages is under data/trWikiNLI, in the same four languages: Chinese (zh), French (fr), Arabic (ar), and Urdu (ur).

WordNet and WikiData

  • The WordNet and WikiData resources used in the experiments described in the WikiNLI paper are also provided.

Code

To pretrain a Transformer-based model on WikiNLI with the Hugging Face transformers framework, use the following scripts (the first for bert-large-uncased, the second for roberta-large).

python code/run_wikinli.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased \
    --task_name wikinli \
    --num_train_examples 500000 \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir ./data \
    --max_seq_length 40 \
    --per_gpu_eval_batch_size 32 \
    --per_gpu_train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 1.0 \
    --save_steps 3000 \
    --logging_steps 3000 \
    --eval_all_checkpoints \
    --output_dir ./saved_outputs/bert-large

python code/run_wikinli.py \
    --model_type roberta \
    --model_name_or_path roberta-large \
    --task_name wikinli \
    --num_train_examples 500000 \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir ./data \
    --max_seq_length 40 \
    --per_gpu_eval_batch_size 32 \
    --per_gpu_train_batch_size 32 \
    --learning_rate 1e-5 \
    --warmup_steps 1000 \
    --num_train_epochs 3.0 \
    --save_steps 3000 \
    --logging_steps 3000 \
    --eval_all_checkpoints \
    --output_dir ./saved_outputs/roberta-large 

After pretraining finishes, modify the saved model by removing the top linear (classification) layer:

mv [PATH]/pytorch_model.bin [PATH]/raw_pytorch_model.bin
python code/modify_saved_model.py [PATH]/raw_pytorch_model.bin [PATH]/pytorch_model.bin
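
The script code/modify_saved_model.py performs this step. The snippet below is only a rough sketch of the idea, not the repository's implementation; it assumes the classification head's parameters live under state-dict keys starting with "classifier", which depends on the model class.

import sys
import torch

# Rough sketch: load the checkpoint, drop the classification-head weights,
# and save the rest. The "classifier" key prefix is an assumption; see
# code/modify_saved_model.py for the authoritative version.
src, dst = sys.argv[1], sys.argv[2]
state_dict = torch.load(src, map_location="cpu")
stripped = {k: v for k, v in state_dict.items()
            if not k.startswith("classifier")}
torch.save(stripped, dst)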

A WikiNLI-pretrained roberta-large model can be downloaded from https://drive.google.com/file/d/1RJgewj2TPXI2lDNuxEO1gq9aSTkdRxiZ/view?usp=sharing
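
Once downloaded and unpacked, the checkpoint can be loaded like any other transformers checkpoint, for example:

from transformers import RobertaModel, RobertaTokenizer

# "./wikinli-roberta-large" is a hypothetical path to the unpacked download.
# This assumes the directory contains pytorch_model.bin and a config.json;
# if config.json is missing, load a RobertaConfig for "roberta-large" and
# pass it via the config= argument instead.
model = RobertaModel.from_pretrained("./wikinli-roberta-large")
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")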

To evaluate on NLI-related tasks after pretraining on WikiNLI, follow the Hugging Face instructions for fine-tuning and evaluating on GLUE tasks, pointing the model path at the WikiNLI-pretrained checkpoint; a rough example is sketched below.
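
For example, fine-tuning on MNLI with the run_glue.py example script from transformers would look roughly like the command below. The model path and hyperparameters are placeholders rather than the settings used in the paper, and the exact flag names depend on the transformers version.

python run_glue.py \
    --model_name_or_path ./saved_outputs/roberta-large \
    --task_name MNLI \
    --do_train \
    --do_eval \
    --data_dir ./glue_data/MNLI \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate 1e-5 \
    --num_train_epochs 3.0 \
    --output_dir ./saved_outputs/roberta-large-mnli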

Dependencies

  • PyTorch 1.4.0
  • transformers 3.1.0
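
Both can be installed with pip, for example:

pip install torch==1.4.0 transformers==3.1.0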

Citation

@inproceedings{chen2020mining,
    title = {Mining Knowledge for Natural Language Inference from Wikipedia Categories},
    author = {Mingda Chen and Zewei Chu and Karl Stratos and Kevin Gimpel},
    booktitle = {Findings of {EMNLP}},
    year = {2020},
}
