PersianQuAD: The Native Question Answering Dataset for the Persian Language

To address the need for a high-quality QA dataset for the Persian language, we propose a model for creating datasets for deep-learning-based QA systems. We apply the proposed model to create PersianQuAD, the first native question answering dataset for the Persian language. PersianQuAD contains approximately 20,000 "question, paragraph, answer" triplets on Persian Wikipedia articles and is the first large-scale native QA dataset for Persian created by native annotators.

The proposed model consists of four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidate test set preparation, and 4) data quality monitoring. We analysed PersianQuAD and showed that it contains questions of varying types and difficulties, and is therefore a good representative of real-world questions in the Persian language. We built three QA systems using MBERT, ALBERT-FA and ParsBERT. The best system uses MBERT and achieves an F1 score of 82.97% and an Exact Match of 78.8%. The results show that the resulting dataset works well for training deep-learning-based QA systems. We have made our dataset and QA models freely available and hope that this encourages the development of new QA datasets and systems for other languages and leads to further advances in machine comprehension.

Dataset

Download

The dataset is available for download from the Dataset directory. The statistics of PersianQuAD are shown below:

Split	No. of questions	No. of candidate answers	Avg. question length	Avg. answer length
Train	18567	1	10.7	2.6
Test	1000	3	10.5	2.3
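Statistics of this kind can be recomputed from the data with a few lines of Python. The following is a minimal sketch, not the authors' script: it assumes each example is a dict with a `question` string and an `answers` list (a hypothetical schema, the actual files in the Dataset directory may be laid out differently), and it assumes lengths are counted in whitespace-separated tokens, which the table does not state.

```python
def split_statistics(examples):
    """Summarize one split.

    `examples` is a list of dicts with the hypothetical keys
    'question' (str) and 'answers' (list of str).
    Lengths are counted in whitespace-separated tokens (an assumption).
    """
    n_questions = len(examples)
    question_lengths = [len(ex["question"].split()) for ex in examples]
    # Flatten all candidate answers across the split.
    answers = [ans for ex in examples for ans in ex["answers"]]
    answer_lengths = [len(ans.split()) for ans in answers]
    return {
        "questions": n_questions,
        "avg_candidates": len(answers) / n_questions,
        "avg_question_len": sum(question_lengths) / n_questions,
        "avg_answer_len": sum(answer_lengths) / len(answers),
    }


# Toy data for illustration only; not taken from PersianQuAD.
toy_split = [
    {"question": "Where is Isfahan located ?", "answers": ["central Iran"]},
    {"question": "Who wrote the Shahnameh ?", "answers": ["Ferdowsi", "Abolqasem Ferdowsi"]},
]
print(split_statistics(toy_split))
```

Applied to a real split, `avg_candidates` would be expected to match the "No. of candidate answers" column, provided the schema assumption holds.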

The distribution of question types in PersianQuAD is shown below:

Question Word	Distribution
What	28.14%
How	15.24%
When	10.70%
Where	13.60%
Who	16.50%
Which	15.26%
Why	0.92%

Model

You can train and test the proposed model by running Main.ipynb in the Google Colab environment. First download the repository and extract it to your Google Drive, then open Main.ipynb in Google Colab and train your models.

Evaluation

We built three QA systems, one for each of the pre-trained language models examined (MBERT, ALBERT-FA, ParsBERT). We trained each system on the training split of PersianQuAD and evaluated it on the test split, using two widely used automatic evaluation metrics: Exact Match and F1.

Dataset	Model	Exact Match	F1 measure
PersianQuAD	Human	95.00%	96.49%
PersianQuAD	ALBERT-FA	74.90%	79.25%
PersianQuAD	ParsBERT	73.80%	79.08%
PersianQuAD	MBERT	78.80%	82.97%
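For readers unfamiliar with the two metrics, the sketch below shows how Exact Match and token-level F1 are commonly computed for extractive QA. It is an illustration rather than the evaluation code used for the table above. Two assumptions are made: answers are normalized only by lowercasing and whitespace splitting (no Persian-specific normalization), and for test questions with several candidate answers the score is the maximum over the candidates, following the usual SQuAD-style convention. The function names are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, references):
    """Score a prediction against several candidate answers (assumed: take the max)."""
    return max(metric(prediction, ref) for ref in references)
```

With three candidate answers per test question, a call such as `best_over_references(f1_score, predicted_answer, candidate_answers)` would give the per-question score, and averaging over all questions would give the corpus-level figure.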

Citation

Plain

A. Kazemi, J. Mozafari and M. A. Nematbakhsh, "PersianQuAD: The Native Question Answering Dataset for the Persian Language," in IEEE Access, vol. 10, pp. 26045-26057, 2022, doi: 10.1109/ACCESS.2022.3157289.

BibTeX

@ARTICLE{PersianQuAD-Access,
    author={Kazemi, Arefeh and Mozafari, Jamshid and Nematbakhsh, Mohammad Ali},
    journal={IEEE Access},
    title={PersianQuAD: The Native Question Answering Dataset for the Persian Language},
    year={2022},
    volume={10},
    number={},
    pages={26045-26057},
    doi={10.1109/ACCESS.2022.3157289}
}
