UQA: Corpus for Urdu Question Answering

Overview

UQA is a dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. It was generated by translating the Stanford Question Answering Dataset (SQuAD2.0) with the EATS technique. The goal is to support the development and evaluation of multilingual NLP systems for Urdu and to improve the cross-lingual transferability of existing models.

Dataset

The UQA dataset consists of translated contexts, questions, and answers. It includes both answerable and unanswerable questions, maintaining the structure and challenges of the original SQuAD2.0 dataset.
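
As a rough illustration of that structure, the sketch below flattens a SQuAD2.0-style JSON object into one record per question. It assumes the UQA files follow the standard SQuAD2.0 layout (`data` → `paragraphs` → `qas`, with an `is_impossible` flag); the helper name `iter_examples` and the toy Urdu sample are made up for this example and are not taken from the repository.

```python
def iter_examples(squad_json):
    """Yield one flat record per question from a SQuAD2.0-style dict."""
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    # Each answer is (text, character offset into context).
                    "answers": [(a["text"], a["answer_start"])
                                for a in qa.get("answers", [])],
                    # Unanswerable questions carry is_impossible=True.
                    "is_impossible": qa.get("is_impossible", False),
                }


# Toy sample (invented for illustration, not real UQA data).
context = "لاہور پاکستان کا ایک شہر ہے۔"
answer = "پاکستان"
sample = {
    "data": [{
        "title": "toy",
        "paragraphs": [{
            "context": context,
            "qas": [
                {"id": "q1", "question": "لاہور کس ملک میں ہے؟",
                 "answers": [{"text": answer,
                              "answer_start": context.index(answer)}],
                 "is_impossible": False},
                {"id": "q2", "question": "لاہور کی آبادی کتنی ہے؟",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }]
}

examples = list(iter_examples(sample))
for ex in examples:
    print(ex["id"], ex["is_impossible"], ex["answers"])
```

Keeping the character offset next to each answer text makes it easy to check that `context[start:start + len(text)]` still matches the answer after translation.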

Download

The dataset and fine-tuned models can be downloaded from the following link:

UQA Hugging Face

Code

This repository includes scripts used in the translation process, dataset generation, and model benchmarks. The main components are:

Translation scripts using the EATS technique
Evaluation scripts for model performance
Benchmark results on models like mBERT, XLM-RoBERTa, and mT5
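
The idea behind EATS (Enclose to Anchor, Translate, Seek, as named in the paper) is to keep the answer span locatable after translation. The sketch below shows one way such an enclose, translate, seek loop could look. It is not the repository's implementation: the `[[` / `]]` markers, the function names, and the `translate` callable are all hypothetical, and `str.upper` stands in for a real translation system purely so the example runs.

```python
OPEN, CLOSE = "[[", "]]"  # hypothetical anchor markers


def enclose(context, answer_start, answer_text):
    """Wrap the answer span in markers before translation."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def seek(translated):
    """Recover (clean_context, answer_text, answer_start) from marked text.

    Returns None if the markers did not survive translation.
    """
    start = translated.find(OPEN)
    if start == -1:
        return None
    end = translated.find(CLOSE, start + len(OPEN))
    if end == -1:
        return None
    answer = translated[start + len(OPEN):end]
    clean = translated[:start] + answer + translated[end + len(CLOSE):]
    # Text before the opening marker is unchanged, so `start` is also
    # the answer offset in the cleaned context.
    return clean, answer, start


def translate_example(context, answer_start, answer_text, translate):
    """Enclose the span, translate the marked text, then seek the span."""
    return seek(translate(enclose(context, answer_start, answer_text)))


# Stand-in "translator" (assumption): upper-casing preserves the markers.
src = "Lahore is a city in Pakistan."
result = translate_example(src, src.index("Pakistan"), "Pakistan", str.upper)
print(result)
```

Examples for which `seek` returns `None` (the translator dropped or altered the markers) would need to be discarded or handled separately.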

Model Performance

The following table summarizes the performance of various models tested on the UQA dataset. The metrics reported are Exact Match (EM) and F1 Score.

Model	Exact Match (EM)	F1 Score
mBERT	45.50%	64.72%
mT5-Small	52.37%	67.24%
mT5-Large	71.26%	84.20%
XLM-RoBERTa	65.67%	78.00%
XLM-RoBERTa-Large	72.24%	84.42%
XLM-RoBERTa-XL	74.56%	85.99%
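
For readers unfamiliar with the two metrics, the sketch below computes SQuAD-style Exact Match and token-level F1. It is a simplified illustration, not the evaluation script used for the table above: the normalization (strip ASCII and a few Urdu punctuation marks, collapse whitespace) and the function names are assumptions, and unanswerable questions are treated as having the empty string as their gold answer.

```python
from collections import Counter
import string

# ASCII punctuation plus a few common Urdu marks (assumed normalization).
PUNCT = set(string.punctuation) | {"۔", "،", "؟", "؛"}


def normalize(text):
    """Drop punctuation and collapse whitespace."""
    text = "".join(ch for ch in text if ch not in PUNCT)
    return " ".join(text.split())


def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))


def f1_score(pred, gold):
    """Token-level F1 between a prediction and one gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        # Both empty counts as a match (unanswerable question predicted as such).
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """predictions: id -> str; references: id -> list of gold strings.

    An empty gold list marks an unanswerable question.
    Scores are returned as percentages, taking the best gold per question.
    """
    em = f1 = 0.0
    for qid, golds in references.items():
        golds = golds or [""]
        pred = predictions.get(qid, "")
        em += max(exact_match(pred, g) for g in golds)
        f1 += max(f1_score(pred, g) for g in golds)
    n = len(references)
    return {"exact_match": 100 * em / n, "f1": 100 * f1 / n}


scores = evaluate(
    {"q1": "پاکستان۔", "q2": ""},
    {"q1": ["پاکستان"], "q2": []},
)
print(scores)
```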

Citation

Samee Arif, Sualeha Farid, Awais Athar, and Agha Ali Raza. 2024. UQA: Corpus for Urdu Question Answering. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17237–17244, Torino, Italia. ELRA and ICCL.

