UQA is a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. It was generated by translating the Stanford Question Answering Dataset (SQuAD2.0) with the EATS (Enclose to Anchor, Translate, Seek) technique. The dataset is intended as a resource for developing and testing multilingual NLP systems for Urdu and for improving the cross-lingual transferability of existing models.
The UQA dataset consists of translated contexts, questions, and answers. It includes both answerable and unanswerable questions, maintaining the structure and challenges of the original SQuAD2.0 dataset.
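As a rough illustration of that structure, the sketch below builds one record in the standard SQuAD2.0 JSON layout and counts answerable versus unanswerable questions. It assumes the UQA files keep the SQuAD2.0 field names (`data`, `paragraphs`, `qas`, `answers`, `answer_start`, `is_impossible`); the sample text and the helper names `count_questions` and `spans_are_consistent` are made up for this example and are not taken from the repository.

```python
# Hypothetical record in SQuAD2.0 layout. The field names follow the
# original SQuAD2.0 format; whether UQA keeps them unchanged is an assumption.
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "لاہور",
            "paragraphs": [
                {
                    "context": "لاہور پاکستان کا شہر ہے۔",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "پاکستان کا شہر کون سا ہے؟",
                            "answers": [{"text": "لاہور", "answer_start": 0}],
                            "is_impossible": False,
                        },
                        {
                            "id": "q2",
                            "question": "لاہور کی آبادی کتنی ہے؟",
                            "answers": [],
                            "is_impossible": True,
                        },
                    ],
                }
            ],
        }
    ],
}


def iter_questions(dataset):
    """Yield (context, qa) pairs from a SQuAD2.0-style dictionary."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield paragraph["context"], qa


def count_questions(dataset):
    """Return (answerable, unanswerable) question counts."""
    answerable = unanswerable = 0
    for _, qa in iter_questions(dataset):
        if qa.get("is_impossible", False):
            unanswerable += 1
        else:
            answerable += 1
    return answerable, unanswerable


def spans_are_consistent(dataset):
    """Check that every answer text occurs in its context at answer_start."""
    for context, qa in iter_questions(dataset):
        for answer in qa["answers"]:
            start = answer["answer_start"]
            if context[start:start + len(answer["text"])] != answer["text"]:
                return False
    return True
```

Running `spans_are_consistent` on a translated file is a cheap sanity check, because character offsets are the part of a SQuAD-style record most likely to break during translation.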
The dataset and fine-tuned models can be downloaded from the following link:
This repository includes scripts used in the translation process, dataset generation, and model benchmarks. The main components are:
- Translation scripts using the EATS technique
- Evaluation scripts for model performance
- Benchmark results on models like mBERT, XLM-RoBERTa, and mT5
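The idea behind EATS (Enclose to Anchor, Translate, Seek) can be sketched as follows: the answer span is enclosed in marker characters inside the context, the marked context is translated, and the markers are then sought in the translation to recover the translated answer and its offset. This is only a minimal sketch and not the repository's translation script. The marker strings, the function names, and the `translate` callable are all hypothetical; a real pipeline would plug in a machine translation system and handle cases where the markers are altered or dropped.

```python
from typing import Callable, Optional, Tuple

# Hypothetical anchor markers; the markers used by the actual scripts may differ.
OPEN, CLOSE = "<<", ">>"


def enclose(context: str, answer_start: int, answer_text: str) -> str:
    """Enclose: wrap the answer span in markers so it survives translation."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer span does not match the context")
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def seek(translated: str) -> Optional[Tuple[str, str, int]]:
    """Seek: locate the markers and return (clean_context, answer, answer_start).

    Returns None when either marker was lost during translation.
    """
    start = translated.find(OPEN)
    if start == -1:
        return None
    end = translated.find(CLOSE, start + len(OPEN))
    if end == -1:
        return None
    answer = translated[start + len(OPEN):end]
    clean = translated[:start] + answer + translated[end + len(CLOSE):]
    return clean, answer, start


def eats(
    context: str,
    answer_start: int,
    answer_text: str,
    translate: Callable[[str], str],  # hypothetical MT function, supplied by the caller
) -> Optional[Tuple[str, str, int]]:
    """Run the three steps: enclose, translate, seek."""
    marked = enclose(context, answer_start, answer_text)
    return seek(translate(marked))
```

For a quick check without a translation service, any string-to-string function can stand in for `translate`, for example `str.upper`; the returned offset should then point at the answer inside the returned context.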
The following table summarizes the performance of various models tested on the UQA dataset. The metrics reported are Exact Match (EM) and F1 Score.
| Model | Exact Match (EM) | F1 Score |
|---|---|---|
| mBERT | 45.50% | 64.72% |
| mT5-Small | 52.37% | 67.24% |
| mT5-Large | 71.26% | 84.20% |
| XLM-RoBERTa | 65.67% | 78.00% |
| XLM-RoBERTa-Large | 72.24% | 84.42% |
| XLM-RoBERTa-XL | 74.56% | 85.99% |
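For readers unfamiliar with the two metrics, here is a minimal sketch of how Exact Match and token-level F1 are typically computed for a single prediction in SQuAD-style evaluation. It assumes whitespace tokenization and only collapses repeated whitespace; the official SQuAD script additionally lowercases and strips punctuation and English articles, and the evaluation scripts in this repository may normalize Urdu text differently. The function names are chosen for this example.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse runs of whitespace (an assumed, deliberately simple normalization)."""
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # For unanswerable questions both sides are empty: count that as a match.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores such as those in the table are usually the mean of these per-question values, taking the maximum over gold answers when a question has several.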
Samee Arif, Sualeha Farid, Awais Athar, and Agha Ali Raza. 2024. UQA: Corpus for Urdu Question Answering. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17237–17244, Torino, Italia. ELRA and ICCL.