sameearif/UQA

UQA: Corpus for Urdu Question Answering

Overview

UQA is a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. It was generated by translating the Stanford Question Answering Dataset (SQuAD2.0) into Urdu using the EATS technique. The goal is to provide a resource for developing and testing multilingual NLP systems for Urdu and for improving the cross-lingual transferability of existing models.

Dataset

The UQA dataset consists of translated contexts, questions, and answers. It includes both answerable and unanswerable questions, maintaining the structure and challenges of the original SQuAD2.0 dataset.
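To make the answerable/unanswerable distinction concrete, here is a minimal sketch of two records laid out in the SQuAD2.0 style. The field names (`is_impossible`, `answers`, `answer_start`) follow the original SQuAD2.0 JSON and are an assumption about UQA's files; the Urdu sentences are made up for illustration and are not taken from the dataset.

```python
# Hypothetical SQuAD2.0-style records; UQA's actual field names may differ.
context = "لاہور پاکستان کا ایک شہر ہے۔"
answer = "پاکستان"

answerable = {
    "question": "لاہور کس ملک میں ہے؟",
    "context": context,
    "is_impossible": False,
    # answer_start is a character offset into the context
    "answers": [{"text": answer, "answer_start": context.find(answer)}],
}

unanswerable = {
    "question": "لاہور کی آبادی کتنی ہے؟",
    "context": context,
    "is_impossible": True,
    "answers": [],  # no span in the context answers this question
}


def spans_are_consistent(record):
    """Check that each answer text sits at its stated offset in the context."""
    if record["is_impossible"]:
        return not record["answers"]
    if not record["answers"]:
        return False
    for ans in record["answers"]:
        start = ans["answer_start"]
        end = start + len(ans["text"])
        if start < 0 or record["context"][start:end] != ans["text"]:
            return False
    return True


print(spans_are_consistent(answerable), spans_are_consistent(unanswerable))
```

A check like this is useful after translation, because character offsets from the English source no longer apply to the Urdu context.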

Download

The dataset and fine-tuned models can be downloaded from the following link:

Code

This repository includes scripts used in the translation process, dataset generation, and model benchmarks. The main components are:

  • Translation scripts using the EATS technique
  • Evaluation scripts for model performance
  • Benchmark results on models like mBERT, XLM-RoBERTa, and mT5
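This README does not describe how EATS works. Assuming the name stands for Enclose to Anchor, Translate, Seek, the general idea of marker-based span preservation could be sketched as below. This is not the repository's code: the marker strings, the function names, and the `toy_translate` stub (a fixed lookup standing in for a real machine translation call) are all hypothetical.

```python
OPEN, CLOSE = "[[", "]]"  # hypothetical anchor markers


def enclose(context, answer_start, answer_text):
    """Wrap the answer span in markers before translation."""
    end = answer_start + len(answer_text)
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def toy_translate(text):
    """Stand-in for a machine translation system (assumption, not a real API)."""
    table = {
        "Lahore is a city in [[Pakistan]].": "لاہور [[پاکستان]] کا ایک شہر ہے۔",
    }
    return table[text]


def seek(translated):
    """Find the markers in the translated text and strip them.

    Returns (clean_context, answer_start, answer_text),
    or None if the markers did not survive translation.
    """
    start = translated.find(OPEN)
    if start == -1:
        return None
    end = translated.find(CLOSE, start + len(OPEN))
    if end == -1:
        return None
    answer = translated[start + len(OPEN):end]
    clean = translated[:start] + answer + translated[end + len(CLOSE):]
    return clean, start, answer


source = "Lahore is a city in Pakistan."
marked = enclose(source, source.find("Pakistan"), "Pakistan")
result = seek(toy_translate(marked))
print(result)
```

The point of the markers is that the answer offset in the translated context is recovered from the translation itself rather than guessed by aligning two separately translated strings.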

Model Performance

The following table summarizes the performance of various models tested on the UQA dataset. The metrics reported are Exact Match (EM) and F1 Score.

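| Model             | Exact Match (EM) | F1 Score |
|-------------------|------------------|----------|
| mBERT             | 45.50%           | 64.72%   |
| mT5-Small         | 52.37%           | 67.24%   |
| mT5-Large         | 71.26%           | 84.20%   |
| XLM-RoBERTa       | 65.67%           | 78.00%   |
| XLM-RoBERTa-Large | 72.24%           | 84.42%   |
| XLM-RoBERTa-XL    | 74.56%           | 85.99%   |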
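For readers unfamiliar with the two metrics, the sketch below shows how EM and token-level F1 are commonly computed for a single prediction in SQuAD-style evaluation. It is a simplified illustration, not the repository's evaluation script: normalization here only collapses whitespace (the English-specific article and punctuation stripping of the official SQuAD script is omitted), and tokens are split on whitespace.

```python
from collections import Counter


def normalize(text):
    """Collapse runs of whitespace; a deliberately minimal normalization."""
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # An empty string means "no answer"; it scores 1.0 only if both are empty.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("پاکستان", "پاکستان"))
print(f1_score("a b c", "a b"))
```

Dataset-level scores such as those in the table are averages of these per-question values, typically taking the maximum over the available reference answers for each question.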

Citation

Samee Arif, Sualeha Farid, Awais Athar, and Agha Ali Raza. 2024. UQA: Corpus for Urdu Question Answering. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17237–17244, Torino, Italia. ELRA and ICCL.