AfricaNLP-Public-Datasets

A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.

Datasets per task (Randomly ordered)

Machine Translation

TANZIL: Translations of the Quran into 42 languages, including African languages such as Amharic, Hausa, Somali, and Swahili.
MENYO-20k: A Yorùbá-English multi-domain parallel text dataset.
FFR: A Fon-French parallel text dataset.
Hausa Corpus: A Hausa-English parallel text dataset.
CCAligned: A parallel text dataset for English and 137 languages, including 30 African languages.
ParaCrawl: A parallel text dataset for 41 languages, including Somali and Swahili.
WikiMatrix: A parallel text dataset for 85 languages, including Swahili, Malagasy, and Egyptian Arabic.
Ethiopian MT datasets: A parallel text dataset for English paired with 7 Ethiopian languages.
English-Luganda: An English-Luganda parallel text dataset.
French-Fon and French-Ewe: A parallel text dataset for French paired with Fon and Ewe.
Amharic-English: An Amharic-English parallel text dataset.
Tigrinya-English: A Tigrinya-English parallel text dataset (Free registration required).
Lingala-French: A Lingala-French parallel text dataset (Free registration required).
Congolese Swahili-French (Min, Small, Medium): Congolese Swahili-French parallel text datasets (Free registration required).
Swahili-French: A synthetic Swahili-French parallel text dataset (Free registration required).
English-Hausa (Min, Small): English-Hausa parallel text datasets (Free registration required).
English-Swahili: An English-Swahili parallel text dataset (Free registration required).
English-Swahili: An English-Swahili text dataset of 229,312 pairs, provided as two separate files (Free registration required).
English-Kanuri: An English-Kanuri parallel text dataset (Free registration required).
English-Akuapem Twi: An English-Akuapem Twi parallel text dataset.
FLORES-101: A parallel text dataset for 101 languages, including 20 African languages.
isiXhosa-English: An isiXhosa-English parallel text dataset.
Tatoeba: A parallel text dataset for 409 languages, including 28 African languages.
Gnome: A technical domain parallel text dataset for 197 languages, including 16 African languages.
Ubuntu: A technical domain parallel text dataset for 244 languages, including 24 African languages.
OPUS-100: A parallel text dataset for 100 languages, including 9 African languages.
TICO-19: A COVID-19 domain parallel text dataset for 37 languages, including 13 African languages.
Mozilla localization: A parallel text dataset for 197 languages, including 18 African languages.

Text Classification

KINNEWS and KIRNEWS: News classification datasets for Kinyarwanda (KINNEWS) and Kirundi (KIRNEWS).
Setswana and Sepedi: News classification datasets for Setswana and Sepedi.
Swahili News: A news classification dataset for Swahili.
Amharic News Text classification: A news text classification dataset for Amharic.
VOA Hausa and BBC Yoruba news classification: News title classification datasets for Hausa and Yoruba.

Sentiment Analysis

TUNIZI: A Tunisian Arabizi sentiment analysis dataset.
NaijaSenti: A sentiment analysis dataset for Hausa, Igbo, Yoruba, and Nigerian Pidgin.

Text Summarization

Amharic Summarization: A dataset for Amharic abstractive text summarization.
XL-Sum: A dataset for multilingual abstractive text summarization for 44 languages, including 10 African languages.

Named Entity Recognition

MasakhaNER: A dataset for Named Entity Recognition for 10 African languages.
WikiANN: A dataset for Named Entity Recognition for 282 languages, including several African languages.
Yoruba GV NER: A Yoruba Named Entity Recognition dataset.
Hausa VOA NER: A Hausa Named Entity Recognition dataset.

Automated Speech Recognition (ASR)

ALFFA: An ASR dataset for Amharic, Hausa, Swahili, and Wolof.
AMMI ASR dataset: An ASR dataset for 19 languages, including 16 African languages.
CommonVoice: An ongoing ASR dataset project for 60 languages (as of May 2021), including Kinyarwanda, Kabyle, Luganda, and Hausa.
Fon: An ASR dataset for Fon.
Swahili: A Swahili speech dataset (Free registration required).
Congolese Swahili: A Congolese Swahili speech dataset (Free registration required).
BembaSpeech: An ASR dataset for Bemba.
SPCS Speech: A Sepedi speech dataset.
SADiLaR TTS: ASR datasets for Afrikaans, Sesotho, Setswana, and isiXhosa.
NCHLT Speech: Speech datasets for South Africa's eleven official languages, including Afrikaans, Xitsonga, Setswana, Sesotho, Sepedi, isiZulu, Tshivenda, Siswati, isiXhosa, and isiNdebele.
IARPA Babel Swahili data: An ASR dataset for Swahili (Requires payment of $25).

Speech Translation

Mboshi: A Mboshi-French parallel speech dataset.
IWSLT 2021 Speech Translation: Speech translation datasets for Swahili-English and Congolese Swahili-French.

Monolingual Data

Swahili Language Modeling: A Swahili dataset for language modeling and additional datasets for Swahili Syllabic Alphabet and Swahili Word Analogy.
OSCAR: A multilingual dataset for 166 languages, including Amharic, Somali, Yoruba, Egyptian Arabic, Malagasy, Swahili, and Afrikaans.
Luganda Agriculture data (Bukedde, Wikipedia): Monolingual datasets for Luganda in the agricultural domain from Bukedde and Wikipedia.
isiXhosa: A monolingual dataset for isiXhosa.
mC4: A multilingual dataset for 101 languages, including 13 African languages.
MOT v1.0: A multilingual dataset for 44 languages, including 11 African languages.

Phonetic Dictionary

ipa-dict: A phonetic dictionary for 23 languages, including Swahili.
za-lex: Lexical pronunciation datasets for 6 languages spoken in South Africa: Afrikaans, Southern Sotho, Xhosa, Zulu, SA English, and Tswana.

Chatbots (Conversational AI) Data

AfriWOZ1.0: A set of 6 African dialogue datasets, human-translated from MultiWOZ2.2, for training chatbots or conversational AI.

Other potential sources:

Contributions

This is a growing list of NLP datasets for African languages. If I have missed a publicly available dataset, please feel free to add it via a pull request, contact me on Twitter, or email me at niyongabor.andre@gmail.com.
