Skip to content

Machine translation (MT) benchmark dataset for languages in the Horn of Africa.

Notifications You must be signed in to change notification settings

asmelashteka/HornMT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HornMT

The HornMT repository contains data and the associated metadata for the project Machine Translation Benchmark Dataset for Languages in the Horn of Africa. It is a multi-way parallel corpus that will serve as a benchmark to accelerate progress in machine translation research and production systems for languages in the Horn of Africa.

Supported Languages

Language ISO 639-3 code
Afar aaf
Amharic amh
English eng
Oromo orm
Somali som
Tigrinya tir

Content

data/ contains one text file per language and each file contains news snippets in the same order for each language.

data
├── aar.txt
├── amh.txt
├── eng.txt
├── orm.txt
├── som.txt
└── tir.txt

metadata.tsv contains tab separated data describing each news snippet. The metadata contains the following fields.

  • Scope - describes whether the news is global or local. It takes two values: Global news and Local news.
  • Category - News category covering the following 12 topics
    • Art and Culture
    • Business and Economy
    • Conflicts and Attacks
    • Disaster and Accidents
    • Entertainment
    • Environment
    • Health
    • International Relations
    • Law and Crime
    • Politics
    • Science and Technology
    • Sport
  • Source - List of one or more URLs from which the news content is extracted or based on.
  • Domain - TLD corresponding to the URL(s) in Source.
  • Date - The publication date of the source article. The format is yyyy-mm-dd.

Other formats

All the data and associated metadata together in one file is also available in other file formats.

HornMT.xlsx - data and associated metadata in xlsx format.

HornMT.json - data and associated metadata in json format.

Below is an example row.

{
   "data":{
      "eng":"The World Meteorological Organisation reports that the ozone layer is damaged to its worst extent ever in the Arctic.",
      "aaf":"Baad Metrolojih Eglali Areketekeh Addal Ozonih qelu faxe waktik lafetle calat biyakisem xayose.",
      "amh":"የአለም የአየር ንብረት ድርጅት በአርክቲክ አካባቢ ያለው የኦዞን ምንጣፍ ከፍተኛ ጉዳት እንደደረሰበት አስታወቀ፡፡",
      "orm":"Dhaabbanni Meetiroolojii Addunyaa baqqaanni oozonii Arkiitik keessatti gara sadarkaa isa hamaa haga ammaatti akka miidhame gabaase.",
      "som":"Ururka Saadaasha Hawada Adduunka ayaa ku warramaya in lakabka ozoneka ee Ka koreeya dhulka baraflayda uu waxyeelladii abid ugu darnaa soo gaadhay.",
      "tir":"ውድብ ሜትሮሎጂ ዓለም ኣብ ኣርክቲክ ዝርከብ ናሕሲ ኦዞን ኣዝዩ ብዝኸፍአ ደረጃ ከምዝተጎድአ ሓቢሩ፡፡"
   },
   "metadata":{
      "scope":"Global",
      "category":"Science and Technology",
      "source":"https://www.independent.co.uk/environment/climate-change/ozone-layer-damaged-by-unusually-harsh-winter-2263653.html",
      "domain":"www.independent.co.uk",
      "date":"2011-04-05"
   }
}

Team

Translation Team

Afar

  • Mohammed Deresa
  • Yasin Nur

Amharic

  • Tigist Taye
  • Selamawit Hailemariam
  • Wako Tilahun

Oromo

  • Gemechis Melkamu
  • Galata Girmaye

Somali

  • Abdiselam Mohamed
  • Beshir Abdi

Tigrinya

  • Berhanu Abadi Weldegiorgis
  • Michael Minassie
  • Nureddin Mohammedshiek

Project Leaders

License

Shield: CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.

CC BY 4.0