covid19-datashare

A repository for sharing language resources related to the COVID-19 outbreak, in machine-readable format.

Sources we have already scraped/collected

Parallel Terminologies:

  • Translations of COVID-19-related terms in dozens of languages and locales, provided by Facebook and Google.

Parallel:

Monolingual/Comparable:

  • Wikipedia -- last updated March 25th 2020.
  • BBC World Service (22 languages) -- last updated May 8th 2020.
  • Voice of America (31 languages) -- last updated March 26th 2020.
  • Deutsche Welle (29 languages) -- last updated May 9th 2020.
  • El Diario (Papiamento language) -- last updated April 6th 2020.

Other Collections from our friends:

  • Translators Without Borders has compiled glossaries and is starting to provide translations.
  • TAUS has compiled a corpus of COVID-19-related parallel sentences. Available here. Note that these corpora are published under the CC BY-NC 4.0 license, which means the data can be shared and modified only for non-commercial purposes.
  • Microsoft has collected COVID-19-related Bing search logs (desktop users only) over the period of Jan 1st, 2020 – April 18th, 2020. They are available here.
  • An international team of scientists trying to estimate the number of cases with COVID-19 symptoms in different countries has put out surveys in 57 languages. (HT: @juliakreutzer)
  • The COVID-19 Myth Busters in World Languages has information in 31+ languages.
  • The EMEA corpus provides pdf conversions of documents from the European Medicines Agency (22 languages, 231 bitexts).
  • SketchEngine has collected an English in-domain corpus.

Directory Organization

This is the suggested organization; hopefully some content will move from one bucket to another as we keep refining it. Additional information (e.g. metadata) can be added through XML tags (see below). A purely illustrative directory layout is sketched after the list.

Parallel
- TMs (homogeneous)
- Terminologies (homogeneous)
- Documents (homogeneous)
- Sentences (sparse)
- Terms (sparse)

Comparable
- Documents (not translations; e.g. wiki pages)
- Sentences (as above)

Back-translated (the XML defines the original, the MT output, and the engine)
- Documents
- Sentences

Monolingual
- Documents (pages, docs)
- Sentences (e.g. tweets)
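
A hypothetical layout following these buckets (all directory and file names below are purely illustrative, not the repository's actual contents) could look like:

parallel/terminologies/facebook_terms.tmx
parallel/documents/...
comparable/documents/wikipedia/...
back-translated/documents/...
monolingual/documents/bbc/bbc_51871911.xml
monolingual/sentences/tweets/...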

Some of the data are not available under a CC0 license (the default license for this repository). For data that we reproduce/share under a different license (e.g. CC BY-SA 3.0 or others), this is denoted by the name of the directory and a corresponding README.

Data file format

The industry standard for sharing parallel data is TMX. TAUS and other translators can easily share their data in this format.
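
For reference, a minimal TMX 1.4 file looks roughly like the sketch below (the header attribute values and the sentence pair are purely illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="covid19-datashare" creationtoolversion="1.0"
          segtype="sentence" o-tmf="plaintext" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Wash your hands regularly.</seg></tuv>
      <tuv xml:lang="es"><seg>Lávese las manos con regularidad.</seg></tuv>
    </tu>
  </body>
</tmx>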

For other scraped/collected/filtered data (e.g. monolingual news articles), we suggest a very simple XML format, since it is important to include some metadata where available. This can/should include:

  • 'lang': the language of the document (please use ISO 639-3 three-letter codes).
  • 'source': the origin of the data (e.g. a URL or another dataset).
  • 'type': the type of data, e.g. mono, parallel, comparable, terminology, translation memory, and others.
  • 'docid': can be used to match parallel documents across languages, or to map filtered/aligned documents to their original dumped version.

Optionally, you can also add:

  • a 'term' field: to be used if there is additional information, e.g. for monolingual texts that were filtered based on some term (e.g. COVID, SARS, Ebola).
  • an 'original_lang' field: in the case of translated data, it is good to record the source language.
  • a 'translation_mode' field: in the case of (back-)translated data, please denote whether this was "auto" (for MT) or "manual" (for human-generated translations).

Example for document level:

<doc docid='bbc_51871911' lang='eng' type='mono' source_url='https://www.bbc.co.uk/gahuza/51871911' term='COVID'>
text text text
</doc>

If we manage to align documents at the sentence level, we can add 'sent_id' information, e.g.

<doc docid='bbc_51871911' lang='eng' type='mono' source_url='https://www.bbc.co.uk/gahuza/51871911' term='COVID'>
<s sent_id='1'>text text text</s>
<s sent_id='2'>text text text</s>
...
</doc>
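
As a rough sketch (not part of the repository), such a sentence-split file can be read with Python's standard library, assuming one <doc> element per file; the file name in the usage lines is hypothetical:

import xml.etree.ElementTree as ET

def read_doc(path):
    # Parse a single <doc> element and return its metadata plus (sent_id, text) pairs.
    doc = ET.parse(path).getroot()
    meta = dict(doc.attrib)  # docid, lang, type, source_url, term, ...
    sents = [(s.get('sent_id'), (s.text or '').strip()) for s in doc.findall('s')]
    return meta, sents

# Hypothetical usage:
meta, sents = read_doc('bbc_51871911.xml')
print(meta['lang'], len(sents))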

If you can't or don't want to convert your data into this XML format, you can also share plain text files; in that case, please add a README that provides information on the source of the data, its type, and so on.

For sharing large files, you can also upload a compressed archive. The current repo tracks .zip and .tar.gz files and stores them using git-lfs.
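
Since these patterns are already tracked, contributors with git-lfs installed can add an archive with the usual git workflow (the file name below is only illustrative):

git lfs install                      # one-time setup per machine
git add my_corpus.tar.gz             # *.tar.gz is handled by git-lfs via .gitattributes
git commit -m "Add corpus archive"
git push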

Contact/Contributors

Here is who has contributed content so far:

  • Facebook terminologies: Facebook. Contact: Francisco (Paco) Guzman
  • Google terminologies: Google. Contact: Mengmeng Niu (mniu [at] google [dot] com) and Ian Hill (ihill [at] google [dot] com)
  • The Zhejiang handbook: Alp Öktem from TWB.
  • The Kaggle dataset was originally compiled by Liling Tang.
  • Everything else was scraped by Neulab members: Junjie Hu, Zi-Yi Dou, and Antonis Anastasopoulos. Contact: Antonis.
  • Microsoft shared their URL list from past disaster responses. Contact: Will Lewis
  • Hady Elsahar did an initial scraping of COVID-19 Wikipedia domains: it's here.
