covid19-datashare

A repository for sharing language resources related to the COVID-19 outbreak, in machine-readable format.

Sources we have already scraped/collected

Parallel Terminologies:

  • Translations of COVID-19-related terms in dozens of languages and locales, provided by Facebook and Google.

Parallel:

Monolingual/Comparable:

  • Wikipedia -- last updated March 25th 2020.
  • BBC World Service (22 languages) -- last updated May 8th 2020.
  • Voice of America (31 languages) -- last updated March 26th 2020.
  • Deutsche Welle (29 languages) -- last updated May 9th 2020.
  • El Diario (Papiamento language) -- last updated April 6th 2020.

Other Collections from our friends:

  • Translators Without Borders has compiled glossaries and is starting to provide translations.
  • TAUS has compiled a corpus of COVID-19-related parallel sentences. Available here. Note that these corpora are published under the CC BY-NC 4.0 license, which means the data can be shared and modified only for non-commercial purposes.
  • Microsoft has collected COVID-19-related Bing search logs (desktop users only) over the period of Jan 1st, 2020 – April 18th, 2020. They are available here.
  • An international team of scientists trying to estimate the number of cases with COVID-19 symptoms in different countries has put out surveys in 57 languages. (HT: @juliakreutzer)
  • The COVID-19 Myth Busters in World Languages has information in 31+ languages.
  • The EMEA corpus provides pdf conversions of documents from the European Medicines Agency (22 languages, 231 bitexts).
  • SketchEngine has collected an English in-domain corpus.

Directory Organization

This is the suggested organization; hopefully some content will move from one bucket to another as we keep refining it. Additional information (e.g. metadata) can be added through XML tags (see below). A purely illustrative directory layout is sketched after the list.

Parallel
- TMs (homogeneous)
- Terminologies (homogeneous)
- Documents (homogeneous)
- Sentences (sparse)
- Terms (sparse)

Comparable
- Documents (not translations; e.g. wiki pages)
- Sentences (as above)

Back-translated (the XML defines the original, the MT output, and the engine)
- Documents
- Sentences

Monolingual
- Documents (pages, docs)
- Sentences (e.g. tweets)
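
A hypothetical layout following these buckets (all directory and file names below are purely illustrative, not the repository's actual contents) could look like:

parallel/terminologies/facebook_terms.tmx
parallel/documents/...
comparable/documents/wikipedia/...
back-translated/documents/...
monolingual/documents/bbc/bbc_51871911.xml
monolingual/sentences/tweets/...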

Some of the data are not available under a CC0 license (the default license for this repository). For data that we reproduce/share under a different license (e.g. CC BY-SA 3.0 or others), this is denoted by the name of the directory and a corresponding README.

Data file format

The industry standard for sharing parallel data is TMX. TAUS and other translators can easily share their data in this format.
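
For reference, a minimal TMX 1.4 file looks roughly like the sketch below (the header attribute values and the sentence pair are purely illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="covid19-datashare" creationtoolversion="1.0"
          segtype="sentence" o-tmf="plaintext" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Wash your hands regularly.</seg></tuv>
      <tuv xml:lang="es"><seg>Lávese las manos con regularidad.</seg></tuv>
    </tu>
  </body>
</tmx>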

For other scraped/collected/filtered data (e.g. monolingual news articles), we suggest a very simple XML format, since it is important to include some metadata where available. This can/should include:

  • 'lang': the language of the document (please use ISO 639-3 three-letter codes).
  • 'source': the origin of the data (e.g. a URL or another dataset).
  • 'type': the type of data, e.g. mono, parallel, comparable, terminology, translation memory, and others.
  • 'docid': can be used to match parallel documents across languages, or to map filtered/aligned documents to their original dumped version.

Optionally, you can also add:

  • a 'term' field: to be used if there is additional information, e.g. for monolingual texts that were filtered based on some term (e.g. COVID, SARS, Ebola).
  • an 'original_lang' field: in the case of translated data, it is good to record the source language.
  • a 'translation_mode' field: in the case of (back-)translated data, please denote whether this was "auto" (for MT) or "manual" (for human-generated translations).

Example for document level:

<doc docid='bbc_51871911' lang='eng' type='mono' source_url='https://www.bbc.co.uk/gahuza/51871911' term='COVID'>
text text text
</doc>

If we manage to align documents at the sentence level, we can add 'sent_id' information, e.g.

<doc docid='bbc_51871911' lang='eng' type='mono' source_url='https://www.bbc.co.uk/gahuza/51871911' term='COVID'>
<s sent_id='1'>text text text</s>
<s sent_id='2'>text text text</s>
...
</doc>
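
As a rough sketch (not part of the repository), such a sentence-split file can be read with Python's standard library, assuming one <doc> element per file; the file name in the usage lines is hypothetical:

import xml.etree.ElementTree as ET

def read_doc(path):
    # Parse a single <doc> element and return its metadata plus (sent_id, text) pairs.
    doc = ET.parse(path).getroot()
    meta = dict(doc.attrib)  # docid, lang, type, source_url, term, ...
    sents = [(s.get('sent_id'), (s.text or '').strip()) for s in doc.findall('s')]
    return meta, sents

# Hypothetical usage:
meta, sents = read_doc('bbc_51871911.xml')
print(meta['lang'], len(sents))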

If you can't or don't want to convert your data into this XML format, you can also share plain text files; in that case, please add a README that provides information on the source of the data, its type, and so on.

For sharing large files, you can also upload a compressed archive. The current repo tracks .zip and .tar.gz files and stores them using git-lfs.
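
Since these patterns are already tracked, contributors with git-lfs installed can add an archive with the usual git workflow (the file name below is only illustrative):

git lfs install                      # one-time setup per machine
git add my_corpus.tar.gz             # *.tar.gz is handled by git-lfs via .gitattributes
git commit -m "Add corpus archive"
git push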

Contact/Contributors

Here is who has contributed content so far:

  • Facebook terminologies: Facebook. Contact: Francisco (Paco) Guzman
  • Google terminologies: Google. Contact: Mengmeng Niu (mniu [at] google [dot] com) and Ian Hill (ihill [at] google [dot] com)
  • The Zhejiang handbook: Alp Öktem from TWB.
  • The Kaggle dataset was originally compiled by Liling Tang.
  • Everything else was scraped by Neulab members: Junjie Hu, Zi-Yi Dou, and Antonis Anastasopoulos. Contact: Antonis.
  • Microsoft shared their URL list from past disaster responses. Contact: Will Lewis
  • Hady Elsahar did an initial scraping of COVID-19 Wikipedia domains: it's here.
