
Data collection for different languages #2

Open
andra-pumnea opened this issue Mar 20, 2020 · 13 comments

@andra-pumnea
Contributor

Find official data sources for FAQ about COVID-19 in different languages and scrape them.

@borhenryk
Contributor

I already have a script for the RKI FAQ. Will share it later!

@bogdankostic
Contributor

If someone needs a starting point, I already wrote scrapers for WHO and some pages of CDC:
https://github.com/deepset-ai/COVID-QA/tree/master/data/scrapers
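For anyone writing a new scraper, the basic pattern is to download a FAQ page and extract question/answer pairs from its markup. A minimal standard-library sketch (the h3/p structure and the sample HTML here are made up for illustration; a real scraper targets each site's actual markup and would first fetch the page, e.g. with requests):

```python
from html.parser import HTMLParser

class FAQParser(HTMLParser):
    """Collect (question, answer) pairs from simple <h3>/<p> FAQ markup."""
    def __init__(self):
        super().__init__()
        self.pairs = []        # accumulated (question, answer) tuples
        self._tag = None       # tag we are currently inside
        self._question = None  # last question seen, awaiting its answer

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "h3":
            self._question = text
        elif self._tag == "p" and self._question:
            self.pairs.append((self._question, text))
            self._question = None

# Fabricated FAQ fragment standing in for a downloaded page.
html = """
<h3>What is COVID-19?</h3>
<p>COVID-19 is the disease caused by the SARS-CoV-2 virus.</p>
<h3>How does it spread?</h3>
<p>Mainly via respiratory droplets.</p>
"""
parser = FAQParser()
parser.feed(html)
print(parser.pairs)
```

The existing scrapers in the repo are the authoritative reference for how each target site is actually parsed; this only shows the overall shape.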

@tholor tholor added Data enhancement New feature or request labels Mar 20, 2020
@andra-pumnea
Contributor Author

I will do some scraping for Romanian

@stedomedo
Contributor

I'll add Italian

@tkh42
Contributor

tkh42 commented Mar 20, 2020

I will look into some more German pages.

@borhenryk
Contributor

@tkh42 let me know which ones, so we don't duplicate work. This one would probably make sense: https://www.infektionsschutz.de/coronavirus/faqs-coronaviruscovid-19.html

@tkh42
Contributor

tkh42 commented Mar 20, 2020

@HenrykBorzymowski Ok. Yes, I have thought about doing that one too; I think I will start with https://www.bmas.de/DE/Presse/Meldungen/2020/corona-virus-arbeitsrechtliche-auswirkungen.html

@Timoeller
Contributor

Perfect, people, this is taking off rather quickly :D
I can invite you to our Slack crawler group if you tell me your wirvsvirus Slack names.

I would also suggest creating small issues stating which website you want to work on, so we don't duplicate work or write the same crawler twice. State the website in the issue title so GitHub can surface related issues easily. Thanks!

@borhenryk
Contributor

Here is a Google Sheet where we can track which pages we already have a scraper for, etc. Please fill it in and update as needed: https://docs.google.com/spreadsheets/d/1er-7sDvgMZ484FRhPL7X6rl1fgRIRtA7fJfj-gLp3jg/edit?usp=sharing

@Timoeller Timoeller self-assigned this Mar 21, 2020
@Timoeller
Contributor

@tkh42 Can I somehow help or motivate you to create scrapers for German sites? :D

We already started the labeling process and need more questions!

@tkh42
Contributor

tkh42 commented Mar 21, 2020

@Timoeller I am finished with the BMAS one. I will create the pull request and continue with the next. :)

@stedomedo
Contributor

stedomedo commented Mar 22, 2020

One way to "easily" get multilingual data is to machine-translate.
pip install googletrans (and then use Translator(service_urls=["translate.google.com/gen204"]))
These service URLs hit older Google Translate versions, so the quality is worse than the production service, but it's free. The lower-quality output would only be used in the background, though, not shown to the user.
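A rough sketch of that translation step (the helper name translate_faq is made up; the backend is injected as a callable so the example runs offline, but with googletrans installed you would pass in a Translator's translate method, which returns objects carrying the result in a .text attribute):

```python
def translate_faq(pairs, translate, dest="de"):
    """Translate (question, answer) FAQ pairs with any backend that
    exposes translate(text, dest=...) -> object with a .text attribute."""
    return [(translate(q, dest=dest).text, translate(a, dest=dest).text)
            for q, a in pairs]

# Offline stand-in backend that mimics that return shape, so the
# sketch runs without network access.
class _Result:
    def __init__(self, text):
        self.text = text

def fake_translate(text, dest="de"):
    return _Result(f"[{dest}] {text}")

translated = translate_faq([("What is COVID-19?", "A disease.")], fake_translate)
print(translated)
```

Passing the backend in also makes it easy to swap the free endpoint for a better translation service later without touching the FAQ-processing code.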

A workflow like this could then work for the user:
Type a query in Spanish
-> QA system detects the Spanish query
-> QA system matches it against Spanish originals and/or from-English-translated questions/answers
-> QA system shows answers in the original language, with an option to web-translate them with Google

This would be easier than real-time translation and/or collecting sufficient data in many languages.
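The routing in that workflow could look roughly like this. Everything here is a stub: detect_language is a toy word-list heuristic standing in for a real language detector (e.g. langdetect), and the FAQ lookup is an exact-match dict rather than a retrieval model:

```python
def detect_language(query):
    """Toy stand-in for a real language detector."""
    spanish_markers = {"qué", "cómo", "es", "el", "la"}
    words = set(query.lower().split())
    return "es" if words & spanish_markers else "en"

# FAQ index keyed by language; each value maps question -> answer in
# that language (original or pre-translated from English).
faq_index = {
    "en": {"what is covid-19?": "A disease caused by SARS-CoV-2."},
    "es": {"¿qué es covid-19?": "Una enfermedad causada por el SARS-CoV-2."},
}

def answer(query):
    lang = detect_language(query)                        # 1. detect query language
    match = faq_index.get(lang, {}).get(query.lower())   # 2. match in that language
    return lang, match                                   # 3. answer stays in the original language

print(answer("¿Qué es COVID-19?"))
```

In the real system the exact-match dict would be replaced by the retrieval/QA model; only the detect-then-route-by-language structure is the point here.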

@stedomedo
Contributor

Multilingual resources can also easily be found using Linguee, by checking the sources of the sentences it finds for a language pair, e.g. for DE:
https://www.linguee.com/english-german/search?source=auto&query=coronavirus

sfakir added a commit that referenced this issue Mar 23, 2020
merge latest update back to local repo