Scrape sentences from wikidata #92

Open
stefangrotz opened this issue Feb 17, 2020 · 4 comments

@stefangrotz
Contributor

stefangrotz commented Feb 17, 2020

Wikidata is completely under CC0, which makes it very attractive for the project. It contains both sentences and, sometimes, audio, but for this issue I want to focus on sentences.

This issue is a work in progress. I want to collect possible sources for sentences in Wikidata:

  • P5831 usage example: an example sentence for a word, often with a language added in brackets.
  • A "Description" exists in many languages for many Wikidata items, but it isn't always a complete sentence.

The next step would be to write a script to scrape these sentences.
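A minimal sketch of what such a script could look like, assuming the public Wikidata Query Service SPARQL endpoint and the standard SPARQL JSON results format. The function names (`build_query`, `fetch`, `extract_sentences`) and the User-Agent string are made up for illustration and are not part of any existing scraper.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(language=None, limit=100):
    """Build a SPARQL query for P5831 (usage example) values.

    `language` is an optional language code such as "de"; `limit` caps the rows.
    """
    if language is not None and not language.replace("-", "").isalnum():
        raise ValueError("unexpected characters in language code")
    lang_filter = f'  FILTER(LANG(?example) = "{language}")\n' if language else ""
    return (
        "SELECT ?lexeme ?example WHERE {\n"
        "  ?lexeme wdt:P5831 ?example .\n"
        f"{lang_filter}"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def fetch(query, user_agent="sentence-scraper-sketch/0.1 (hypothetical)"):
    """Send the query to the endpoint and return the decoded JSON (needs network)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": user_agent,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_sentences(results):
    """Turn SPARQL JSON results into a list of (language, sentence) pairs."""
    sentences = []
    for row in results["results"]["bindings"]:
        example = row["example"]
        # Language-tagged literals carry their tag under "xml:lang".
        sentences.append((example.get("xml:lang", ""), example["value"]))
    return sentences
```

Something like `extract_sentences(fetch(build_query("de")))` would then give a list that could be written out one sentence per line, which should be parseable enough for the existing scraper rules.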

@stefangrotz stefangrotz changed the title scrape sentences from wikidata scrap sentences from wikidata Feb 17, 2020
@MichaelKohler
Member

Looks like these are indeed CC0. I don't think we need to ask legal for this. @nukeador do you agree?

Would love to see a selection of these sentences. Also, I assume you are aware of the scraper capabilities for other resources? As long as we can get it into a parseable state, it then can directly be integrated in the scraper to use the rules and everything. More details in the last part of the README. Also happy to explain further if needed.

@stefangrotz
Contributor Author

stefangrotz commented Feb 17, 2020

> As long as we can get it into a parseable state, it then can directly be integrated in the scraper to use the rules and everything.

This was exactly what I was thinking. Right now, the example sentences for a datatype called "lexemes" are relatively new. They have existed since 2018. But they are planning to move all Wiktionary data into Wikidata, so we will likely have more sentences in the future.

Wikidata is huge; I am sure there are more data types that contain sentences.

> Would love to see a selection of these sentences.

I always wanted to learn Wikidata queries, and this is a nice little project to finally do it. I will post some examples tomorrow or so.

@nukeador

Note that only these 4 namespaces are CC0:

All structured data from the main, Property, Lexeme, and EntitySchema namespaces is available under the Creative Commons CC0 License; text in the other namespaces is available under the Creative Commons Attribution-ShareAlike License;

Do we have data on how many sentences we have for each language?
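One way to get those numbers might be an aggregate SPARQL query over P5831, grouped by the language tag of each example. This is only a sketch: `COUNT_QUERY` and `parse_counts` are hypothetical names, and the parsing assumes the standard SPARQL JSON results format.

```python
# Aggregate query: number of P5831 usage examples per language tag.
COUNT_QUERY = """
SELECT ?lang (COUNT(?example) AS ?count) WHERE {
  ?lexeme wdt:P5831 ?example .
  BIND(LANG(?example) AS ?lang)
}
GROUP BY ?lang
ORDER BY DESC(?count)
"""


def parse_counts(results):
    """Map each language tag to its sentence count from SPARQL JSON results."""
    counts = {}
    for row in results["results"]["bindings"]:
        # Numeric literals arrive as strings in the JSON results.
        counts[row["lang"]["value"]] = int(row["count"]["value"])
    return counts
```

The query text can be pasted into the query service as is; `parse_counts` only covers turning the JSON response into a plain dictionary.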

@Adrijaned

I've already suggested using P5831 earlier in the sentence-collector project (common-voice/sentence-collector#260), but, as per this query, there are currently only about 4000 sentences in P5831, some of which are probably repetitions. After uncommenting the first line of the query, you should be able to filter sentences by language using the query helper (accessible by clicking the (i) on the left sidebar).

All of those should be in the Lexeme namespace, so licensing should not be an issue.
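Since some of the sentences are probably repetitions, a scraper would likely want to drop them before export. A rough sketch, assuming the sentences are already available as (language, text) pairs; `deduplicate` is a hypothetical helper, and treating sentences as equal after whitespace and case normalization is just one possible choice.

```python
def deduplicate(sentences):
    """Drop repeated (language, text) pairs, keeping the first occurrence.

    Two sentences count as the same if they match after collapsing
    whitespace and ignoring case within the same language.
    """
    seen = set()
    unique = []
    for lang, text in sentences:
        key = (lang, " ".join(text.split()).casefold())
        if key not in seen:
            seen.add(key)
            unique.append((lang, text))
    return unique
```

The same text under two different language tags is kept twice here, on the assumption that each language gets its own sentence file anyway.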
