
Scholia is only as good as the data on studies etc. in Wikidata, which is miserable – bulk imports from datasets are needed #2436

Open
prototyperspective opened this issue Mar 14, 2024 · 3 comments
Labels
data-quality issues related to the quality of the data that Scholia shows

Comments

@prototyperspective

What is the issue?
Most studies (and books) are not in Scholia since they're not in Wikidata.

Why is this a problem?
The platform's value is determined by the quality and extent of the data in Wikidata, yet most books and papers have not been imported into its structured format. Scholia could start to become truly useful, via AI-assigned "main subject" statements (see e.g. #1896 #1733 #1730 for topic-related use cases) and statistical charts, if perhaps 40% of all studies, or 60% of cited/notable ones, were included. Currently it seems that not even 5% have been integrated; for example, not even most of the studies in the uppermost altmetrics percentiles whose images I've uploaded here are present.
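
For reference, coverage can be roughly estimated by counting items typed as scholarly article (Q13442814) on the Wikidata Query Service. A minimal Python sketch of such a check (the count query may itself time out on the public endpoint, which relates to the scalability concern raised below):

```python
# Rough coverage check: count Wikidata items typed as "scholarly article" (Q13442814).
# The public WDQS endpoint enforces timeouts, so a full count may fail.
import requests

QUERY = """
SELECT (COUNT(?item) AS ?n) WHERE {
  ?item wdt:P31 wd:Q13442814 .
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "scholia-coverage-check/0.1 (example)"},
    timeout=120,
)
resp.raise_for_status()
count = resp.json()["results"]["bindings"][0]["n"]["value"]
print(f"Scholarly articles in Wikidata: {count}")
```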

How could this be addressed?
This could be addressed by bulk-importing (and updating/refining) items from an existing database using a script. Please see my post about this here, which links to several such datasets that are either potential or readily available.
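
As a rough illustration (not a finished importer), such a script could read a dataset export and emit QuickStatements commands. The sketch below assumes QuickStatements v1 syntax and a hypothetical papers.csv with title, doi and year columns:

```python
# Minimal sketch of a bulk-import helper: read a (hypothetical) CSV export from an
# open catalogue and emit QuickStatements v1 commands to create scholarly-article items.
# Property/item IDs used: P31 instance of, Q13442814 scholarly article,
# P356 DOI, P1476 title, P577 publication date.
import csv

def to_quickstatements(row):
    """Turn one record into QuickStatements lines for a new item."""
    return "\n".join([
        "CREATE",
        f'LAST\tLen\t"{row["title"]}"',            # English label
        "LAST\tP31\tQ13442814",                     # instance of: scholarly article
        f'LAST\tP356\t"{row["doi"].upper()}"',      # DOI (stored uppercase on Wikidata)
        f'LAST\tP1476\ten:"{row["title"]}"',        # title (monolingual text)
        f'LAST\tP577\t+{row["year"]}-01-01T00:00:00Z/9',  # publication date, year precision
    ])

with open("papers.csv", newline="", encoding="utf-8") as f:  # hypothetical input file
    for row in csv.DictReader(f):  # expects columns: title, doi, year
        print(to_quickstatements(row))
```

In practice the script would also need to match DOIs against existing items first, so it updates rather than duplicates them.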

What are good places to discuss this?
Here, at the linked page, and perhaps at some other Wikidata venue that is less focused on books and more on scientific papers.

@prototyperspective prototyperspective added the data-quality issues related to the quality of the data that Scholia shows label Mar 14, 2024
@egonw
Collaborator

egonw commented Mar 16, 2024

Thanks for the cross-link! Often changes on the Wikidata side require updates here. The book example with versions and editions is an important one.

Scholia indeed just visualizes what is in Wikidata, and the goal here is to make Scholia as useful as possible (without resulting in timed-out queries). Scholia should not, imho, be a platform to discuss what Wikidata can or cannot handle. That is, mass-importing data is to be discussed on Wikidata (as it is in this case). But the simple fact is that the current Wikidata platform is not as scalable as everyone would love it to be.

@fnielsen
Collaborator

I am under the impression that the bots that imported and annotated WikiCite data have been switched off due to fear of the Wikidata Query Service getting into trouble.

@andrawaag
Collaborator

I disagree: bulk importing is not a solution. I have switched off some if not all bulk-importing bots, not out of fear of the WDQS getting into trouble, but because bulk importing would make Wikidata less useful. We don't know the exact number of all books and papers, but even the most conservative estimates give a number far bigger than the current number of Wikidata items, so trying to achieve complete recall is basically impossible.

IMO we should try to build AI-ready corpora not by increasing coverage in Wikidata, but by creating independent RDF graphs of books and papers that reuse the Wikidata namespace. Building RDF graphs of the size of Wikidata (or bigger) is relatively easy if they are constructed directly as RDF graphs (i.e. without having to rely on the limitations of the Wikibase API). This approach still requires Wikidata bots, because main-topic items will still need to be minted.
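
For illustration, a minimal rdflib sketch of such a standalone graph that reuses Wikidata class and property URIs; the DOI and the main-subject QID below are placeholders chosen for the example:

```python
# Describe a paper with Wikidata property/class URIs in an independent RDF graph,
# without writing anything to Wikidata itself.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

WD = Namespace("http://www.wikidata.org/entity/")        # Wikidata items
WDT = Namespace("http://www.wikidata.org/prop/direct/")  # "truthy" direct properties

g = Graph()
g.bind("wd", WD)
g.bind("wdt", WDT)

paper = URIRef("https://example.org/paper/10.1234/placeholder")  # local node, not a QID
g.add((paper, WDT.P31, WD.Q13442814))                     # instance of: scholarly article
g.add((paper, WDT.P356, Literal("10.1234/PLACEHOLDER")))  # DOI (placeholder)
g.add((paper, WDT.P921, WD.Q11660))                       # main subject: e.g. artificial intelligence
g.add((paper, RDFS.label, Literal("Placeholder paper title", lang="en")))

print(g.serialize(format="turtle"))
```

Only the main-subject items (the objects of P921) would still need to exist as real Wikidata items, which is where bots remain necessary.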
