
Scholia is only as good as the data on studies etc. in Wikidata, which is miserable – bulk imports from datasets are needed #2436

Open
prototyperspective opened this issue Mar 14, 2024 · 3 comments
Labels
data-quality issues related to the quality of the data that Scholia shows

Comments

@prototyperspective

What is the issue?
Most studies (and books) are not in Scholia since they're not in Wikidata.

Why is this a problem?
The platform's value is determined by the quality and extent of the data in Wikidata, yet most books and papers have not been imported into its structured format. Scholia could start to become truly useful, via AI-assigned "main subject" statements (see e.g. #1896 #1733 #1730 for topic-related use cases) and statistical charts, if perhaps 40% of all studies, or 60% of cited/notable ones, were included. Currently it seems that not even 5% have been integrated; for example, not even most of the studies in the uppermost altmetrics percentiles whose images I've uploaded here are present.
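
For reference, coverage can be roughly estimated by counting items typed as scholarly article (Q13442814) on the Wikidata Query Service. A minimal Python sketch of such a check (the count query may itself time out on the public endpoint, which relates to the scalability concern raised below):

```python
# Rough coverage check: count Wikidata items typed as "scholarly article" (Q13442814).
# The public WDQS endpoint enforces timeouts, so a full count may fail.
import requests

QUERY = """
SELECT (COUNT(?item) AS ?n) WHERE {
  ?item wdt:P31 wd:Q13442814 .
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "scholia-coverage-check/0.1 (example)"},
    timeout=120,
)
resp.raise_for_status()
count = resp.json()["results"]["bindings"][0]["n"]["value"]
print(f"Scholarly articles in Wikidata: {count}")
```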

How could this be addressed?
This could be addressed by bulk-importing (and updating/refining) items from an existing database using a script. Please see my post about this here, which links to several such datasets that are either potential or readily available.
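
As a rough illustration (not a finished importer), such a script could read a dataset export and emit QuickStatements commands. The sketch below assumes QuickStatements v1 syntax and a hypothetical papers.csv with title, doi and year columns:

```python
# Minimal sketch of a bulk-import helper: read a (hypothetical) CSV export from an
# open catalogue and emit QuickStatements v1 commands to create scholarly-article items.
# Property/item IDs used: P31 instance of, Q13442814 scholarly article,
# P356 DOI, P1476 title, P577 publication date.
import csv

def to_quickstatements(row):
    """Turn one record into QuickStatements lines for a new item."""
    return "\n".join([
        "CREATE",
        f'LAST\tLen\t"{row["title"]}"',            # English label
        "LAST\tP31\tQ13442814",                     # instance of: scholarly article
        f'LAST\tP356\t"{row["doi"].upper()}"',      # DOI (stored uppercase on Wikidata)
        f'LAST\tP1476\ten:"{row["title"]}"',        # title (monolingual text)
        f'LAST\tP577\t+{row["year"]}-01-01T00:00:00Z/9',  # publication date, year precision
    ])

with open("papers.csv", newline="", encoding="utf-8") as f:  # hypothetical input file
    for row in csv.DictReader(f):  # expects columns: title, doi, year
        print(to_quickstatements(row))
```

In practice the script would also need to match DOIs against existing items first, so it updates rather than duplicates them.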

What are good places to discuss this?
Here, at the linked page, and perhaps at some other Wikidata venue that is less focused on books and more on scientific papers.

@prototyperspective prototyperspective added the data-quality issues related to the quality of the data that Scholia shows label Mar 14, 2024
@egonw
Collaborator

egonw commented Mar 16, 2024

Thanks for the cross-link! Often changes on the Wikidata side require updates here. The book example with versions and editions is an important one.

Scholia indeed just visualizes what is in Wikidata, and the goal here is to make Scholia as useful as possible (without resulting in timed-out queries). Scholia should not, imho, be a platform to discuss what Wikidata can or cannot handle. That is, mass-importing data is to be discussed on Wikidata (as it is in this case). But the simple fact is that the current Wikidata platform is not as scalable as everyone would love it to be.

@fnielsen
Collaborator

I am under the impression that the bots that imported and annotated WikiCite data have been switched off due to fear of the Wikidata Query Service getting into trouble.

@andrawaag
Collaborator

I disagree: bulk importing is not a solution. I have switched off some if not all bulk-importing bots, not out of fear of the WDQS getting into trouble, but because bulk importing would make Wikidata less useful. We don't know the exact number of all books and papers, but even the most conservative estimates give a number far bigger than the current number of Wikidata items, so trying to achieve complete recall is basically impossible.

IMO we should try to build AI-ready corpora not by increasing coverage in Wikidata, but by creating independent RDF graphs of books and papers that reuse the Wikidata namespace. Building RDF graphs of the size of Wikidata (or bigger) is relatively easy if they are constructed directly as RDF graphs (i.e. without having to rely on the limitations of the Wikibase API). This approach still requires Wikidata bots, because main-topic items will still need to be minted.
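
For illustration, a minimal rdflib sketch of such a standalone graph that reuses Wikidata class and property URIs; the DOI and the main-subject QID below are placeholders chosen for the example:

```python
# Describe a paper with Wikidata property/class URIs in an independent RDF graph,
# without writing anything to Wikidata itself.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

WD = Namespace("http://www.wikidata.org/entity/")        # Wikidata items
WDT = Namespace("http://www.wikidata.org/prop/direct/")  # "truthy" direct properties

g = Graph()
g.bind("wd", WD)
g.bind("wdt", WDT)

paper = URIRef("https://example.org/paper/10.1234/placeholder")  # local node, not a QID
g.add((paper, WDT.P31, WD.Q13442814))                     # instance of: scholarly article
g.add((paper, WDT.P356, Literal("10.1234/PLACEHOLDER")))  # DOI (placeholder)
g.add((paper, WDT.P921, WD.Q11660))                       # main subject: e.g. artificial intelligence
g.add((paper, RDFS.label, Literal("Placeholder paper title", lang="en")))

print(g.serialize(format="turtle"))
```

Only the main-subject items (the objects of P921) would still need to exist as real Wikidata items, which is where bots remain necessary.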
