
Wikidata Reference Statistics


IMPORTANT NOTE

Due to a logical error in one of the queries, the data in one column of Table 1 in the paper changes slightly. The column shows the number and percentage of referenced statements. In the original query, DISTINCT was applied to the external (alias) variable instead of the internal variable (fixed in this commit). The corrected data for each dataset is given below. All other information in the paper, as well as the datasets themselves, is unaffected.

| Project   | Dump | Referenced Statements (paper) | Referenced Statements (corrected) |
|-----------|------|-------------------------------|-----------------------------------|
| Gene Wiki | 2016 | 8,789,246 (50%)               | 8,699,626 (49%)                   |
| Gene Wiki | 2021 | 65,780,005 (71%)              | 61,080,346 (66%)                  |
| Taxonomy  | 2016 | 8,146,218 (51%)               | 8,061,019 (50%)                   |
| Taxonomy  | 2021 | 19,423,938 (60%)              | 16,778,074 (52%)                  |
| Astronomy | 2016 | 751,158 (85%)                 | 695,795 (78%)                     |
| Astronomy | 2021 | 128,394,763 (89%)             | 127,751,791 (88%)                 |
| Law       | 2016 | 48,225 (27%)                  | 48,132 (27%)                      |
| Law       | 2021 | 2,266,462 (53%)               | 2,257,890 (53%)                   |
| Music     | 2016 | 2,298,330 (61%)               | 2,135,020 (57%)                   |
| Music     | 2021 | 6,342,019 (54%)               | 5,920,103 (51%)                   |
| Ships     | 2016 | 114,528 (62%)                 | 111,121 (61%)                     |
| Ships     | 2021 | 315,381 (29%)                 | 295,885 (27%)                     |

This repository contains the materials of the paper "Reference Statistics in Wikidata Topical Subsets", presented at the 2nd Wikidata Workshop at ISWC 2021. The contents are:

  • Query Results: The output of the SPARQL queries for each experiment.
  • Scripts: The Python script used to enrich WDumper specification files with subclasses, and the Jupyter notebook for plotting the charts.
  • SPARQL Queries: The SPARQL queries used to fetch reference statistics from the datasets.
  • WDumper Specification Files: The specification files for extracting topical subsets corresponding to 6 different WikiProjects via the WDumper tool. There are two JSON files for each project: the first contains only the top-level classes; the second (with the '_sub' suffix) is enriched with subclasses. The experiments were done using the latter.
  • ShEx schemata: The Shape Expression schemas of the 6 WikiProjects, suitable for subsetting through ShEx validators such as shex-js and PyShEx (slurping).

To reproduce the experiments

  1. Download the relevant dataset for the desired WikiProject from this Zenodo repository. There is one .nt.gz file (or a collection of them) for each project. Alternatively, you can use WDumper with the specification files provided in the WDumper Specification Files directory and build each project's dataset from scratch. To use WDumper, go through the following steps:
$ git clone https://github.com/bennofs/wdumper.git
$ cd wdumper/
$ gradle build   # you will need JDK-11 and gradle
$ cd build/install/wdumper/bin/
$ ./wdumper-cli [wikidata dump.json.gz] [specification file.json]
  • Note: the output will be a file named 'wdump-1.nt.gz'. Before starting another extraction, rename or move this file, as it will be overwritten.
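For example, each run's output can be moved aside before the next extraction (a sketch; the target filename 'genewiki_sub.nt.gz' is only an illustration, not a name required by the tools):

```shell
# Move the fixed-name WDumper output aside so the next run does not overwrite it.
# 'genewiki_sub.nt.gz' is an example target name.
if [ -f wdump-1.nt.gz ]; then
  mv wdump-1.nt.gz genewiki_sub.nt.gz
fi
```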
  2. Import the dataset's .nt.gz file(s) via Blazegraph and launch a local endpoint over it:
$ wget https://github.com/blazegraph/database/releases/download/BLAZEGRAPH_2_1_6_RC/blazegraph.jar
$ java -cp blazegraph.jar com.bigdata.rdf.store.DataLoader fastloader.properties [dataset].nt.gz
$ java -server -Xmx4g -jar blazegraph.jar
  • Note: the 'fastloader.properties' file is in the Scripts directory.
  3. Run the queries in the SPARQL Queries directory against the Blazegraph local endpoint:
$ curl -X POST http://localhost:9999/blazegraph/sparql --data-urlencode query@[SPARQL QUERY from the dir].sparql -H 'Accept: text/csv' > [filename.csv]
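To run all the queries in one go, the curl call above can be wrapped in a loop (a sketch, assuming the .sparql files sit in the current directory and the endpoint runs at Blazegraph's default port 9999):

```shell
# Run every .sparql query against the local Blazegraph endpoint and
# save each result as a CSV file named after the query.
mkdir -p results
for q in *.sparql; do
  [ -e "$q" ] || continue   # no queries in this directory; nothing to do
  curl -s -X POST http://localhost:9999/blazegraph/sparql \
       --data-urlencode "query@$q" \
       -H 'Accept: text/csv' \
       -o "results/${q%.sparql}.csv"
done
```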
  4. Plot the charts via the Jupyter notebook provided in the Scripts directory.

About

Materials of the paper "Reference Statistics in Wikidata Topical Subsets", 2nd Wikidata Workshop with ISWC 2021 (http://ceur-ws.org/Vol-2982/paper-3.pdf)
