
Wikidata Reference Statistics


IMPORTANT NOTE

Due to a logical error in one of the queries, the data in one column of Table 1 in the paper changes slightly. The column shows the number and percentage of referenced statements. In the original query, DISTINCT was applied to the external (alias) variable instead of the internal variable (fixed in this commit). The corrected data for each dataset is given below. All other information in the paper, as well as the datasets themselves, is unaffected.

| Project   | Dump | Referenced Statements (paper) | Referenced Statements (corrected) |
|-----------|------|-------------------------------|-----------------------------------|
| Gene Wiki | 2016 | 8,789,246 (50%)               | 8,699,626 (49%)                   |
| Gene Wiki | 2021 | 65,780,005 (71%)              | 61,080,346 (66%)                  |
| Taxonomy  | 2016 | 8,146,218 (51%)               | 8,061,019 (50%)                   |
| Taxonomy  | 2021 | 19,423,938 (60%)              | 16,778,074 (52%)                  |
| Astronomy | 2016 | 751,158 (85%)                 | 695,795 (78%)                     |
| Astronomy | 2021 | 128,394,763 (89%)             | 127,751,791 (88%)                 |
| Law       | 2016 | 48,225 (27%)                  | 48,132 (27%)                      |
| Law       | 2021 | 2,266,462 (53%)               | 2,257,890 (53%)                   |
| Music     | 2016 | 2,298,330 (61%)               | 2,135,020 (57%)                   |
| Music     | 2021 | 6,342,019 (54%)               | 5,920,103 (51%)                   |
| Ships     | 2016 | 114,528 (62%)                 | 111,121 (61%)                     |
| Ships     | 2021 | 315,381 (29%)                 | 295,885 (27%)                     |

This repository contains the materials of the paper "Reference Statistics in Wikidata Topical Subsets", presented at the 2nd Wikidata Workshop at ISWC 2021. The contents are:

  • Query Results: The output of the SPARQL queries for each experiment.
  • Scripts: The Python script used to enrich WDumper specification files with subclasses, and the Jupyter notebook for plotting the charts.
  • SPARQL Queries: The SPARQL queries used to fetch reference statistics from the datasets.
  • WDumper Specification Files: The specification files for extracting topical subsets corresponding to 6 different WikiProjects via the WDumper tool. There are two JSON files for each project: the first contains only the top-level classes; the second (with the '_sub' suffix) is enriched with subclasses. The experiments were done using the latter.
  • ShEx schemata: The Shape Expression schemas of the 6 WikiProjects, suitable for subsetting through ShEx validators such as shex-js and PyShEx (slurping).

To reproduce the experiments

  1. Download the relevant dataset for the desired WikiProject from this Zenodo repository. There is one .nt.gz file (or a collection of them) for each project. Alternatively, you can use WDumper with the specification files provided in the WDumper Specification Files directory and build each project's dataset from scratch. To use WDumper, go through the following steps:
$ git clone https://github.com/bennofs/wdumper.git
$ cd wdumper/
$ gradle build   # you will need JDK-11 and gradle
$ cd build/install/wdumper/bin/
$ ./wdumper-cli [wikidata dump.json.gz] [specification file.json]
  • Note: the output will be a file named 'wdump-1.nt.gz'. Before starting another extraction, rename or move this file, as it will be overwritten.
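For example, each run's output can be moved aside before the next extraction (a sketch; the target filename 'genewiki_sub.nt.gz' is only an illustration, not a name required by the tools):

```shell
# Move the fixed-name WDumper output aside so the next run does not overwrite it.
# 'genewiki_sub.nt.gz' is an example target name.
if [ -f wdump-1.nt.gz ]; then
  mv wdump-1.nt.gz genewiki_sub.nt.gz
fi
```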
  2. Import the dataset's .nt.gz file(s) via Blazegraph and launch a local endpoint over it:
$ wget https://github.com/blazegraph/database/releases/download/BLAZEGRAPH_2_1_6_RC/blazegraph.jar
$ java -cp blazegraph.jar com.bigdata.rdf.store.DataLoader fastloader.properties [dataset].nt.gz
$ java -server -Xmx4g -jar blazegraph.jar
  • Note: the 'fastloader.properties' file is in the Scripts directory.
  3. Run the queries in the SPARQL Queries directory against the Blazegraph local endpoint:
$ curl -X POST http://localhost:9999/blazegraph/sparql --data-urlencode query@[SPARQL QUERY from the dir].sparql -H 'Accept: text/csv' > [filename.csv]
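To run all the queries in one go, the curl call above can be wrapped in a loop (a sketch, assuming the .sparql files sit in the current directory and the endpoint runs at Blazegraph's default port 9999):

```shell
# Run every .sparql query against the local Blazegraph endpoint and
# save each result as a CSV file named after the query.
mkdir -p results
for q in *.sparql; do
  [ -e "$q" ] || continue   # no queries in this directory; nothing to do
  curl -s -X POST http://localhost:9999/blazegraph/sparql \
       --data-urlencode "query@$q" \
       -H 'Accept: text/csv' \
       -o "results/${q%.sparql}.csv"
done
```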
  4. Plot the charts via the Jupyter notebook provided in the Scripts directory.

About

Materials of the paper "Reference Statistics in Wikidata Topical Subsets", 2nd Wikidata Workshop with ISWC 2021 (http://ceur-ws.org/Vol-2982/paper-3.pdf)
