Issue 280, quering Wikidata

This wikipage is for issue-280's working group subsidy and reference. See also Wikidata's Guidelines for external relationships.

Introduction
Quering and exporting
Results

Introduction

The issue 280 started with the title's suggestion, "Schema.org should have mappings to Wikidata terms where possible", but the simplest and collaborative way is to feed Wikidata: these are the basic @thadguidry rules to accomplish the task at Wikidata editions,

Schema.org types mapped using exact match (P2888);
Schema.org types mapped using (external) Equivalent Class (P1709);
Schema.org properties mapped using (external) Equivalent Property (P1628);
When there's no equivalent but there is a sub/super available, then it (Schema.org properties or types) will be mapped using:
- (external) Super-property (P2235);
- (external) Sub-property (P2236).

Now, to get back the information, we need "figure out the SPARQL for query.wikidata.org that would extract these mappings", as @danbri suggested.

Quering and exporting

The "wanted universe" is provided by a simple query, and perhaps works fine for a local Wikidata user (at the query.wikidata.org's server.
See also Issue 280, quering Wikidata, OLD QUERIES.

Standard sparql query

All periodic reports was generated by a "standard query", to get relationship information (equivclass, equivprop, sub or super) and export result to other algorithms, databases or spreadsheets.

SELECT ?wd ?wdLabel ?corrName ?schema
{
  values (?corr ?corrName) 
    { (wdt:P2235 "superProp") (wdt:P2236 "subProp") (wdt:P1628 "equivProp") 
      (wdt:P1709 "equivClass") (wdt:P2888 "exactMatch")
    }
  ?wd ?corr ?schema
  filter(regex(str(?schema), "schema.org"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} order by ?corrName ?schema

Standard SQL database

After download as CSV (eg. file wd2schema-raw-2016-06.csv), produce a standard SQL table to manipulate data,

DROP TABLE IF EXISTS wd2schema_std;

CREATE TABLE  IF NOT EXISTS wd2schema_std_temp (
	pLabel text, p text,
	equivclass text, equivprop text, sub text, super text, exact text
);

COPY wd2schema_std_temp FROM '/tmp/wd2schema-raw-2016-06.csv' CSV HEADER;

-- adapat to better structure:
CREATE TABLE wd2schema_std AS 
  SELECT
    COALESCE(equivclass,equivprop,sub,super) as sch_id,
    CASE 
	    WHEN equivclass IS NOT NULL THEN 'equivclass'
	    WHEN equivprop IS NOT NULL THEN 'equivprop'
	    WHEN sub IS NOT NULL THEN 'sub'
	    WHEN super IS NOT NULL THEN 'super'
	    WHEN exact IS NOT NULL THEN 'exact'
    END as reltype,
    p As wd_id,
    plabel as wd_label
  FROM wd2schema_std_temp
;
CREATE VIEW vw_wd2schema_std AS
  SELECT *, reltype='equivprop' as rel_isprop,
        ascii(substring(sch_id,1,1)::char)>87 as sch_isprop, 
        substring(wd_id,1,1)='P' wd_isprop 
  FROM wd2schema_std ORDER BY sch_id,reltype
;
DROP TABLE wd2schema_std_temp;

-- clean:
UPDATE wd2schema_std SET 
   wd_id  = regexp_replace(wd_id,  '^https?://[a-z\.]+/?', ''),
   sch_id = regexp_replace(sch_id, '^https?://[a-z\.]+/?', '')
;
-- back as new file:
COPY (SELECT * FROM wd2schema_std ORDER BY sch_id,reltype) 
TO '/tmp/wd2schema-std-2017-05.csv' WITH CSV HEADER;

So, as input we have wd2schema-raw-2016-06.csv and as output wd2schema-std-2016-06.csv, and the SQL database with the same data to perform queries.

Summarizations

SELECT count(*) as n_tot FROM wd2schema_std;

-- summarize reltypes
SELECT reltype, count(*) as n  FROM wd2schema_std
GROUP BY 1 ORDER BY 1;

-- summarize reltypes and pure_prop
WITH t AS (
  SELECT *, (rel_isprop AND sch_isprop AND wd_isprop) as pure_prop
  FROM vw_wd2schema_std
) SELECT reltype, pure_prop, count(*) as n
  FROM t GROUP BY 1,2 ORDER BY 1,2;

-- summarize schemaOrg name that repeat
SELECT sch_id, count(*) as n FROM wd2schema_std
GROUP BY 1  HAVING count(*)>1
ORDER BY 2 DESC, 1;

-- summarize plabel that repeat
SELECT wd_label, count(*) as n FROM wd2schema_std
GROUP BY 1  HAVING count(*)>1
ORDER BY 2 DESC, 1;

-- summarize wd_id that repeat
SELECT wd_id, count(*) as n FROM wd2schema_std
GROUP BY 1  HAVING count(*)>1
ORDER BY 2 DESC, 1;

Results

Quering and reporting results.

in 2017-05

n_tot = 508 (without "exact" reduces to 444)
repeated wd_label: service (19), seat (5), composer (3), duration (3), ..., answer (2), audience (2), ...
repeated sch_id: (empty 64), AdministrativeArea (2), audience (2), award (2), BankOrCreditUnion (2), BarOrPub (2), brand (2)... 25 cases.
repeated wd_id: Q7406919 (19), Q2731419 (5), P710 (3), Q42253 (3), P580 (2), P580 (2), P86 (2), Q1004 (2), Q13442814 (2)... 23 cases.
reltypes:

reltype	n
equivclass	332
equivprop	84
exact	64
sub	22
super	6

Results in 2016-10

n_tot = 435
repeated wd_label: service (13), seat (5), brand (2), collection (2), composer (2), duration (2), has part (2) ...
repeated wd_id: Q7406919 (19), Q2731419 (5), P527 (2), P580 (2), P580 (2), P710 (2), P86 (2), Q186005 (2).
repeated sch_id: AdministrativeArea (2), audience (2), award (2), BankOrCreditUnion (2), BarOrPub (2), brand (2)...
reltypes:

reltype	n
equivclass	327
equivprop	82
sub	22
super	4

in 2016-08

n_tot = 182
repeated wd_label: location (13), brand (3), image (3), volume (3), author (2), child (2), director (2) ...
repeated wd_id: P276 (13), P40 (2), P433 (2), P478 (2), Q1656682 (2), Q42253 (2), Q431289 (2), Q478798 (2).
repeated sch_id: deathPlace (3), name (3), actor (2), author (2), birthDate (2), birthPlace (2), brand (2), ...
reltypes:

reltype	n
equivclass	80
pure equivprop	63
equivprop	15
sub	12
super	12

in 2016-06

n_tot = 156
repeated wd_label: location (13), image (3), audience (2), author (2), brand (2), ..., Uniform Resource Locator (2). Only 12 (8%).
reltypes:

reltype	n
equivclass	68
equivprop	66
sub	12
super	10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly