Skip to content

Issue 280, quering Wikidata

Peter edited this page Jul 12, 2017 · 19 revisions

This wikipage is for issue-280's working group subsidy and reference. See also Wikidata's Guidelines for external relationships.


Introduction

The issue 280 started with the title's suggestion, "Schema.org should have mappings to Wikidata terms where possible", but the simplest and collaborative way is to feed Wikidata: these are the basic @thadguidry rules to accomplish the task at Wikidata editions,


Now, to get back the information, we need "figure out the SPARQL for query.wikidata.org that would extract these mappings", as @danbri suggested.

Quering and exporting

The "wanted universe" is provided by a simple query, and perhaps works fine for a local Wikidata user (at the query.wikidata.org's server.
See also Issue 280, quering Wikidata, OLD QUERIES.

Standard sparql query

All periodic reports was generated by a "standard query", to get relationship information (equivclass, equivprop, sub or super) and export result to other algorithms, databases or spreadsheets.

SELECT ?wd ?wdLabel ?corrName ?schema
{
  values (?corr ?corrName) 
    { (wdt:P2235 "superProp") (wdt:P2236 "subProp") (wdt:P1628 "equivProp") 
      (wdt:P1709 "equivClass") (wdt:P2888 "exactMatch")
    }
  ?wd ?corr ?schema
  filter(regex(str(?schema), "schema.org"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} order by ?corrName ?schema

Standard SQL database

After download as CSV (eg. file wd2schema-raw-2016-06.csv), produce a standard SQL table to manipulate data,

DROP TABLE IF EXISTS wd2schema_std;

CREATE TABLE  IF NOT EXISTS wd2schema_std_temp (
	pLabel text, p text,
	equivclass text, equivprop text, sub text, super text, exact text
);

COPY wd2schema_std_temp FROM '/tmp/wd2schema-raw-2016-06.csv' CSV HEADER;

-- adapat to better structure:
CREATE TABLE wd2schema_std AS 
  SELECT
    COALESCE(equivclass,equivprop,sub,super) as sch_id,
    CASE 
	    WHEN equivclass IS NOT NULL THEN 'equivclass'
	    WHEN equivprop IS NOT NULL THEN 'equivprop'
	    WHEN sub IS NOT NULL THEN 'sub'
	    WHEN super IS NOT NULL THEN 'super'
	    WHEN exact IS NOT NULL THEN 'exact'
    END as reltype,
    p As wd_id,
    plabel as wd_label
  FROM wd2schema_std_temp
;
CREATE VIEW vw_wd2schema_std AS
  SELECT *, reltype='equivprop' as rel_isprop,
        ascii(substring(sch_id,1,1)::char)>87 as sch_isprop, 
        substring(wd_id,1,1)='P' wd_isprop 
  FROM wd2schema_std ORDER BY sch_id,reltype
;
DROP TABLE wd2schema_std_temp;

-- clean:
UPDATE wd2schema_std SET 
   wd_id  = regexp_replace(wd_id,  '^https?://[a-z\.]+/?', ''),
   sch_id = regexp_replace(sch_id, '^https?://[a-z\.]+/?', '')
;
-- back as new file:
COPY (SELECT * FROM wd2schema_std ORDER BY sch_id,reltype) 
TO '/tmp/wd2schema-std-2017-05.csv' WITH CSV HEADER;

So, as input we have wd2schema-raw-2016-06.csv and as output wd2schema-std-2016-06.csv, and the SQL database with the same data to perform queries.

Summarizations

SELECT count(*) as n_tot FROM wd2schema_std;

-- summarize reltypes
SELECT reltype, count(*) as n  FROM wd2schema_std
GROUP BY 1 ORDER BY 1;

-- summarize reltypes and pure_prop
WITH t AS (
  SELECT *, (rel_isprop AND sch_isprop AND wd_isprop) as pure_prop
  FROM vw_wd2schema_std
) SELECT reltype, pure_prop, count(*) as n
  FROM t GROUP BY 1,2 ORDER BY 1,2;

-- summarize schemaOrg name that repeat
SELECT sch_id, count(*) as n FROM wd2schema_std
GROUP BY 1  HAVING count(*)>1
ORDER BY 2 DESC, 1;

-- summarize plabel that repeat
SELECT wd_label, count(*) as n FROM wd2schema_std
GROUP BY 1  HAVING count(*)>1
ORDER BY 2 DESC, 1;

-- summarize wd_id that repeat
SELECT wd_id, count(*) as n FROM wd2schema_std
GROUP BY 1  HAVING count(*)>1
ORDER BY 2 DESC, 1;

Results

Quering and reporting results.

in 2017-05

  • n_tot = 508   (without "exact" reduces to 444)

  • repeated wd_label: service (19), seat (5), composer (3), duration (3), ..., answer (2), audience (2), ...

  • repeated sch_id: (empty 64), AdministrativeArea (2), audience (2), award (2), BankOrCreditUnion (2), BarOrPub (2), brand (2)... 25 cases.

  • repeated wd_id: Q7406919 (19), Q2731419 (5), P710 (3), Q42253 (3), P580 (2), P580 (2), P86 (2), Q1004 (2), Q13442814 (2)... 23 cases.

  • reltypes:

reltype n
equivclass 332
equivprop 84
exact 64
sub 22
super 6

Results in 2016-10

  • n_tot = 435
  • repeated wd_label: service (13), seat (5), brand (2), collection (2), composer (2), duration (2), has part (2) ...
  • repeated wd_id: Q7406919 (19), Q2731419 (5), P527 (2), P580 (2), P580 (2), P710 (2), P86 (2), Q186005 (2).
  • repeated sch_id: AdministrativeArea (2), audience (2), award (2), BankOrCreditUnion (2), BarOrPub (2), brand (2)...
  • reltypes:
reltype n
equivclass 327
equivprop 82
sub 22
super 4

in 2016-08

  • n_tot = 182
  • repeated wd_label: location (13), brand (3), image (3), volume (3), author (2), child (2), director (2) ...
  • repeated wd_id: P276 (13), P40 (2), P433 (2), P478 (2), Q1656682 (2), Q42253 (2), Q431289 (2), Q478798 (2).
  • repeated sch_id: deathPlace (3), name (3), actor (2), author (2), birthDate (2), birthPlace (2), brand (2), ...
  • reltypes:
reltype n
equivclass 80
pure equivprop 63
equivprop 15
sub 12
super 12

in 2016-06

  • n_tot = 156
  • repeated wd_label: location (13), image (3), audience (2), author (2), brand (2), ..., Uniform Resource Locator (2). Only 12 (8%).
  • reltypes:
reltype n
equivclass 68
equivprop 66
sub 12
super 10