Issue 280, quering Wikidata
This wikipage is for issue-280's working group subsidy and reference. See also Wikidata's Guidelines for external relationships.
The issue 280 started with the title's suggestion, "Schema.org should have mappings to Wikidata terms where possible", but the simplest and collaborative way is to feed Wikidata: these are the basic @thadguidry rules to accomplish the task at Wikidata editions,
-
Schema.org types mapped using exact match (P2888);
-
Schema.org types mapped using (external) Equivalent Class (P1709);
-
Schema.org properties mapped using (external) Equivalent Property (P1628);
-
When there's no equivalent but there is a sub/super available, then it (Schema.org properties or types) will be mapped using:
Now, to get back the information, we need "figure out the SPARQL for query.wikidata.org that would extract these mappings", as @danbri suggested.
The "wanted universe" is provided by a simple query, and perhaps works fine for a local Wikidata user (at the query.wikidata.org's server.
See also Issue 280, quering Wikidata, OLD QUERIES.
All periodic reports was generated by a "standard query", to get relationship information (equivclass, equivprop, sub or super) and export result to other algorithms, databases or spreadsheets.
SELECT ?wd ?wdLabel ?corrName ?schema
{
values (?corr ?corrName)
{ (wdt:P2235 "superProp") (wdt:P2236 "subProp") (wdt:P1628 "equivProp")
(wdt:P1709 "equivClass") (wdt:P2888 "exactMatch")
}
?wd ?corr ?schema
filter(regex(str(?schema), "schema.org"))
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} order by ?corrName ?schema
After download as CSV (eg. file wd2schema-raw-2016-06.csv
), produce a standard SQL table to manipulate data,
DROP TABLE IF EXISTS wd2schema_std;
CREATE TABLE IF NOT EXISTS wd2schema_std_temp (
pLabel text, p text,
equivclass text, equivprop text, sub text, super text, exact text
);
COPY wd2schema_std_temp FROM '/tmp/wd2schema-raw-2016-06.csv' CSV HEADER;
-- adapat to better structure:
CREATE TABLE wd2schema_std AS
SELECT
COALESCE(equivclass,equivprop,sub,super) as sch_id,
CASE
WHEN equivclass IS NOT NULL THEN 'equivclass'
WHEN equivprop IS NOT NULL THEN 'equivprop'
WHEN sub IS NOT NULL THEN 'sub'
WHEN super IS NOT NULL THEN 'super'
WHEN exact IS NOT NULL THEN 'exact'
END as reltype,
p As wd_id,
plabel as wd_label
FROM wd2schema_std_temp
;
CREATE VIEW vw_wd2schema_std AS
SELECT *, reltype='equivprop' as rel_isprop,
ascii(substring(sch_id,1,1)::char)>87 as sch_isprop,
substring(wd_id,1,1)='P' wd_isprop
FROM wd2schema_std ORDER BY sch_id,reltype
;
DROP TABLE wd2schema_std_temp;
-- clean:
UPDATE wd2schema_std SET
wd_id = regexp_replace(wd_id, '^https?://[a-z\.]+/?', ''),
sch_id = regexp_replace(sch_id, '^https?://[a-z\.]+/?', '')
;
-- back as new file:
COPY (SELECT * FROM wd2schema_std ORDER BY sch_id,reltype)
TO '/tmp/wd2schema-std-2017-05.csv' WITH CSV HEADER;
So, as input we have wd2schema-raw-2016-06.csv
and as output wd2schema-std-2016-06.csv
, and the SQL database with the same data to perform queries.
SELECT count(*) as n_tot FROM wd2schema_std;
-- summarize reltypes
SELECT reltype, count(*) as n FROM wd2schema_std
GROUP BY 1 ORDER BY 1;
-- summarize reltypes and pure_prop
WITH t AS (
SELECT *, (rel_isprop AND sch_isprop AND wd_isprop) as pure_prop
FROM vw_wd2schema_std
) SELECT reltype, pure_prop, count(*) as n
FROM t GROUP BY 1,2 ORDER BY 1,2;
-- summarize schemaOrg name that repeat
SELECT sch_id, count(*) as n FROM wd2schema_std
GROUP BY 1 HAVING count(*)>1
ORDER BY 2 DESC, 1;
-- summarize plabel that repeat
SELECT wd_label, count(*) as n FROM wd2schema_std
GROUP BY 1 HAVING count(*)>1
ORDER BY 2 DESC, 1;
-- summarize wd_id that repeat
SELECT wd_id, count(*) as n FROM wd2schema_std
GROUP BY 1 HAVING count(*)>1
ORDER BY 2 DESC, 1;
Quering and reporting results.
-
n_tot
= 508 (without "exact" reduces to 444) -
repeated
wd_label
: service (19), seat (5), composer (3), duration (3), ..., answer (2), audience (2), ... -
repeated
sch_id
: (empty 64), AdministrativeArea (2), audience (2), award (2), BankOrCreditUnion (2), BarOrPub (2), brand (2)... 25 cases. -
repeated
wd_id
: Q7406919 (19), Q2731419 (5), P710 (3), Q42253 (3), P580 (2), P580 (2), P86 (2), Q1004 (2), Q13442814 (2)... 23 cases. -
reltypes:
reltype | n |
---|---|
equivclass | 332 |
equivprop | 84 |
exact | 64 |
sub | 22 |
super | 6 |
-
n_tot
= 435 - repeated
wd_label
: service (13), seat (5), brand (2), collection (2), composer (2), duration (2), has part (2) ... - repeated
wd_id
: Q7406919 (19), Q2731419 (5), P527 (2), P580 (2), P580 (2), P710 (2), P86 (2), Q186005 (2). - repeated
sch_id
: AdministrativeArea (2), audience (2), award (2), BankOrCreditUnion (2), BarOrPub (2), brand (2)... - reltypes:
reltype | n |
---|---|
equivclass | 327 |
equivprop | 82 |
sub | 22 |
super | 4 |
-
n_tot
= 182 - repeated
wd_label
: location (13), brand (3), image (3), volume (3), author (2), child (2), director (2) ... - repeated
wd_id
: P276 (13), P40 (2), P433 (2), P478 (2), Q1656682 (2), Q42253 (2), Q431289 (2), Q478798 (2). - repeated
sch_id
: deathPlace (3), name (3), actor (2), author (2), birthDate (2), birthPlace (2), brand (2), ... - reltypes:
reltype | n |
---|---|
equivclass | 80 |
pure equivprop | 63 |
equivprop | 15 |
sub | 12 |
super | 12 |
-
n_tot
= 156 - repeated
wd_label
: location (13), image (3), audience (2), author (2), brand (2), ..., Uniform Resource Locator (2). Only 12 (8%). - reltypes:
reltype | n |
---|---|
equivclass | 68 |
equivprop | 66 |
sub | 12 |
super | 10 |