Make Wikidata primary source for Pakistan #119834

Open
lucychambers opened this issue Aug 29, 2017 · 13 comments

Comments

@lucychambers

Limit initially to:

  • Current term (14th Assembly)
  • Lower House

This will involve creating multiple prompts to make sure that Wikidata is sufficient:

  • Wikidata ⇆ Wikipedia
  • Wikidata ⇆ Official site
  • (Maybe, whilst in progress) Wikidata ⇆ EP

Acceptance Criteria

  • All membership sources (morph/official.csv and archive/official-term-14.csv) are replaced by information from Wikidata.
  • No information is lost (unless it was incorrect): anything not in Wikidata, and not suitable for adding there, can be added to a new person source.
@tmtmtmtm
Contributor

For now it's OK to make a morph scraper to get the membership/P39 information out of Wikidata — we won't know the correct abstractions for doing something more general until we've done quite a few more of these.
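As a rough illustration of what such a scraper would ask Wikidata for, here is a minimal sketch that builds a SPARQL query for holders of a given P39 position, with optional start (P580) and end (P582) qualifiers. Python is used only for illustration; the function name and the selected variable names are assumptions, not the actual scraper code. The position ID is the one mentioned later in this thread.

```python
def build_membership_query(position_qid: str) -> str:
    """Return a SPARQL query listing holders of a position (P39) with dates.

    Hypothetical helper: the real scraper may select different fields.
    """
    return f"""
SELECT ?item ?itemLabel ?start ?end WHERE {{
  ?item p:P39 ?statement .
  ?statement ps:P39 wd:{position_qid} .
  OPTIONAL {{ ?statement pq:P580 ?start . }}
  OPTIONAL {{ ?statement pq:P582 ?end . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


# Member of the 14th National Assembly of Pakistan
query = build_membership_query("Q33512801")
print(query)
```

Going through the statement node (`p:`/`ps:`/`pq:`) rather than `wdt:P39` is what makes the start and end qualifiers reachable.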

@chrismytton
Contributor

I've made an initial prompt which compares what's in the Wikipedia scraper with Wikidata items that have a current (no end date) Member of the 14th National Assembly of Pakistan (Q33512801) P39 entry.

https://www.wikidata.org/wiki/User:Chris_Mytton/sandbox/prompts/Pakistan_National_Assembly
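The comparison behind a prompt like this can be sketched roughly as a set difference between the scraped IDs and the Wikidata items whose membership statement has no end date. This is only an illustration: the function names, the row shape (`item`, `end`) and the placeholder IDs are assumptions, not the prompt's real implementation.

```python
def current_members(wikidata_rows):
    """Keep the item IDs whose membership statement has no end date."""
    return {row["item"] for row in wikidata_rows if not row.get("end")}


def compare(scraped_ids, wikidata_rows):
    """Report IDs present on one side only (hypothetical helper)."""
    wikidata_ids = current_members(wikidata_rows)
    scraped = set(scraped_ids)
    return {
        "missing_from_wikidata": sorted(scraped - wikidata_ids),
        "missing_from_scraper": sorted(wikidata_ids - scraped),
    }


# Made-up placeholder data, not real members.
rows = [
    {"item": "Q-example-1", "end": None},
    {"item": "Q-example-2", "end": "2016-01-01"},
    {"item": "Q-example-3", "end": None},
]
report = compare(["Q-example-1", "Q-example-2"], rows)
print(report)
```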

@lucychambers
Author

lucychambers commented Sep 5, 2017

Outstanding:

  • Scraper for historical members (within current term).
    These are the people in the "Membership changes" section of the List of members of the 14th National Assembly of Pakistan page on Wikipedia.
  • Prompt for the scraped historical members.
  • Generate Quickstatements TSV to add missing memberships for historical members
  • Write a membership scraper to get term 14 data from Wikidata
  • EP comparison prompt. EP master branch <-> Wikidata.
  • Switch over EveryPolitician to treat Wikidata as the main source, but don't merge yet
  • Once any problems have been fixed, merge the EP branch to complete the switch

@chrismytton
Contributor

I've also created a prompt for the official site.

That prompt uses a manually generated CSV that combines the scraper output with EveryPolitician reconciliation information. I built it with something similar to the following command, run from data/Pakistan/Assembly in ep-data, where ~/Downloads/pakistan-national-assembly.csv is the scraper output.

q -H -d, -O 'select w.id as wikidata, name
 from ~/Downloads/pakistan-national-assembly.csv o
 left join sources/idmap/official.csv oid on oid.id = o.id
 left join sources/reconciliation/wikidata.csv w on w.uuid = oid.uuid'
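For anyone without `q` installed, the same two left joins can be sketched in plain Python. This is an approximation under stated assumptions: the column names (`id`, `uuid`, `name`) are taken from the query above, the function name is made up, and each ID is assumed to map to at most one UUID (a SQL join would emit duplicate rows otherwise).

```python
def left_join_wikidata(official_rows, idmap_rows, recon_rows):
    """Attach a Wikidata ID to each official-site row, or None if unmatched."""
    uuid_by_official_id = {r["id"]: r["uuid"] for r in idmap_rows}
    wikidata_by_uuid = {r["uuid"]: r["id"] for r in recon_rows}
    joined = []
    for row in official_rows:
        uuid = uuid_by_official_id.get(row["id"])
        # Left-join semantics: keep the row even when no match exists.
        joined.append({"wikidata": wikidata_by_uuid.get(uuid), "name": row["name"]})
    return joined


# Made-up placeholder rows standing in for the three CSV files.
official = [{"id": "1", "name": "Example Member A"}, {"id": "2", "name": "Example Member B"}]
idmap = [{"id": "1", "uuid": "uuid-a"}, {"id": "2", "uuid": "uuid-b"}]
recon = [{"uuid": "uuid-a", "id": "Q-example-1"}]
result = left_join_wikidata(official, idmap, recon)
print(result)
```

Rows with no reconciled Wikidata ID come back with `wikidata` set to `None`, which corresponds to the empty cells the SQL version would produce.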

@chrismytton
Contributor

Once #53037 has been merged that should fix a couple of issues with the official site prompt.

I've also been seeing a strange error with the official site prompt: sometimes the SPARQL returns 330 results, but clicking through and running the SPARQL manually returns 339 results, as expected. I'm not sure if there's anything that can be done or if it's just transient, but it's worth watching out for.

@chrismytton
Contributor

I've updated the Wikipedia scraper to pick up historic members from the "Membership changes" table in everypolitician-scrapers/pakistan-national-assembly-wikipedia@62d4ac5.

@chrismytton
Contributor

Prompt for the historic members of the 14th term is here:

https://www.wikidata.org/wiki/User:Chris_Mytton/sandbox/prompts/Pakistan_National_Assembly_historic

@chrismytton
Contributor

I've generated Quickstatements (docs) for the missing historic term 14 members here:

https://gist.github.com/chrismytton/aa224963a46b92dc273569af7355a512
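For reference, a batch like this can be produced with a few lines of code. The sketch below emits one tab-separated QuickStatements (version 1 syntax) line per person, adding a P39 statement pointing at Q33512801. The function name and the placeholder IDs are assumptions; this is not the script that produced the gist.

```python
def membership_statements(person_qids, position_qid="Q33512801"):
    """Build QuickStatements lines: item <TAB> P39 <TAB> position."""
    return "\n".join(f"{qid}\tP39\t{position_qid}" for qid in person_qids)


# Placeholder IDs; real input would be the QIDs of the missing members.
batch = membership_statements(["Q-example-1", "Q-example-2"])
print(batch)
```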

@chrismytton
Contributor

I've now run that batch of Quickstatements, so the people on the historic term 14 prompt should now all have a "Member of the 14th National Assembly of Pakistan" P39 statement.

@chrismytton
Contributor

The members that I've just added a term 14 P39 for are missing start and end dates, because they weren't simple to scrape from the Wikipedia page. @lucychambers has kindly volunteered to go through and manually add them for the 22 members on the prompt, thanks Lucy!
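The dates are being added by hand here, but for a larger batch they could in principle go in as QuickStatements qualifiers. A minimal sketch, assuming day-precision dates in the `+YYYY-MM-DDT00:00:00Z/11` form and the P580 (start time) and P582 (end time) qualifiers; the function names and the dates below are made up, and it is worth checking on a single item whether QuickStatements attaches the qualifiers to the existing statement before running a batch.

```python
from datetime import date


def qs_date(d: date) -> str:
    """Format a date as a QuickStatements time value with day precision (/11)."""
    return f"+{d.isoformat()}T00:00:00Z/11"


def membership_with_dates(person_qid, start, end, position_qid="Q33512801"):
    """One tab-separated line: P39 statement plus start/end qualifiers."""
    return "\t".join(
        [person_qid, "P39", position_qid, "P580", qs_date(start), "P582", qs_date(end)]
    )


# Placeholder ID and invented dates, for illustration only.
line = membership_with_dates("Q-example-1", date(2013, 6, 1), date(2015, 3, 15))
print(line)
```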

@chrismytton
Contributor

Before we can switch EveryPolitician over to using Wikidata as the primary membership source, we need to create a scraper. The tonga-assembly-wikidata scraper is probably the best example we have to work from. It was created during a previous attempt to switch a country to using Wikidata, so it should in theory have all the fields we need.

@chrismytton
Contributor

chrismytton commented Sep 28, 2017

I've created the scraper for getting membership information from Wikidata:

@chrismytton
Contributor

Prompt created at User:Chris_Mytton/sandbox/prompts/Pakistan_National_Assembly_EveryPolitician which compares the term-14.csv file for Pakistan on EveryPolitician's master branch with what's currently in Wikidata.
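A comparison of this kind could look roughly like the sketch below, which loads two CSVs keyed by Wikidata ID and lists the date fields that disagree. The column names (`wikidata`, `start_date`, `end_date`), the function names and the sample rows are all assumptions for illustration; the real term-14.csv and the prompt may be organised differently.

```python
import csv
import io


def load_by_qid(csv_text, key="wikidata"):
    """Index CSV rows by their Wikidata ID, skipping rows without one."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader if row.get(key)}


def date_mismatches(ep_rows, wd_rows, fields=("start_date", "end_date")):
    """List (qid, field, ep_value, wikidata_value) for every disagreement."""
    diffs = []
    for qid in sorted(ep_rows.keys() & wd_rows.keys()):
        for field in fields:
            ep_value = ep_rows[qid].get(field) or ""
            wd_value = wd_rows[qid].get(field) or ""
            if ep_value != wd_value:
                diffs.append((qid, field, ep_value, wd_value))
    return diffs


# Invented sample data standing in for the EP file and the Wikidata export.
ep_csv = "wikidata,start_date,end_date\nQ-example-1,2013-06-01,\nQ-example-2,2013-06-01,2015-01-01\n"
wd_csv = "wikidata,start_date,end_date\nQ-example-1,2013-06-01,\nQ-example-2,2013-06-01,2015-02-01\n"
diffs = date_mismatches(load_by_qid(ep_csv), load_by_qid(wd_csv))
print(diffs)
```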

@tmtmtmtm tmtmtmtm transferred this issue from everypolitician/everypolitician Nov 9, 2018