Tutorial:Querying Wikipedia with mwclient
mwclient is a library for accessing the MediaWiki API from Python. MediaWiki powers Wikipedia and a bunch of other wikis. In this quick guide, we'll look at how we can use mwclient to query any MediaWiki-powered site for the information we want.
Installing mwclient
After installing mwclient (see the README), launch Python and confirm it's installed:
>>> import mwclient
If that didn't raise any errors, congratulations! You're all set to go.
Here's how you connect to Wikipedia and ask for revisions of the Wikipedia:Sandbox page:
>>> import mwclient
>>> from pprint import pprint
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Wikipedia:Sandbox']
>>> revisions = page.revisions()
>>> for counter in range(5):
...     rev = next(revisions)
...     pprint(rev)
...
{'revid': 290932490,
'timestamp': (2009, 5, 19, 12, 43, 13, 1, 139, -1),
'user': 'Benlisquare'}
{'anon': '',
'revid': 290930263,
'timestamp': (2009, 5, 19, 12, 29, 23, 1, 139, -1),
'user': '62.254.235.147'}
{'anon': '',
'revid': 290930082,
'timestamp': (2009, 5, 19, 12, 28, 16, 1, 139, -1),
'user': '166.216.160.16'}
{'comment': 'Clearing the sandbox ([[WP:BOT|BOT]] EDIT)',
'revid': 290927544,
'timestamp': (2009, 5, 19, 12, 10, 6, 1, 139, -1),
'user': 'SoxBot'}
{'anon': '',
'revid': 290927187,
'timestamp': (2009, 5, 19, 12, 7, 29, 1, 139, -1),
'user': '62.254.235.147'}
Compare the output you get with the page’s revision history on Wikipedia. They should match.
Calling page.revisions() gives us a generator that returns revisions in reverse chronological order, with the most recent edit first. Each revision is a dictionary containing the keys you see above. The optional anon key indicates an anonymous edit; user then contains the editor's IP address instead of user name. All keys and string values will be Unicode strings.
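Since anon and comment only show up on some revisions, code that reads these dictionaries has to allow for missing keys. Here's a minimal sketch of one way to do that. The describe_revision helper is a made-up name for illustration, not part of mwclient, and the sample list is copied from the output above so the snippet runs without a network connection.

```python
from itertools import islice


def describe_revision(rev):
    """Return a one-line summary of a revision dictionary (hypothetical helper)."""
    # 'anon' is only present for anonymous edits, so test for the key
    # itself rather than for its (empty) value.
    kind = 'anonymous' if 'anon' in rev else 'registered'
    summary = '%d by %s (%s)' % (rev['revid'], rev['user'], kind)
    # 'comment' can be missing too; fall back to an empty string.
    comment = rev.get('comment', '')
    if comment:
        summary += ': ' + comment
    return summary


# Sample data copied from the output above. With a live connection you
# would iterate over page.revisions() instead of this list.
sample = [
    {'revid': 290932490, 'user': 'Benlisquare'},
    {'anon': '', 'revid': 290930263, 'user': '62.254.235.147'},
    {'comment': 'Clearing the sandbox ([[WP:BOT|BOT]] EDIT)',
     'revid': 290927544, 'user': 'SoxBot'},
]

# islice takes the first n items from any iterable, generators included.
summaries = [describe_revision(rev) for rev in islice(sample, 2)]
for line in summaries:
    print(line)
```

Using islice instead of a counter loop also means the code stops quietly if the page has fewer revisions than you asked for.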
To get all edits between two dates in the forward direction, with the text content of each revision, do this:
>>> revisions = page.revisions(start='2009-05-19T00:00:00Z',
...                            end='2009-05-19T23:59:59Z',
...                            dir='newer',
...                            prop='ids|timestamp|flags|comment|user|content')
And here's how to get all the edits of any given user. Let's look at SoxBot from the revisions above:
>>> contribs = site.usercontributions(u'SoxBot')
>>> for counter in range(2):
...     rev = next(contribs)
...     pprint(rev)
...
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
'ns': 3,
'pageid': 17244650,
'revid': 290942689,
'timestamp': (2009, 5, 19, 13, 44, 26, 1, 139, -1),
'title': 'User talk:Twinzor',
'top': '',
'user': 'SoxBot'}
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
'ns': 3,
'pageid': 21352732,
'revid': 290942678,
'timestamp': (2009, 5, 19, 13, 44, 23, 1, 139, -1),
'title': 'User talk:Turco85',
'top': '',
'user': 'SoxBot'}
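Once you have contribution dictionaries like these, you can summarise them with plain Python. The sketch below counts edits per namespace and lists the pages where the edit carries the top flag, which, as far as I know, marks an edit that is still the latest revision of its page. Both function names are invented for this example, and the sample data is trimmed from the output above.

```python
from collections import Counter


def count_by_namespace(contribs):
    """Count contribution dictionaries per namespace number ('ns')."""
    return Counter(rev['ns'] for rev in contribs)


def titles_with_top_flag(contribs):
    """Return the titles of contributions that carry the 'top' flag."""
    # Like 'anon', the flag is signalled by the key being present.
    return [rev['title'] for rev in contribs if 'top' in rev]


# Sample data trimmed from the output above. With a live connection you
# would pass site.usercontributions('SoxBot') instead.
sample_contribs = [
    {'ns': 3, 'revid': 290942689, 'title': 'User talk:Twinzor',
     'top': '', 'user': 'SoxBot'},
    {'ns': 3, 'revid': 290942678, 'title': 'User talk:Turco85',
     'top': '', 'user': 'SoxBot'},
]

counts = count_by_namespace(sample_contribs)
top_titles = titles_with_top_flag(sample_contribs)
print(counts[3], top_titles)
```

Keep in mind that a prolific bot has a very long contribution list, so against a live site you'd want to cap the iteration (for example with itertools.islice) before counting.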
- MediaWiki timestamp strings can be generated using "%Y-%m-%dT%H:%M:%SZ" as the format string with Python's datetime.strftime. All timestamps must be in UTC.
- You can pass a combination of parameters to page.revisions() to get revisions the way you want them. You can even skip the dates and call it with startid or endid set to any revision number (see revid in the output), to retrieve revisions before or after that one.
- To look at what parameters the page.revisions() and site.usercontributions() functions take, use Python's built-in help browser:
>>> help(page.revisions)
>>> help(site.usercontributions)
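To make the note on timestamps concrete, here's a small sketch that builds MediaWiki timestamp strings with strftime and converts to UTC first. The names MW_FORMAT, to_mw_timestamp and day_bounds are my own, chosen for illustration; only the format string comes from the note above.

```python
from datetime import datetime, timedelta, timezone

# Format string from the note above.
MW_FORMAT = '%Y-%m-%dT%H:%M:%SZ'


def to_mw_timestamp(dt):
    """Format a timezone-aware datetime as a MediaWiki timestamp in UTC."""
    if dt.tzinfo is None:
        # A naive datetime carries no zone, so we can't know its UTC value.
        raise ValueError('expected a timezone-aware datetime')
    return dt.astimezone(timezone.utc).strftime(MW_FORMAT)


def day_bounds(year, month, day):
    """Return (start, end) timestamp strings covering one UTC day."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    end = start + timedelta(days=1) - timedelta(seconds=1)
    return to_mw_timestamp(start), to_mw_timestamp(end)


start, end = day_bounds(2009, 5, 19)
print(start, end)

# A local time is converted to UTC before formatting (UTC+05:30 here).
ist = timezone(timedelta(hours=5, minutes=30))
local = to_mw_timestamp(datetime(2009, 5, 19, 20, 3, tzinfo=ist))
print(local)
```

The two strings from day_bounds are the same start and end values used in the date-range query earlier, so they can be passed straight to page.revisions().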
This page was originally written by Kiran Jonnalagadda (@jace) on his blog, at 20:03, 19 May 2009.