kevincrane/wikimap

Author: Kevin Crane

Playing with giant datasets. An attempt to map all of the knowledge in the world by
drawing connections between Wikipedia topics. Will identify which topics are the
most influential in general knowledge, as well as what each topic is related to by
links. Hopefully I'll learn how to make a visualization too, that'd be awesome.


Files:
    - wikiLinkExtractor.py      : Reads a wiki XML file, iterates through each
                                    article, extracts every link from the article,
                                    then stores the page metadata in IndexModel,
                                    and the links in LinksModel.
    - page_parser.py            : Taken from https://github.com/gareth-lloyd/visualizing-events
                                    Creates an XML parser optimized for long files
                                    and designed to store wiki articles
    - articleLinkParser.py      : Receives a WikiPage object containing the article
                                    text and parses it, extracting all of the links
                                    inside (denoted by: [[link]]). It's not perfect,
                                    but it does a pretty great job of pulling out
                                    links. (A rough sketch of the link pattern
                                    appears just after this list.)
    - model
        - IndexModel.py         : Contains all information and methods for the
                                    Index DB table
        - LinksModel.py         : Contains all information and methods for the
                                    Links DB table
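
For a feel of what articleLinkParser.py is dealing with, here's a minimal
sketch (not the actual parser; the function name and regex are just
illustrative) that pulls '[[link]]' targets out of article text, keeping
only the target before any '|' display text:

    import re

    # Matches the target of [[Target]] or [[Target|display text]] wiki links.
    # Simplified on purpose; the real parser handles more edge cases.
    WIKI_LINK_RE = re.compile(r'\[\[([^\]|]+)(?:\|[^\]]*)?\]\]')

    def extract_links(article_text):
        """Return the link targets found in a chunk of wiki markup."""
        return [m.strip() for m in WIKI_LINK_RE.findall(article_text)]

    print(extract_links("See [[Python (programming language)|Python]] and [[Guido van Rossum]]."))
    # -> ['Python (programming language)', 'Guido van Rossum']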


To Run:
1. First, make wikiLinkExtractor.py executable

$ chmod +x wikiLinkExtractor.py

2. Make sure you're all set up for this. There are a few settings you
   can tweak to suit yourself, or you can just leave everything at the
   defaults and it'll work. The only thing you have to do if you use
   the default settings is create a new MySQL database called 'wikimap'
   (a quick connection sanity check is sketched at the end of this step).

> CREATE DATABASE wikimap;

   - If you want to make changes, go inside the 'main' method of
       wikiLinkExtractor.py and edit the arguments used when instantiating
       a new WikiLinkExtractor. Alternatively, you can change the defaults
       directly in IndexModel.py and LinksModel.py.
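
   If you want to double-check that the database is reachable before the
   long run, a quick sanity check from Python looks roughly like this
   (using the MySQLdb driver; the host/user/password values are
   placeholders, so swap in whatever your MySQL server actually uses):

    import MySQLdb

    # Placeholder credentials -- use whatever your MySQL server is set up with.
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', db='wikimap')
    cursor = conn.cursor()
    cursor.execute("SHOW TABLES")
    print(cursor.fetchall())   # empty until '--reset' creates the tables
    conn.close()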

3. Next, make sure you know how to use the program so you don't blow
   anything up.

$ ./wikiLinkExtractor.py --help

4. Now comes the fun part. You're going to point the program to your
   giant Wikipedia XML file (~40GB), specify the names of the tables you
   want to store the metadata and links to, and then you just wait. For
   like, 40 hours. Luckily I designed it so it could pick up where it
   left off, so feel free to Ctrl-C and restart it whenever you like.

$ ./wikiLinkExtractor.py -f enwiki-20130102-pages-articles.xml -i wiki_index -l wiki_links

   - This program will iterate through the Wiki XML file you pointed to,
       read each article, and extract the links from the text. It will then
       store the article metadata into the Index table you specified, and
       the links contained within that article in the Links table. (A rough
       sketch of the streaming-parse idea follows these notes.)
   - The first time you run it, you should add the '--reset' flag as well.
       This will create new tables for you and get them all set up. However,
       DO NOT use this flag after you've already started indexing everything.
       It will drop your tables on the floor, erasing all the time you spent
       waiting.
   - Try prepending the 'time' command to the command above if you're
       curious exactly how long you've spent waiting.
   * As you can see, you'll need to point at an XML file from Wikipedia
       containing all of the information in the world.
       http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
       (tested with data dump from January 2nd, 2013)
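
   To give a feel for how a ~40GB dump can be read without loading it all
   into memory, here is a rough sketch of the streaming-parse idea. This
   is the generic ElementTree iterparse pattern, not a copy of
   page_parser.py, and the file name is just the dump used above:

    import xml.etree.cElementTree as etree

    def iter_articles(xml_path):
        """Yield (title, text) pairs from a Wikipedia XML dump, one page at a time."""
        title, text = None, None
        for event, elem in etree.iterparse(xml_path, events=('end',)):
            tag = elem.tag.rsplit('}', 1)[-1]   # drop the XML namespace prefix
            if tag == 'title':
                title = elem.text
            elif tag == 'text':
                text = elem.text or ''
            elif tag == 'page':
                yield title, text
                elem.clear()                    # free memory -- crucial for a ~40GB file
        # (a fuller version would also clear the cleared pages' parent element)

    for title, text in iter_articles('enwiki-20130102-pages-articles.xml'):
        print(title)
        break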

5. One last step. There's a bunch of waiting in this too, but nowhere near
   as much as the first time. Just call this command and you're free to
   leave and go surf Facebook or Hacker News.

$ ./wikiLinkExtractor.py -i wiki_index -l wiki_links --linkto

   - This will start by spending a good amount of time grouping every single
       row in the Links table by article and counting how many times a
       different page linked to it. This is the most time-consuming operation
       of this whole step, as it needs to sort and group something like
       24,000,000 Wikipedia article titles, each containing ~140 links. (A
       hedged sketch of what that grouping query looks like follows these
       notes.)
   - Once this ordeal of sorting is done, it will store these link counts
       into the metadata of the Index table.
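
   The grouping itself boils down to a single aggregate query. A rough,
   hedged sketch of what it might look like from Python (the table and
   column names 'wiki_links' and 'link_title' are illustrative guesses;
   check LinksModel.py for the real schema):

    import MySQLdb

    conn = MySQLdb.connect(host='localhost', user='root', passwd='', db='wikimap')
    cursor = conn.cursor()

    # Count how many rows point at each link target.
    cursor.execute("""
        SELECT link_title, COUNT(*) AS linked_from
        FROM wiki_links
        GROUP BY link_title
        ORDER BY linked_from DESC
        LIMIT 10
    """)
    for title, count in cursor.fetchall():
        print('%s is linked from %d pages' % (title, count))
    conn.close()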

6. You did it, you're done! You should now have 2 MySQL database tables:
   1. The Index table, containing:
      - Article title
      - Wiki Id number
      - Number of pages it linked to
      - Number of pages that linked to it (how popular it is)
   2. The Links table, containing:
      - The Wiki Id of a link's origin
      - The article title of a link's destination

   - This will all be used to serve up a (hopefully) badass web app that
     will let you map the connections between topics of knowledge. (A
     hedged example of the kind of query it could run is sketched below.)
   - If it doesn't turn out THAT incredibly, it will still end up being a
     hugely useful application for building up web dev expertise.
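
   As a taste of what that web app could query, here's a hedged example
   that pulls everything one article links out to by joining the two
   tables (the table and column names are guesses at the schema in the
   model/ files, not the real ones):

    import MySQLdb

    conn = MySQLdb.connect(host='localhost', user='root', passwd='', db='wikimap')
    cursor = conn.cursor()

    # Placeholder table/column names -- the real schema lives in model/.
    # Finds every article that 'Python (programming language)' links out to.
    cursor.execute("""
        SELECT l.link_title
        FROM wiki_links AS l
        JOIN wiki_index AS i ON i.wiki_id = l.from_id
        WHERE i.title = %s
    """, ('Python (programming language)',))
    for (link_title,) in cursor.fetchall():
        print(link_title)
    conn.close()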


Developer Information:
- Name: Kevin Crane
- Contact: kevincrane@gmail.com
- OS: Ubuntu 12.04 LTS
- Language: Python 2.7
- IDE: Pycharm 2.7
- System: Dell Vostro v131
