Framartin/wikipedia_network_analysis


wikipedia_network_analysis

Statistical Network Analysis of the Field of Statistics on Wikipedia In English

Main statistical categories visualization

Introduction

The Python script that extracts the edges of our graphs is inspired by brianckeegan's Wikipedia-Network-Analysis Python notebook, available on GitHub under the MIT license.

A statistical analysis in French is available in the report folder.

Data

We build a directed graph from the links between Wikipedia articles in the field of Statistics (the script can quite easily be adapted to another field).
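As a minimal sketch of this construction (not the repository's script), the function below keeps only the links whose source and target both belong to the set of articles of interest. The names `build_edges`, `outlinks` and `vertices` are hypothetical, and the sample titles are made up for illustration.

```python
def build_edges(outlinks, vertices):
    """Return directed edges (source, target) restricted to `vertices`.

    `outlinks` maps an article title to the titles it links to;
    both arguments are hypothetical inputs used for illustration.
    """
    edges = []
    for source in sorted(outlinks):
        if source not in vertices:
            continue  # ignore pages outside the field
        for target in outlinks[source]:
            if target in vertices:
                edges.append((source, target))
    return edges


# Made-up example: "Paris" is outside the field, so its links are dropped.
outlinks = {
    "Median": ["Mean", "Paris"],
    "Mean": ["Median"],
    "Paris": ["Mean"],
}
vertices = {"Median", "Mean"}
print(build_edges(outlinks, vertices))
```

Because each edge is stored as an ordered pair, a link from A to B and a link from B to A remain two distinct edges, which is what makes the graph directed.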

There are two ways to collect the pages related to Statistics:

  1. Using Category:Statistics
  2. Using lists of articles about statistics (List_of_statistics_articles and Outline_of_statistics) featured in the Portal:Statistics

See Extract_links_from_API.py for more details. We strongly recommend using the second solution.
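To make the difference between the two solutions concrete, here is a hedged sketch of the MediaWiki Action API queries each one roughly corresponds to. It only builds the request URLs and sends nothing. It is written for Python 3 and does not use wikitools, so it is not how Extract_links_from_API.py works; the function names are hypothetical.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def category_members_url(category):
    """Solution 1: list the pages that belong to a category."""
    params = [
        ("action", "query"),
        ("list", "categorymembers"),
        ("cmtitle", category),
        ("cmlimit", "max"),
        ("format", "json"),
    ]
    return API_URL + "?" + urlencode(params)


def page_links_url(title):
    """Solution 2: list the articles linked from a list or outline page."""
    params = [
        ("action", "query"),
        ("prop", "links"),
        ("titles", title),
        ("plnamespace", "0"),  # main (article) namespace only
        ("pllimit", "max"),
        ("format", "json"),
    ]
    return API_URL + "?" + urlencode(params)


print(category_members_url("Category:Statistics"))
print(page_links_url("List_of_statistics_articles"))
```

Both queries return results in batches, so a real client would need to follow the API's continuation values until the listing is complete.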

  • The data in edges1.csv and vertex1.csv were extracted on 30/12/2014 using the first solution.
  • The data in edges2.csv and vertex2.csv were extracted on 27/12/2014 using the second solution.
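For a quick look at an edge list without R, a sketch like the following may help. It assumes a two-column layout (source, target) with a header row, which is an assumption and may not match the actual files. It reads an in-memory sample rather than edges1.csv or edges2.csv, and it is written for Python 3.

```python
import csv
import io
from collections import Counter


def read_edges(handle):
    """Read (source, target) pairs, skipping an assumed header row."""
    reader = csv.reader(handle)
    next(reader, None)  # assumption: the first row is a header
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def degree_counts(edges):
    """Count incoming and outgoing links per article."""
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree, out_degree


# Made-up sample standing in for a real edge file.
sample = io.StringIO("source,target\nMedian,Mean\nMode,Mean\nMean,Median\n")
edges = read_edges(sample)
in_degree, out_degree = degree_counts(edges)
print(in_degree.most_common(1))  # → [('Mean', 2)]
```

To try it on a real file, replace the `io.StringIO` sample with `open("edges2.csv", newline="")` and adjust the column positions to the file's actual layout.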

If you update this data, please consider donating to Wikimedia, because the extraction is quite resource-intensive for the MediaWiki API.

Warnings

Please consider the following problem pointed out by brianckeegan:

Wikipedia articles also contain templates (https://en.wikipedia.org/wiki/Help:Template), which create lots of "redundant" links between articles that share templates, even though these links don't appear in the body of the article itself. You'll need to do much more advanced text parsing of wiki-markup to actually get the links in the body of an article.
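One rough way to approximate body-only links, sketched here under strong simplifying assumptions, is to work on an article's raw wikitext: remove every `{{...}}` template call (including nested ones), then collect the remaining `[[...]]` targets. This is not part of the repository's scripts. The function names are hypothetical, and the regular expressions ignore many wiki-markup cases (tables, references, comments, article titles that contain a colon).

```python
import re

# Innermost template call: "{{" ... "}}" with no braces inside.
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")
# Wikilink: [[Title]], [[Title|label]] or [[Title#Section|label]].
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def strip_templates(wikitext):
    """Remove template calls, peeling nested ones from the inside out."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = TEMPLATE_RE.sub("", wikitext)
    return wikitext


def body_links(wikitext):
    """Return the distinct link targets left once templates are removed."""
    titles = []
    for match in LINK_RE.finditer(strip_templates(wikitext)):
        title = match.group(1).strip()
        if not title or ":" in title:
            continue  # crude filter for File:, Category:, etc.
        title = title[0].upper() + title[1:]  # first letter is case-insensitive
        if title not in titles:
            titles.append(title)
    return titles


# Made-up wikitext: the links inside the infobox are discarded.
sample = (
    "{{Infobox|name=[[Foo]]|data={{nested|[[Bar]]}}}}\n"
    "The [[median]] is a [[Summary statistics|summary statistic]]; "
    "see also [[Mean#Arithmetic mean|the mean]] and [[median]].\n"
    "[[Category:Statistics]]\n"
    "{{Statistics}}"
)
print(body_links(sample))  # → ['Median', 'Summary statistics', 'Mean']
```

In raw wikitext a navigation template appears only as a call such as `{{Statistics}}`, so the links it generates on the rendered page never reach the result; stripping the call additionally drops links passed as template arguments.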

Requirements

  • Python 2.7
  • Anaconda
  • wikitools, which you can install with pip: pip install wikitools
  • R and its package igraph

License

The Python and R scripts are released under the MIT License.