willbeason/wikipedia

This repository is a collection of code I use for analyzing Wikipedia. The main idea is to document the methods behind the various analyses I've done.

Generally, code starts as commands under cmd/ and gets migrated to pkg/ once I figure out how to make it more general.

I make no guarantees about the interoperability of the code here, or that output will remain consistent over time. For now, these are fun, non-professional projects.

Many of these commands will change over time as I learn more about the Wikipedia corpus. It is a large, complex data set, and there aren't (yet) good resources on the kinds of quirks these libraries need to account for.

The Data

Commands in this repository mainly operate on either the standard pages-articles-multistream Wikipedia dumps or on Badger databases constructed by the commands in this repository. So (for now) you can't really look at individual articles unless you know their Wikipedia Page ID (you can get this by clicking "Page information" on any page on Wikipedia.org).
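As a minimal sketch of what a single-article lookup in one of these Badger databases might look like: the database path, the Badger major version (v3 here), and the key encoding (a decimal Page ID) are all assumptions for illustration, not necessarily what this repository's code does.

```go
package main

import (
	"fmt"
	"log"
	"strconv"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Path and key encoding are assumptions for illustration only.
	db, err := badger.Open(badger.DefaultOptions("out/pages.badger").WithReadOnly(true))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const pageID = 12 // an example Wikipedia Page ID
	err = db.View(func(txn *badger.Txn) error {
		// Assumes keys are decimal Page IDs; the real encoding may differ.
		item, err := txn.Get([]byte(strconv.Itoa(pageID)))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			fmt.Printf("page %d: %d bytes\n", pageID, len(val))
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```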

The data format is optimized for streaming the entirety of Wikipedia. I've offloaded concerns like indexing articles and deciding how large to make files onto Badger, as these quickly became headaches I didn't want to deal with.
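The streaming access pattern the format is optimized for is a full scan over every stored page. A sketch of that scan with a Badger iterator, treating each value as opaque bytes (again assuming the hypothetical path and Badger v3 from the sketch above):

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

// countPages performs a full scan over every stored page, the access
// pattern this data layout is optimized for.
func countPages(db *badger.DB) (int, error) {
	count := 0
	err := db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			err := it.Item().Value(func(val []byte) error {
				count++ // replace with real per-page processing of val
				return nil
			})
			if err != nil {
				return err
			}
		}
		return nil
	})
	return count, err
}

func main() {
	// Path is an assumption for illustration only.
	db, err := badger.Open(badger.DefaultOptions("out/pages.badger").WithReadOnly(true))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	n, err := countPages(db)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pages scanned:", n)
}
```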

The Commands

A brief description of the commands so far, roughly in the order you'd use them.

extract-wikipedia operates on the pages-articles-multistream dump of Wikipedia, extracting each element into its own file. As of this writing, extraction results in about 213,000 files.
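To give a rough idea of what reading this dump involves (not necessarily how extract-wikipedia itself does it), Go's encoding/xml can stream the MediaWiki <page> elements out of a decompressed dump one at a time; the filename below is a placeholder.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"os"
)

// Page holds the fields of a MediaWiki <page> element used here.
type Page struct {
	Title string `xml:"title"`
	ID    int64  `xml:"id"`
	Text  string `xml:"revision>text"`
}

func main() {
	// Assumes the dump has already been decompressed to plain XML;
	// the filename is a placeholder.
	f, err := os.Open("enwiki-pages-articles-multistream.xml")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	dec := xml.NewDecoder(f)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		se, ok := tok.(xml.StartElement)
		if !ok || se.Name.Local != "page" {
			continue
		}
		var p Page
		if err := dec.DecodeElement(&p, &se); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%d\t%s\t(%d bytes)\n", p.ID, p.Title, len(p.Text))
	}
}
```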

clean-wikipedia operates on the output of extract-wikipedia, removing noisy markup such as tables and various elements that aren't shown to readers. This produces an intermediate data set that is essentially "the text of Wikipedia as a user would see it". I keep this because I'm still updating normalize-wikipedia, and extract-wikipedia takes annoyingly long, so it's worth saving the output of this intermediate step.
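To give a flavor of what removing noisy markup means in practice, here is a rough, regex-based sketch that strips tables, references, and comments from wikitext. It is not clean-wikipedia's implementation, and real wikitext has many cases these patterns don't cover.

```go
package main

import (
	"fmt"
	"regexp"
)

// Rough approximations only: real wikitext has nested tables, templates,
// and other markup these patterns don't handle.
var (
	tableRe   = regexp.MustCompile(`(?s)\{\|.*?\|\}`)                      // {| ... |} tables
	refRe     = regexp.MustCompile(`(?s)<ref[^>/]*/>|<ref[^>]*>.*?</ref>`) // <ref> elements
	commentRe = regexp.MustCompile(`(?s)<!--.*?-->`)                       // HTML comments
)

func cleanWikitext(text string) string {
	text = refRe.ReplaceAllString(text, "")
	text = commentRe.ReplaceAllString(text, "")
	text = tableRe.ReplaceAllString(text, "")
	return text
}

func main() {
	raw := "Intro sentence.<ref>a citation</ref>\n{| class=\"wikitable\"\n| cell\n|}\n<!-- hidden note -->\nClosing sentence."
	fmt.Println(cleanWikitext(raw))
}
```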

normalize-wikipedia operates on the output of clean-wikipedia, performing operations like lowercasing everything, identifying numbers and dates, removing tables, and the like. This is the data set I work with directly.
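A minimal sketch of this kind of normalization; the placeholder tokens and the patterns below are illustrative and are not what normalize-wikipedia actually emits.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Illustrative patterns; a real normalizer recognizes many more forms.
var (
	dateRe   = regexp.MustCompile(`\b\d{1,2} (january|february|march|april|may|june|july|august|september|october|november|december) \d{4}\b`)
	numberRe = regexp.MustCompile(`\b\d+(?:[.,]\d+)*\b`)
)

func normalize(text string) string {
	text = strings.ToLower(text)
	text = dateRe.ReplaceAllString(text, "<date>")     // hypothetical placeholder token
	text = numberRe.ReplaceAllString(text, "<number>") // hypothetical placeholder token
	return text
}

func main() {
	fmt.Println(normalize("The treaty was signed on 28 June 1919 by 32 countries."))
}
```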

Warnings

Some commands just fail: they panic inside Badger's Orchestrate call. I don't have a good explanation for this. Either there's a bug in Badger when streaming data across 32 threads, or I'm using it incorrectly. In any case, rerunning the command several times seems to work fine; each run makes more progress, and eventually it succeeds. For now, commands take on the order of a couple of minutes for me to rerun, so it isn't worth my time to debug.

Memory consumption is essentially unbounded since I haven't touched any of the Badger memory settings. This is fine on my machine, which has 64 GB of memory and can hold most (or in some cases all) of Wikipedia in memory at once, but you may experience slowdowns if you don't modify the options set in pkg/db/process.go.
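If you do want to bound memory use, the sketch below shows the kind of tuning that could go where those options are set. It uses Badger v3/v4 option names (earlier Badger versions name these differently), and the values are examples rather than recommendations.

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Example values only; tune for your machine. The path is a placeholder.
	opts := badger.DefaultOptions("out/pages.badger").
		WithMemTableSize(64 << 20).     // smaller in-memory write buffers
		WithNumMemtables(2).            // fewer memtables held at once
		WithBlockCacheSize(256 << 20).  // cap the block cache
		WithValueLogFileSize(256 << 20) // smaller value log segments

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	log.Println("opened with bounded memory settings")
}
```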
