Wikipedia Ngrams

A Kotlin project which extracts ngram counts from Wikipedia data dumps.

Download

Download the latest jar from releases.

You can also clone the repository and build it with Maven:

$ git clone https://github.com/TomerAberbach/wikipedia-ngrams.git
$ cd wikipedia-ngrams
$ mvn package

A fat jar called wikipedia-ngrams-VERSION-jar-with-dependencies.jar will be in a newly created target directory.

Usage

DISCLAIMER: Many of these commands will take a very long time to run.

Download the latest Wikipedia data dump using wget:

$ wget -np -nd -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Or using axel:

$ axel --num-connections=3 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

To speed up the download you should replace https://dumps.wikimedia.org with the mirror closest to you.

Once downloaded, decompress the bzip2 archive using a tool like lbzip2 and feed the resulting enwiki-latest-pages-articles.xml file into WikiExtractor:

$ python3 WikiExtractor.py --no_templates --json enwiki-latest-pages-articles.xml

This will output a large directory tree rooted at a directory named text.

Finally, run wikipedia-ngrams.jar with the desired ngram "n" (2 in this example) and the path to the directory output by WikiExtractor:

$ java -jar wikipedia-ngrams.jar 2 text
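
For intuition about what the "n" argument controls, the sketch below counts word n-grams in a tiny sample file with awk. It is not the project's implementation: the whitespace tokenization, the one-sentence-per-line input, the tab-separated output, and the names count_ngrams, sample-contexts.txt, and sample-counts.txt are all assumptions made for illustration.

```shell
# Illustrative only: count word n-grams per line with awk.
# ASSUMPTIONS (not taken from the project): whitespace tokenization,
# one sentence per line, and "<ngram><TAB><count>" output lines.
count_ngrams() {
  n="$1"
  file="$2"
  awk -v n="$n" '
    {
      # Slide a window of n words across the current line.
      for (i = 1; i + n - 1 <= NF; i++) {
        gram = $i
        for (j = 1; j < n; j++) gram = gram " " $(i + j)
        counts[gram]++
      }
    }
    END {
      # Emit each distinct n-gram with its count (order is unspecified).
      for (gram in counts) printf "%s\t%d\n", gram, counts[gram]
    }
  ' "$file"
}

# Made-up sample input with two short "sentences".
printf 'the cat sat\nthe cat ran\n' > sample-contexts.txt
count_ngrams 2 sample-contexts.txt > sample-counts.txt
```

With n = 2 the sample yields three distinct bigrams, and "the cat" is the only one that occurs twice. N-grams never span a line break in this sketch, which is why a sentence-per-line input matters.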

Note that you may need to increase the JVM's maximum heap size and/or disable the GC overhead limit.
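
As a rough sketch of how those JVM settings might be passed, the snippet below uses the standard HotSpot options -Xmx (maximum heap size) and -XX:-UseGCOverheadLimit (turns off the "GC overhead limit exceeded" check on collectors that enforce it). The 8g value and the JAVA_OPTS and JAR variable names are placeholders chosen here, not recommendations from the project.

```shell
# ASSUMPTION: 8g is only a placeholder; pick a value that fits your machine.
JAVA_OPTS="-Xmx8g -XX:-UseGCOverheadLimit"
JAR="wikipedia-ngrams.jar"

if [ -f "$JAR" ]; then
  # JAVA_OPTS is intentionally unquoted so it splits into separate flags.
  java $JAVA_OPTS -jar "$JAR" 2 text
else
  # Dry run when the jar is not in the current directory.
  echo "would run: java $JAVA_OPTS -jar $JAR 2 text"
fi
```

JVM options must come before -jar; anything after the jar path is passed to the program as its own arguments.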

The run writes contexts.txt and 2-grams.txt to an out directory. contexts.txt caches the "sentences" in the Wikipedia data dump. To use this cache in your next run (with n = 3 for example), run the following command:

$ java -jar wikipedia-ngrams.jar 3 out/contexts.txt

The output files are not sorted. Use a command-line tool like sort to order them.

Note that an OutOfMemoryError is not considered a bug in this project. The burden is on the user to allocate enough heap space and to have enough RAM (consider allocating a larger swap file).
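
One way the sorting step might look is sketched below. The layout of the n-gram files is not described here, so the sketch assumes each line is an n-gram, a tab, and a count; the helper name sort_ngrams_by_count and the sample file names are hypothetical. Adjust the -t and -k options if the real 2-grams.txt uses a different layout.

```shell
# Hypothetical helper: sort an n-gram file by count, highest first.
# ASSUMPTION: each line is "<ngram><TAB><count>".
sort_ngrams_by_count() {
  # LC_ALL=C gives fast byte-wise comparison; ties fall back to the n-gram.
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 "$1"
}

# Tiny made-up sample standing in for out/2-grams.txt.
printf 'of the\t3\nin the\t7\nto be\t5\n' > sample-2-grams.txt
sort_ngrams_by_count sample-2-grams.txt > sample-2-grams.sorted.txt
```

The tab delimiter keeps multi-word n-grams such as "in the" together as a single sort field. For very large files, GNU sort's -S (buffer size) and --parallel options may help.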

Dependencies

License

MIT © Tomer Aberbach