IndexWikipedia

A simple utility to index Wikipedia dumps using Lucene.

This tool can be used to quickly create an index. It is then expected that a programmer will write some code to use the index. This project does not aim to build an end-user index.

It is useful as part of research projects.

Usage:

Install Java (JDK) if needed
Install Maven if needed
Grab your Wikipedia dump: you can quickly grab part of the dump by typing a command like wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 (sorry, the database dumps are not at a fixed location, so we cannot provide a precise URI). Be mindful that there are many types of Wikipedia dumps and not all of them contain the articles: when in doubt, read the documentation.
mvn compile
Create a directory where your index will reside, such as WikipediaIndex. E.g., you might be able to type mkdir WikipediaIndex. Be mindful not to reuse the same directory for different projects or different Lucene versions.
mvn exec:java -Dexec.args="yourdump someoutputdirectory"

Actual example:

git clone https://github.com/lemire/IndexWikipedia.git
cd IndexWikipedia
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2
mkdir Index
mvn compile
mvn exec:java -Dexec.args="enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 Index"

Note that this precise example may fail unless you adjust the URI https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 since Wikipedia dumps are not guaranteed to stay at the same URI.

The documents have title, name, docid and body fields, all of which are stored with the index.

To see how you might then query the index, see the class file 'Query.java' for a working example.

Extracting word-frequency pairs

There is also a poorly named utility, me.lemire.lucene.CreateFreqSortedDictionary, that extracts all word-frequency pairs. It is deliberately undocumented for now.
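
To make the idea of word-frequency pairs concrete, here is a minimal, self-contained sketch in plain Java. It is not the code in me.lemire.lucene.CreateFreqSortedDictionary and does not use Lucene; the class name WordFrequencySketch, the method freqSorted and the tokenization rule (lowercase, split on anything that is not a letter or digit) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: counts words in a string and lists
// (word, frequency) pairs from most to least frequent.
public class WordFrequencySketch {

    public static List<Map.Entry<String, Integer>> freqSorted(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed tokenization: lowercase, split on runs of non-letter, non-digit characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            counts.merge(token, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>(counts.entrySet());
        // Sort by descending frequency, breaking ties alphabetically.
        pairs.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return pairs;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : freqSorted("the cat and the hat and the bat")) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The real utility presumably reads its terms from the index rather than from a string, so treat this only as a picture of the kind of output to expect.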
