
WikipediaExport
MIT License

Convert Wikipedia XML dump files to JSON or text files

Text corpora are required for algorithm design and benchmarking in information retrieval, machine learning, and natural language processing.
The Wikipedia data is ideal for this purpose because it is large (7 million documents in the English Wikipedia) and available in many languages.

Unfortunately, the XML format of the Wikipedia dump is somewhat proprietary and not directly usable by most tools. WikipediaExport solves this problem by converting the XML dump to plain text or JSON, two formats that can be easily consumed by many tools.

Download Wikipedia dump files from:
http://dumps.wikimedia.org/enwiki/latest/
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

The dump is distributed as a .bz2 archive; the usage examples below expect the decompressed .xml file.

Usage

Export to text file:
dotnet WikipediaExport.dll inputpath="C:\data\wikipedia\enwiki-latest-pages-articles.xml" format=text

Export to JSON file:
dotnet WikipediaExport.dll inputpath="C:\data\wikipedia\enwiki-latest-pages-articles.xml" format=json

Output file format

Text file

Five consecutive lines constitute a single document (see the parsing sketch after this list):
title
content
domain
url
docDate (Unix time: milliseconds since the beginning of 1970)
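
For consuming the text output, here is a minimal C# sketch (not part of WikipediaExport itself) that reads the five-line-per-document layout described above; the output file path is a placeholder.

using System;
using System.IO;

class ReadTextCorpus
{
    static void Main()
    {
        // Placeholder path: point this at the file produced with format=text.
        using var reader = new StreamReader(@"C:\data\wikipedia\wiki.txt");

        string? title;
        while ((title = reader.ReadLine()) != null)
        {
            // Five consecutive lines per document: title, content, domain, url, docDate.
            string content = reader.ReadLine() ?? "";
            string domain  = reader.ReadLine() ?? "";
            string url     = reader.ReadLine() ?? "";
            string docDate = reader.ReadLine() ?? "0";

            // docDate is Unix time in milliseconds since the beginning of 1970.
            DateTimeOffset date = DateTimeOffset.FromUnixTimeMilliseconds(long.Parse(docDate));

            Console.WriteLine($"{title} | {date:yyyy-MM-dd} | {url}");
        }
    }
}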

JSON file

Each document contains the following fields (see the deserialization sketch after this list):
title
content (all "\r" have been replaced with " ")
domain
url
docDate (Unix time: milliseconds since the beginning of 1970)
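
The JSON output can be deserialized with System.Text.Json. A minimal sketch, assuming one JSON object per line; if the exporter writes a single JSON array instead, deserialize the whole file into a List<WikiDoc>. The record name, property casing, and file path are assumptions, not part of the tool.

using System;
using System.IO;
using System.Text.Json;

// Fields as listed above; adjust the casing if the actual output differs.
public record WikiDoc(string title, string content, string domain, string url, long docDate);

class ReadJsonCorpus
{
    static void Main()
    {
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

        // Placeholder path: point this at the file produced with format=json.
        foreach (string line in File.ReadLines(@"C:\data\wikipedia\wiki.json"))
        {
            if (string.IsNullOrWhiteSpace(line)) continue;

            WikiDoc? doc = JsonSerializer.Deserialize<WikiDoc>(line, options);
            if (doc == null) continue;

            // docDate is Unix time in milliseconds since the beginning of 1970.
            DateTimeOffset date = DateTimeOffset.FromUnixTimeMilliseconds(doc.docDate);
            Console.WriteLine($"{doc.title} | {date:yyyy-MM-dd} | {doc.url}");
        }
    }
}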

Application

WikipediaExport is used to generate the input data for LuceneBench, a benchmark program that compares the performance of Lucene (a search-engine library written in Java, powering the Solr and Elasticsearch search platforms) and SeekStorm (a high-performance search platform written in C#, powering the SeekStorm Search as a Service).


WikipediaExport is contributed by SeekStorm, the high-performance Search as a Service & search API.