                                Wikiprep
                               ==========
                              Zemanta fork
                       with MediaWiki::DumpFile
         MediaWiki syntax preprocessor and information extractor

Attention!
==========

This fork uses the MediaWiki::DumpFile module instead of the
Parse::MediaWikiDump module used by the original Wikiprep.

MediaWiki::DumpFile is itself fairly old: the latest dump export schema
version it has been tested with is 0.5. Judging by the schema history at
https://github.com/wikimedia/mediawiki/commits/master/docs/export-0.6.xsd
Wikiprep can be expected to work well only with dumps created before
November 2011.

One should either rewrite MediaWiki::DumpFile or find another Wikiprep-like
tool. More info: #2
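
A quick, rough way to check which export schema version a given dump
declares (the root <mediawiki> element carries the schema URL in its xmlns
attribute; bzcat, head, grep and sort are assumed to be available):

$ bzcat enwiki-20090306-pages-articles.xml.bz2 | head -n 5 | \
  grep -o 'export-0\.[0-9]*' | sort -u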


Content
=======

1.    Introduction
1.1     Relation to the original Wikiprep
1.2     Current status

2.    Requirements
2.1     Software
2.2     Hardware

3.    Usage
3.1     Installation
3.2     Running
3.3     Parallel processing
3.4     Parsing MediaWiki dumps other than English Wikipedia

4.    Tools

5.    License

6.    Hacking



1. Introduction
===============

Wikiprep is a Perl script that parses MediaWiki data dumps in XML format
and extracts useful information from them (MediaWiki is the software that
is best known for running Wikipedia and other Wikimedia foundation
projects).

MediaWiki uses a markup language (wiki syntax) that is optimized for easy
editing by human beings. It contains a lot of quirks, special cases and 
odd corners that help MediaWiki correctly display wiki pages, even when
they contain typing errors. This makes parsing this syntax with other
software highly non-trivial.

One goal of Wikiprep is to implement a parser that:

  o is as compatible with MediaWiki as possible, implementing as much
    functionality as is needed to achieve the other goals, and

  o is as fast as possible, so that the English Wikipedia dataset can be
    tracked as closely as possible (MediaWiki's PHP code is slow).

The other goal is to use that parser to extract various information from
the dump that is suitable for further processing and is stored in files
with simple syntax.


1.1 Relation to the original Wikiprep

Wikiprep was initially developed by Evgeniy Gabrilovich to aid his research.
Tomaz Solc adapted the script for use in semantic information extraction
from Wikipedia as part of the Zemanta web service.

This version of Wikiprep has undergone extensive modification so that it can
extract the information needed by Zemanta's engine.


1.2 Current status

Currently implemented MediaWiki functionality:

  o Templates (see the wikitext example after this list):

      - Named and positional parameters,
      - parameter defaults,
      - recursive inclusion to some degree (infinite recursion breaking is
        currently implemented in a way incompatible with MediaWiki) and
      - support for <noinclude>, <includeonly> and similar syntax.

  o Parser functions (currently #if, #ifeq, #language, #switch)

  o Magic words (currently urlencode, PAGENAME)

  o Redirects

  o Internal, external and interwiki links

  o Proper handling of <nowiki> and other pseudo-HTML tags

  o Disambiguation page recognition and special parsing 

  o MediaWiki compatible date handling

  o Stub page recognition

  o Table and math syntax blocks are recognized and removed from the final
    output

  o Related article identification
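
As a quick illustration, here is a small, entirely hypothetical wikitext
fragment of the kind this parser handles. The template body below uses a
named parameter with a positional fallback and an empty default, the #if
parser function, the PAGENAME magic word and <includeonly>/<noinclude>.
A page such as [[Template:Official website]] could contain:

    <includeonly>{{#if: {{{url|{{{1|}}}}}}
     | [{{{url|{{{1}}}}}} {{PAGENAME}} official website]
     | (no website given)
    }}</includeonly><noinclude>Template documentation goes here.</noinclude>

and an article would invoke it with either a named or a positional
parameter:

    {{Official website|url=http://example.com/}}
    {{Official website|http://example.com/}}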



2. Requirements
===============


2.1 Software

You need a recent version of Perl 5 compiled for a 64 bit architecture with
the following modules installed (names in parentheses are names of respective
Debian packages):

MediaWiki::DumpFile
Regexp::Common        (libregexp-common-perl)
Inline::C             (libinline-perl)
XML::Writer           (libxml-writer-perl)
BerkeleyDB            (libberkeleydb-perl)
Log::Handler          (liblog-handler-perl)

If you can't use Inline::C for some reason, run wikiprep with the -pureperl
flag and it will use the (roughly) equivalent pure Perl implementations.

If you want to process gzip or bzip2 compressed dumps you will need gzip and
bzip2 installed. Gzip is also required for the -compress option.

To run the unit tests you will also need the xmllint utility (part of
libxml2; Debian package libxml2-utils).
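
On a Debian or Ubuntu system, for example, the dependencies can be installed
roughly as follows (package names as listed above; MediaWiki::DumpFile comes
from CPAN here since it may not have a distribution package):

$ sudo apt-get install libregexp-common-perl libinline-perl \
    libxml-writer-perl libberkeleydb-perl liblog-handler-perl libxml2-utils
$ sudo cpan MediaWiki::DumpFile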


2.2 Hardware

The English Wikipedia is big and getting bigger, so the requirements below
slowly grow over time.

As of January 2011, Wikiprep output takes approximately 22 GB of disk space
(with -compress, -format composite and default logging). The debug log takes
20 GB or more.

The prescan phase requires a little over 4 GB of memory in a single Perl
process (hence the requirement for a 64-bit build of Perl). In the transform
phase Wikiprep requires 100 to 200 MB per Perl process (but 4 GB or more is
recommended for decent performance, since the OS caches the BDB tables).

On a dual 6-core 2.6 GHz AMD Opteron with 32 GB of RAM and 12 parallel
processes it takes approximately 16 hours to process the English Wikipedia
dump from 15 January 2011.



3. Usage
========


3.1 Installation

Run the following commands from the top of the Wikiprep distribution:

$ perl Makefile.PL
$ make
$ make test (optional)

And as root:

$ make install


3.2 Running

The most common way to start Wikiprep on an XML dump downloaded from
Wikimedia is:

$ wikiprep -format composite -compress -f enwiki-20090306-pages-articles.xml.bz2

This will produce a number of files in the same directory as the dump.


Alternatively, run the following to get a list of other available options:

$ wikiprep

To run regression tests included in the distribution, run:

$ make test


3.3 Parallel processing

Wikiprep is capable of using multiple CPUs; however, to do this you must
provide it with a split XML dump.

First split the XML dump using the splitwiki utility in the tools
subdirectory:

$ bzcat enwiki-20090306-pages-articles.xml.bz2 | \
  splitwiki 4 enwiki-20090306-pages-articles.xml

Then run Wikiprep on the split dump using the -parallel option:

$ perl wikiprep.pl -format composite -compress \
  -f enwiki-20090306-pages-articles.xml.0000

You only need to run Wikiprep once on a machine and it will automatically
split into as many parallel processes as there are parts of the dump
(identified by the .NNNN suffix).

Wikiprep processing is split into two sequential phases: "prescan", which
runs in a single process, and "transform", which can run in parallel. You
will therefore need to wait for the prescan to finish before you see
multiple wikiprep processes running on the system.

With some scripting it is possible to distribute the dump parts to multiple
machines and start Wikiprep separately on each machine, as sketched below.

It is also possible to run the prescan only once, on a single machine, and
then distribute the .db files it creates to the other machines, where only
the transform phase is started.
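
A minimal sketch of the simplest multi-machine variant, assuming a
hypothetical second host named node2 that already has Wikiprep installed
and a work/ directory prepared. Following the example above, wikiprep is
pointed at the lowest-numbered part present on each machine; the exact
distribution of parts is up to you:

# copy half of the dump parts to node2
$ scp enwiki-20090306-pages-articles.xml.000[23] node2:work/

# start Wikiprep locally on the parts that stayed here ...
$ wikiprep -format composite -compress \
  -f enwiki-20090306-pages-articles.xml.0000

# ... and remotely on the parts copied to node2
$ ssh node2 'cd work && wikiprep -format composite -compress \
    -f enwiki-20090306-pages-articles.xml.0002'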


3.4 Parsing MediaWiki dumps other than English Wikipedia

All language and site-specific settings are located in Perl modules under
lib/Wikiprep/Config. The configuration to use can be chosen via the -config
command line parameter (the default is Enwiki.pm).

The Wikiprep distribution includes configurations for Wikipedia in various
languages. These were contributed by Wikiprep users and may be outdated or
incomplete. At the moment the only configuration that is thoroughly tested
before each release is the English Wikipedia config. Patches are always
welcome.

To add support for a new language or MediaWiki installation, copy Enwiki.pm
to a new file (the convention is to use the same name that Wikimedia uses
when naming its dumps) and adjust the values within.
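
For example, a (hypothetical) French Wikipedia configuration named Frwiki,
matching the frwiki dump prefix, might be created and used roughly like
this; the exact value expected by -config may differ, so check wikiprep's
help output first:

$ cp lib/Wikiprep/Config/Enwiki.pm lib/Wikiprep/Config/Frwiki.pm
  (edit Frwiki.pm and adjust the language-specific values)
$ wikiprep -config Frwiki -format composite -compress \
  -f frwiki-20090306-pages-articles.xml.bz2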



4. Tools
========

There are a few tools in the tools/ directory that aim to make your life a
bit easier. Their use should be fairly obvious from the help message they
print when run without any command line arguments.

  o findtemplate.sh

    Finds templates that support a given named parameter. Good for finding
    all templates that support, for example, the "isbn" parameter.

  o getpage.py

    Uses the Special:Export feature of Wikipedia to download a single
    article and all templates it depends on. It then constructs a file
    resembling a MediaWiki XML dump that contains only these pages.

    Good for making testing datasets or researching why a specific page
    failed to parse properly.

  o samplewiki

    Takes a complete MediaWiki XML dump, draws a random sample of pages from
    it and creates a new (smaller) dump.

  o splitwiki

    Splits a MediaWiki XML dump into N equal parts. For example:

    $ bzcat enwiki-20090306-pages-articles.xml.bz2 | \
      splitwiki 4 enwiki-20090306-pages-articles.xml

    This will produce 4 files of roughly equal length in the current
    directory:

    $ ls
    enwiki-20090306-pages-articles.xml.0000
    enwiki-20090306-pages-articles.xml.0001
    enwiki-20090306-pages-articles.xml.0002
    enwiki-20090306-pages-articles.xml.0003

  o riffle

    Takes a complete MediaWiki XML dump and some articles downloaded using the
    Special:Export feature and inserts these articles into the dump.

    Good for keeping an XML dump up-to-date without having to re-download the
    whole thing.



5. License
==========

Copyright (C) 2007 Evgeniy Gabrilovich (gabr@cs.technion.ac.il)
Copyright (C) 2009 Tomaz Solc (tomaz.solc@tablix.org)

  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; either version 2 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the Free Software
  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA,
  or see <http://www.gnu.org/licenses/> and
  <http://www.fsf.org/licensing/licenses/info/GPLv2.html>


Some of the example files are copied from the English Wikipedia and are
copyright (C) by their respective authors. Text is available under the
terms of the GNU Free Documentation License.



6. Hacking
==========

The code is pretty well documented. Have a look inside first, then ask
on the mailing list.

If you need to customize Wikiprep for a specific language, copy
Wikiprep/Config/Enwiki.pm and change the fields there. You can then use your
new config with the -language option (see also section 3.4).

To run only a particular test case, run, for example:

$ perl t/cases.t t/cases/infobox.xml
