This repository has been archived by the owner on Apr 9, 2020. It is now read-only.

tequalsme/accumulo-wikisearch

Apache Accumulo Wikipedia Search Example

This project contains a sample application for ingesting and querying Wikipedia data.
 
Prerequisites
-------------
1. Accumulo, Hadoop, and ZooKeeper must be installed and running.
2. One or more Wikipedia dump files (http://dumps.wikimedia.org/backup-index.html) must be placed in an HDFS directory.
   Download the files whose link name ends in pages-articles.xml.bz2.
3. Though not strictly required, the ingest runs faster if the files are decompressed first:

   $ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml


INSTRUCTIONS
------------

    Configuration and Build
    -----------------------
    1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and edit its contents to specify your Accumulo information
       (For parallel ingest, instead copy ingest/conf/wikipedia_parallel.xml.example to ingest/conf/wikipedia.xml)
    2. Copy webapp/src/main/resources/app.properties.example to webapp/src/main/resources/app.properties and edit its contents
       as done in step 1.
    3. From the wikisearch directory, run mvn package
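
The copy steps above can be scripted. The following is a minimal sketch, not part of the project: prepare_config is a hypothetical helper that copies each .example file to its real name and leaves any existing file untouched. You still need to edit the copied files by hand before building.

```shell
#!/bin/sh
# Sketch only: prepare_config is a hypothetical helper, not shipped with wikisearch.
# It copies each *.example config to its real name unless that file already exists.
prepare_config() {
    root="$1"   # path to the wikisearch checkout
    for example in \
        "$root/ingest/conf/wikipedia.xml.example" \
        "$root/webapp/src/main/resources/app.properties.example"
    do
        target="${example%.example}"   # strip the .example suffix
        if [ -e "$target" ]; then
            echo "keeping existing $target"
        else
            cp "$example" "$target" && echo "created $target"
        fi
    done
}

# Usage, from the wikisearch directory:
#   prepare_config .
#   (edit ingest/conf/wikipedia.xml and webapp/src/main/resources/app.properties)
#   mvn package
```

For parallel ingest, copy ingest/conf/wikipedia_parallel.xml.example to ingest/conf/wikipedia.xml by hand instead; the sketch covers only the default case.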
    
    Ingest
    ------
    1. Copy ingest/target/wikisearch-ingest-*.tar.gz to the cluster and untar it
    2. Copy lib/wikisearch-ingest-*.jar and lib/protobuf-java-*.jar to $ACCUMULO_HOME/lib/ext
    3. Run bin/ingest.sh with one argument: the name of the directory in HDFS where the Wikipedia XML
       files reside. This starts a MapReduce job that ingests the data into Accumulo.
       (For parallel ingest, instead run ingest/bin/ingest_parallel.sh)
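
As a rough illustration of the ingest steps, the sketch below wraps step 2 in a hypothetical helper, install_ingest_jars. It assumes ACCUMULO_HOME is set and that the argument is the directory produced by untarring the ingest tarball. The directory name and the /wikipedia HDFS path in the usage comment are assumptions taken from the example in the prerequisites.

```shell
#!/bin/sh
# Sketch only: install_ingest_jars is a hypothetical helper, not shipped with wikisearch.
# It copies the ingest and protobuf jars into Accumulo's lib/ext directory.
install_ingest_jars() {
    dist="$1"   # directory created by untarring wikisearch-ingest-*.tar.gz
    ext="${ACCUMULO_HOME:?ACCUMULO_HOME must be set}/lib/ext"
    cp "$dist"/lib/wikisearch-ingest-*.jar "$dist"/lib/protobuf-java-*.jar "$ext"/
}

# Usage on the cluster (directory name and HDFS path are assumptions):
#   tar xzf wikisearch-ingest-*.tar.gz
#   cd wikisearch-ingest-*/
#   install_ingest_jars .
#   bin/ingest.sh /wikipedia
```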
   
    Query
    -----
    1. Copy the following jars to the $ACCUMULO_HOME/lib/ext directory from the query/target/dependency directory:
    
        commons-jexl-*.jar
        guava-*.jar
        kryo-*.jar
        minlog-*.jar
        
    2. Copy query/target/wikisearch-query-*.jar to $ACCUMULO_HOME/lib/ext
    3. Use the Accumulo shell to grant the user authorizations for the wikis that you loaded, for example:
            setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki

    4. cd into webapp and run mvn jetty:run
    5. Open a browser and go to http://localhost:8080/accumulo-wikisearch/
       You can issue queries through this user interface or via the REST URL: <host>/accumulo-wikisearch/rest/query
    6. Press Ctrl-C to stop the Jetty container
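
Steps 1 and 2 of the query setup can be sketched as a single helper. install_query_jars below is hypothetical and not part of the project. It assumes ACCUMULO_HOME is set and takes the path to query/target. It reports any jar pattern that matches nothing, so a missed mvn package run surfaces early.

```shell
#!/bin/sh
# Sketch only: install_query_jars is a hypothetical helper, not shipped with wikisearch.
# It copies the query jar and its listed dependencies into Accumulo's lib/ext directory.
install_query_jars() {
    query_target="$1"   # path to query/target in the wikisearch checkout
    ext="${ACCUMULO_HOME:?ACCUMULO_HOME must be set}/lib/ext"
    status=0
    for pattern in \
        "dependency/commons-jexl-*.jar" \
        "dependency/guava-*.jar" \
        "dependency/kryo-*.jar" \
        "dependency/minlog-*.jar" \
        "wikisearch-query-*.jar"
    do
        # $pattern is left unquoted so the shell expands the wildcard.
        for jar in "$query_target"/$pattern; do
            if [ -f "$jar" ]; then
                cp "$jar" "$ext"/
            else
                echo "missing: $jar" >&2
                status=1
            fi
        done
    done
    return "$status"
}

# Usage, from the wikisearch directory:
#   install_query_jars query/target
```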

About

Fork of Apache/accumulo-wikisearch, with the goal of being simpler to set up and use.
