GitHub - tequalsme/accumulo-wikisearch: Fork of Apache/accumulo-wikisearch, with the goal of being simpler to setup and use.

This repository has been archived by the owner on Apr 9, 2020. It is now read-only.

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
ingest		ingest
query		query
webapp		webapp
.gitignore		.gitignore
README		README
pom.xml		pom.xml

Repository files navigation

Apache Accumulo Wikipedia Search Example

This project contains a sample application for ingesting and querying wikipedia data.

Prerequisites
-------------
1. Accumulo, Hadoop, and ZooKeeper must be installed and running
2. One or more wikipedia dump files (http://dumps.wikimedia.org/backup-index.html) placed in an HDFS directory
You will want to grab the files with the link name of pages-articles.xml.bz2
3. Though not strictly required, the ingest will go more quickly if the files are decompressed:

$ bunzip2 < enwiki-*-pages-articles.xml.bz2 | hadoop fs -put - /wikipedia/enwiki-pages-articles.xml

INSTRUCTIONS
------------

Configuration and Build
-----------------------
1. Copy ingest/conf/wikipedia.xml.example to ingest/conf/wikipedia.xml and change contents to specify Accumulo information
(For parallel ingest, instead copy ingest/conf/wikipedia_parallel.xml.example to ingest/conf/wikipedia.xml)
2. Copy webapp/src/main/resources/app.properties.example to webapp/src/main/resources/app.properties and change contents
as done in step 1.
3. From the wikisearch directory, run mvn package

Ingest
------
1. Copy ingest/target/wikisearch-ingest-*.tar.gz to cluster and untar
2. Copy lib/wikisearch-ingest-*.jar and lib/protobuf-java-*.jar to $ACCUMULO_HOME/lib/ext
3. Run bin/ingest.sh with one argument: the name of the directory in HDFS where the wikipedia XML
files reside, this will start a MapReduce job to ingest the data into Accumulo
(For parallel ingest, instead run ingest/bin/ingest_parallel.sh)

Query
-----
1. Copy the following jars to the $ACCUMULO_HOME/lib/ext directory from the query/target/dependency directory:

commons-jexl-*.jar
guava-*.jar
kryo-*.jar
minlog-*.jar

2. Copy query/target/wikisearch-query-*.jar to $ACCUMULO_HOME/lib/ext
3. Use the Accumulo shell and give the user permissions for the wikis that you loaded, for example:
setauths -u <user> -s all,enwiki,eswiki,frwiki,fawiki

4. cd into webapp and run mvn jetty:run
5. Open a browser and goto: http://localhost:8080/accumulo-wikisearch/
You can issue the queries using this user interface or via the REST url: <host>/accumulo-wikisearch/rest/query
6. Ctrl-C to stop the jetty container

About

Fork of Apache/accumulo-wikisearch, with the goal of being simpler to setup and use.

Readme

Activity

3 stars

3 watching

3 forks

Report repository

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ingest

ingest

query

query

webapp

webapp

.gitignore

.gitignore

README

README

pom.xml

pom.xml

Repository files navigation

About

Releases

Packages

Contributors 3

Languages

tequalsme/accumulo-wikisearch

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages