elasticrawl-examples

Example Hadoop jobs demonstrating the Elasticrawl tool. Elasticrawl is a tool for launching AWS Elastic MapReduce jobs against the Common Crawl corpus.

Jobs

  • WordCount - An implementation of the standard Hadoop Word Count example that parses text data in Common Crawl WET (WARC Encapsulated Text) files. Each WordCount job parses a single segment of Common Crawl data.

  • SegmentCombiner - Combines data from multiple Common Crawl segments to produce a single set of results.
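The two jobs above follow a count-then-merge pattern: count words within each segment, then sum the per-segment counts into one result. The following is a minimal sketch of that idea in plain Java with no Hadoop dependencies. It is not the code in this repository; the class name `WordCountSketch`, the method names, and the whitespace tokenization are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: plain Java, no Hadoop dependencies.
// Class and method names are hypothetical, not taken from this repository.
class WordCountSketch {

    // Count word occurrences in the plain text of one segment.
    // Assumption: words are lower-cased and split on runs of whitespace.
    static Map<String, Long> countWords(String text) {
        Map<String, Long> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Merge per-segment counts into a single result by summing the counts per word.
    static Map<String, Long> combine(List<Map<String, Long>> segments) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> segment : segments) {
            segment.forEach((word, count) -> total.merge(word, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two tiny stand-ins for the text of two segments.
        Map<String, Long> segmentA = countWords("the quick brown fox");
        Map<String, Long> segmentB = countWords("The lazy dog and the fox");
        System.out.println(combine(Arrays.asList(segmentA, segmentB)));
    }
}
```

In a real Hadoop job the counting step would run in mappers and reducers over WET records rather than over in-memory strings, but the summing logic is the same shape.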

Running with Elasticrawl

See http://github.com/rossf7/elasticrawl#quick-start

Building

Developed on Ubuntu 12.04 and OpenJDK 6 using Eclipse Kepler and the m2e plugin.

with Maven

git clone https://github.com/rossf7/elasticrawl-examples.git
cd elasticrawl-examples
mvn install

with Eclipse

cd ~/workspace
git clone https://github.com/rossf7/elasticrawl-examples.git
  • Open Eclipse
  • File --> Import
  • Maven --> Existing Maven Projects, then select the cloned elasticrawl-examples directory
  • Right-click the imported project, then Run As --> Maven install

Links

Thanks

  • Mark Watson for his example-warc-java that got me started with WARC files.
  • Lemur project developers for their edu.cmu.lemurproject package. Source for this is included with a couple of minor changes needed to process WET files stored on S3.

License

This code is licensed under the MIT license.
