droid-identify-hadoopjob

Introduction

Hadoop job for identifying files using DROID (Digital Record Object Identification), version 6.1, http://digital-preservation.github.io/droid/. The job reads a text file from an HDFS input path; this text file contains a list of absolute paths to file instances on network-attached storage. This requires that all worker nodes of the Hadoop cluster can access the individual files at the same path, for example by using the same mount points on all worker nodes.
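
The input is a plain text file with one absolute file path per line. For example (the paths below are purely illustrative):

/nfs/storage/collection1/document-0001.pdf
/nfs/storage/collection1/document-0002.tif
/nfs/storage/collection2/image-0001.jp2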

Installation

cd droid-identify-hadoopjob
mvn install

Usage

Execute the Hadoop job from the command line:

hadoop jar target/droid-identify-hadoopjob-1.0.jar-with-dependencies.jar 
  -d /hdfs/path/to/textfiles/with/absolutefilepaths/ -n job_name

If the text file is smaller than Hadoop's default split size (64 megabytes in the default configuration), only a single task (running on a single core) is created to process the complete list, with no benefit from running tasks in parallel at all. The number of records processed in one task can usually be controlled by the Hadoop parameter:

mapred.line.input.format.linespermap
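
As a sketch, this parameter can be passed on the command line via Hadoop's generic -D option, assuming the job accepts generic options (e.g. via ToolRunner); the value 1000 is only an illustrative choice:

hadoop jar target/droid-identify-hadoopjob-1.0.jar-with-dependencies.jar \
  -D mapred.line.input.format.linespermap=1000 \
  -d /hdfs/path/to/textfiles/with/absolutefilepaths/ -n job_name

Alternatively, the parameter can be set cluster-wide in mapred-site.xml.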

If setting this parameter does not have the desired effect, it is possible to take advantage of Hadoop's default behaviour of creating at least one task per input file. Using the Unix command:

split -a 4 -l NUMLINES absolute_file_paths.txt

the complete text file containing all paths can be split into multiple files with the desired number NUMLINES of file path lines per file, which corresponds to the desired number of file identifications to be processed per task. It is important to keep an eye on the number of records processed per task because it directly influences the task run time. As a rule of thumb, it is recommended to ensure that “each task runs for at least 30-40 seconds” (http://blog.cloudera.com/2009/12/7-tips-for-improving-mapreduce-performance/). The files can then be loaded into HDFS:

hadoop fs -copyFromLocal /local/directory/inputfiles/ /hdfs/parent/directory/

and the HDFS directory

/hdfs/parent/directory/inputfiles

can then be defined as the input directory for the Hadoop job (parameter -d).
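
Putting the steps together, a possible end-to-end workflow could look like this (the local directories, the paths_ prefix, the line count 1000, and the job name are illustrative; split writes its output into the current working directory):

cd /local/directory/inputfiles/
split -a 4 -l 1000 /local/directory/absolute_file_paths.txt paths_
hadoop fs -copyFromLocal /local/directory/inputfiles/ /hdfs/parent/directory/
hadoop jar target/droid-identify-hadoopjob-1.0.jar-with-dependencies.jar \
  -d /hdfs/parent/directory/inputfiles -n droid_identify_run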
