jweese edited this page Jan 21, 2011 · 3 revisions

The tools branch is a refactoring of the Thrax pipeline into a collection of independently runnable tools.

Each tool can be run directly from the command line. We present an example pipeline below. Note: unlike in regular Thrax usage, you have to explicitly set work-dir in your configuration file so that it stays the same across separate runs of these tools.

Extract rules from the corpus. Assuming you have the corpus set up as in the Quickstart, you can extract all the rules (without feature scores) by running

$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.ExtractionTool <conf file>

If using the lexprob feature, extract word-level lexical probabilities.

$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.TargetWordGivenSourceWordProbabilityTool <input path> <work directory>

$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.SourceWordGivenTargetWordProbabilityTool <input path> <work directory>

Java is so verbose.

Parallelization: extraction and word-level lexical probabilities can all be run at the same time.
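
As a rough sketch, the three jobs above can be started together from a small shell function that backgrounds each one and then waits for all of them. The variable names (HADOOP, JAR, CONF, INPUT, WORK_DIR) and their default values are placeholders chosen for this example, not part of Thrax; substitute your own conf file, input path and work directory.

```shell
# Sketch only: run extraction and both word-level lexprob jobs concurrently.
# All values below are placeholder defaults; override them for your setup.
HADOOP=${HADOOP:-hadoop}
JAR=${JAR:-bin/thrax.jar}
CONF=${CONF:-thrax.conf}
INPUT=${INPUT:-input}
WORK_DIR=${WORK_DIR:-work}
PKG=edu.jhu.thrax.hadoop.tools

run_first_stage() {
    # Start each job in the background and remember its process id.
    "$HADOOP" jar "$JAR" "$PKG.ExtractionTool" "$CONF" &
    pid_extract=$!
    # The next two jobs are only needed when the lexprob feature is used.
    "$HADOOP" jar "$JAR" "$PKG.TargetWordGivenSourceWordProbabilityTool" "$INPUT" "$WORK_DIR" &
    pid_tgs=$!
    "$HADOOP" jar "$JAR" "$PKG.SourceWordGivenTargetWordProbabilityTool" "$INPUT" "$WORK_DIR" &
    pid_sgt=$!

    # Wait for each job separately so that any single failure is noticed.
    status=0
    for pid in $pid_extract $pid_tgs $pid_sgt; do
        wait "$pid" || status=1
    done
    return $status
}
```

Calling run_first_stage starts all three jobs and returns non-zero if any of them fails. Setting HADOOP=echo first prints the commands instead of running them, which is handy for checking the arguments.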

Run map-reduce jobs for the features that need it. For each such feature:

$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.FeatureTool <work directory> <feature>

In the command above, <feature> is the feature's name exactly as it is written in the thrax.conf file.

Another parallelization advantage: assuming that the extraction and word-level tasks are finished, all of these feature tasks can be run in parallel without any problems!
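
One way to launch the feature jobs in parallel is a loop that backgrounds one FeatureTool run per feature and then waits for them all. This is a minimal sketch: HADOOP, JAR and WORK_DIR are placeholder variables for this example, and the feature names you pass in must be the ones from your own thrax.conf.

```shell
# Sketch only: one FeatureTool job per map-reduce feature, all in parallel.
# Placeholder defaults; override them for your setup.
HADOOP=${HADOOP:-hadoop}
JAR=${JAR:-bin/thrax.jar}
WORK_DIR=${WORK_DIR:-work}

run_features() {
    pids=""
    # Each argument is a feature name as written in thrax.conf.
    for feature in "$@"; do
        "$HADOOP" jar "$JAR" edu.jhu.thrax.hadoop.tools.FeatureTool "$WORK_DIR" "$feature" &
        pids="$pids $!"
    done

    # Collect every job; report failure if any one of them failed.
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return $status
}
```

For example, run_features f1 f2 starts two jobs at once, where f1 and f2 stand in for real feature names. Run it only after the extraction and word-level jobs have finished.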

The final step is to aggregate everything together!

$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.OutputTool <true|false> <work directory> [f1 f2 f3 ...]

The boolean given as the first argument indicates whether to label the feature scores or not. f1, f2 and so on are the names of features, again as they would be written in the config file. For this step, you need to list every map-reduce feature and every simple feature that you want included in the output.
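
Because the argument order here differs from FeatureTool (the boolean comes first, then the work directory, then the features), a thin wrapper can catch a misplaced argument before a job is submitted. The sketch below assumes OutputTool lives in the same edu.jhu.thrax.hadoop.tools package as the other tools; HADOOP, JAR and WORK_DIR are placeholder variables for this example.

```shell
# Sketch only: check the label argument, then call OutputTool.
# Placeholder defaults; override them for your setup.
HADOOP=${HADOOP:-hadoop}
JAR=${JAR:-bin/thrax.jar}
WORK_DIR=${WORK_DIR:-work}

run_output() {
    # First argument: true to label the feature scores, false otherwise.
    label=$1
    shift
    case "$label" in
        true|false) ;;
        *) echo "first argument must be true or false" >&2; return 2 ;;
    esac
    # Remaining arguments: every feature to include in the output.
    "$HADOOP" jar "$JAR" edu.jhu.thrax.hadoop.tools.OutputTool "$label" "$WORK_DIR" "$@"
}
```

For example, run_output true f1 f2 aggregates the placeholder features f1 and f2 with labeled scores.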

It's that easy! It'll be even easier once we figure out running dependent jobs and everything.