Feature function implementation

jweese edited this page Jan 16, 2011 · 1 revision

Thrax has two types of features, depending on what kind of feature you want to implement. You should extend SimpleFeature if your feature can be calculated just by looking at an individual rule (examples: counting the number of terminal symbols, a binary feature for whether two NTs are adjacent, etc.). If calculating a feature value requires looking at rules in some order and comparing them, you should extend MapReduceFeature. Both of these classes are in the edu.jhu.thrax.hadoop.features package.

After writing your feature's class, don't forget to add it to the list of features shown in the get method of FeatureFactory! This is also where you specify the feature's name -- the string you need to add to the "features" key in the config file in order to use your feature.
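The actual body of FeatureFactory.get may differ between Thrax versions, but it amounts to a mapping from the config-file name to a feature instance. A minimal sketch, where "my-feature" and MyFeature are placeholders for your own feature:

```java
public class FeatureFactorySketch {
    interface Feature { String label(); }

    // Placeholder for your feature class; in Thrax it would extend
    // SimpleFeature or MapReduceFeature.
    static class MyFeature implements Feature {
        public String label() { return "my-feature"; }
    }

    // Simplified mirror of FeatureFactory.get: the string compared here is
    // exactly what you put in the "features" key of the config file.
    public static Feature get(String name) {
        if ("my-feature".equals(name)) return new MyFeature();
        // ... the real factory matches all the other feature names here ...
        return null;
    }
}
```

If get returns null for your feature's name, Thrax has no way to run it, which is why forgetting this registration step is such a common mistake.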

SimpleFeature

Your feature should extend SimpleFeature if its value can be calculated just by looking at the rule. You need to implement the single abstract method void score(RuleWritable r), which will take in a RuleWritable right after it has been extracted. You will want to insert your feature's label and its score into the rule's feature map. (See RuleWritable for more information about this important datatype.)
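To make the shape of a SimpleFeature concrete, here is a sketch of a feature that counts terminal symbols. RuleSketch is a simplified stand-in for RuleWritable (the real class carries much more state), and the bracket convention for nonterminals is an assumption of the sketch; in Thrax you would extend SimpleFeature and implement void score(RuleWritable r) directly.

```java
import java.util.HashMap;
import java.util.Map;

public class TerminalCountSketch {
    // Simplified stand-in for RuleWritable: source side, target side, and
    // the feature map that score() writes into.
    static class RuleSketch {
        String[] source;
        String[] target;
        Map<String, Integer> features = new HashMap<>();
        RuleSketch(String[] s, String[] t) { source = s; target = t; }
    }

    // Assumption of this sketch: nonterminals are bracketed, e.g. [X,1],
    // and everything else is a terminal.
    static boolean isTerminal(String symbol) {
        return !symbol.startsWith("[");
    }

    static void score(RuleSketch r) {
        int count = 0;
        for (String s : r.source) if (isTerminal(s)) count++;
        for (String t : r.target) if (isTerminal(t)) count++;
        // Insert the feature's label and score into the rule's feature map.
        r.features.put("terminal-count", count);
    }
}
```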

MapReduceFeature

If some comparison between rules is necessary, you should extend MapReduceFeature. This is more complicated than SimpleFeature. Each MapReduce feature may need the rules sorted in a different order, so every such feature used during extraction requires its own MapReduce pass over all of the rules.

In the map step, each feature function takes in a set of (RuleWritable,IntWritable) pairs. Each RuleWritable represents a rule (obviously), and the integer value shows how many times it has been extracted. The output of the mapper is of the same type. You shouldn't alter the internals of the rule or the number of times it has been extracted.

After mapping, the rules get sorted in preparation for being reduced. Several comparators have already been written for the RuleWritable type; see its page for more details.

Finally, during the reduction phase, you get to look at rules in the order you specified during the sort. Then you can do all the comparisons you need to do.

Let's look at an example of a map-reduce feature!

We're looking at edu.jhu.thrax.hadoop.features.RarityPenaltyFeature.

We specify the mapper in the mapperClass method. Here we return Mapper, Hadoop's default mapper class, which is just an identity map.

For the comparator and the partitioner, we use YieldComparator and YieldPartitioner, respectively. YieldComparator compares rules lexicographically by their left-hand side, source, and target. The partitioner makes sure that rules with the same yield end up at the same reducer, so that they can be compared during the reduce step!
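The ordering YieldComparator imposes can be sketched in plain Java. The String fields here are a simplification; the real comparator works on Hadoop's serialized representation of RuleWritable, but the lexicographic left-hand-side/source/target order is the same.

```java
import java.util.Comparator;

public class YieldOrderSketch {
    // Simplified stand-in for RuleWritable's yield fields.
    static class Rule {
        final String lhs, source, target;
        Rule(String lhs, String source, String target) {
            this.lhs = lhs; this.source = source; this.target = target;
        }
    }

    // Lexicographic order: left-hand side first, then source, then target.
    static final Comparator<Rule> YIELD_ORDER =
        Comparator.comparing((Rule r) -> r.lhs)
                  .thenComparing(r -> r.source)
                  .thenComparing(r -> r.target);
}
```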

The reducer is the real heart of this feature. Thanks to the comparator and partitioner, we know that one reducer will see all the rules with a particular yield, and it will see them right in a row.

Each time we see a new rule (along with its count, of course!), we check whether it has the same yield as the previous rules we've seen. If it does, we save the rule and its count in a HashMap and add the count to the running total for all rules with this yield.

On the other hand, if the yield is different, the rule starts a new run. For each (rule, count) pair we've saved so far, we add "RarityPenalty" to its features map, with a feature score of Math.exp(1 - totalCount). Then we write these modified rules and their individual counts to the context. After that, we reset totalCount to zero and store the new rule and its count as the start of the next group.
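The core of that reduce logic can be sketched without Hadoop. Here the Map input stands in for one yield group of the sorted (RuleWritable, IntWritable) stream that a real reducer sees; the grouping itself is what YieldComparator and YieldPartitioner guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RarityPenaltySketch {
    // ruleCounts: rule -> extraction count, for ONE yield group.
    // Returns each rule's RarityPenalty score, exp(1 - totalCount).
    static Map<String, Double> scoreGroup(Map<String, Integer> ruleCounts) {
        int totalCount = 0;
        for (int c : ruleCounts.values()) totalCount += c;

        // Every rule in the group gets the same penalty, based on the
        // total count of all rules sharing this yield.
        Map<String, Double> penalties = new LinkedHashMap<>();
        for (String rule : ruleCounts.keySet()) {
            penalties.put(rule, Math.exp(1 - totalCount));
        }
        return penalties;
    }
}
```

Note that the penalty shrinks as the yield becomes more common: a yield seen once scores exp(0) = 1, while frequent yields approach zero.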