Featurizers

Epic's structured prediction models are built around Featurizers, which are used to convert inputs and labels into a representation suitable for learning. There are several different kinds of featurizers in Epic, many of which we discuss here.

As an example, consider building an NER system, with input sentences like this:

“The referendum will be held on May 11,” said Miroslav Rudenko, the co-chairman of the government of the Donetsk People’s Republic, as the rebels call their political wing, according to Interfax, a Russian state-controlled news service.

How do we know that "Miroslav Rudenko" is a person? For one, it's a capitalized two-word phrase, which is a pretty good indicator. For another, the word "said" appears to its left, suggesting that "Miroslav Rudenko" is probably animate. We might also have a "gazetteer" of common (Russian) names that could just tell us.

Featurizers capture this kind of information in a way that the underlying machine learning models can exploit. For NER, the canonical featurizer is a SpanFeaturizer, which takes spans of words from the sentence and outputs features.

Featurizers broadly come in two flavors: featurizers over the input sentence and featurizers over the output structure. We generally call the first kind "Surface Featurizers" and the second kind "Label Featurizers." You mostly don't need to worry about the latter.

Featurizer DSLs

For basic usage, you won't need to understand the anatomy of featurizers. Instead, you can create them with a "domain specific language," or DSL, which provides several "base" featurizers that can be combined into larger ones. A DSL is meant to offer a clean syntax for expressing something in code: here, it consists of "primitive" featurizers (like "the current word") together with combinators that build the primitives into something more complicated. For instance, the bigrams combinator takes a base featurizer and produces features over pairs of adjacent positions within a given window.

As a more concrete example, here is how the default "word featurizer" used for part-of-speech tagging is created:

import breeze.linalg.Counter2
import epic.features.WordFeaturizer

def goodPOSTagFeaturizer[L](counts: Counter2[L, String, Double]) = {
  val dsl = new WordFeaturizer.DSL[L](counts)
  import dsl._

  (
    unigrams(word, 1)       // previous, current, and next word
      + unigrams(clss, 1)   // previous, current, and next word class
      + bigrams(clss, 2)    // bigrams of (clss(-2), clss(-1)), (clss(-1), clss(0)), ...
      + bigrams(tagDict, 2) // bigrams of the most common tag for each word
      + suffixes()          // suffixes of the current word string
      + prefixes()          // prefixes of the current word string
      + props               // a set of hand-designed patterns, encoding things like
                            // capitalization, whether the word looks like a year, etc.
    )
}
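
To use the result, anchor it to a sentence and ask for the features at a position. Here's a minimal usage sketch, assuming WordFeaturizer exposes an anchor method analogous to the SpanFeaturizer interface shown below, with a featuresForWord(pos) method on the resulting anchoring; the toy counts stand in for statistics gathered from training data:

import breeze.linalg.Counter2

// Toy (tag, word) counts; in practice these come from the training set.
val counts = Counter2[String, String, Double]()
counts("NNP", "Miroslav") += 1.0
counts("NNP", "Rudenko") += 1.0
counts("VBD", "said") += 1.0

val featurizer = goodPOSTagFeaturizer(counts)

// Assumed API: anchor(...) and featuresForWord(pos), by analogy with
// the SpanFeaturizer interface described below.
val anchoring = featurizer.anchor(IndexedSeq("Miroslav", "Rudenko", "said"))
val features = anchoring.featuresForWord(1) // features for "Rudenko"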

Anatomy of a Featurizer

Here's what the interface of a typical featurizer looks like:

trait SpanFeaturizer[Word] {
  def anchor(words: IndexedSeq[Word]): Anchoring

  trait Anchoring {
    def featuresForSpan(begin: Int, end: Int): Array[Feature]
  }
}

Featurizers, like many classes in Epic, are "anchored" to a particular input, producing an intermediate object that actually does the scoring. Anchorings make it easier to cache per-input computations, saving time across related invocations.
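
As an illustration, here is a minimal sketch of a custom SpanFeaturizer written against the interface above. SpanLengthFeature and LengthFeaturizer are hypothetical names introduced for this example; epic.framework.Feature is Epic's marker trait for features:

import epic.framework.Feature

// A hypothetical feature recording the length of a span.
case class SpanLengthFeature(length: Int) extends Feature

class LengthFeaturizer extends SpanFeaturizer[String] {
  def anchor(words: IndexedSeq[String]): Anchoring = new Anchoring {
    // Nothing to precompute in this toy example; a real featurizer
    // would cache per-sentence work in this anonymous Anchoring.
    def featuresForSpan(begin: Int, end: Int): Array[Feature] =
      Array[Feature](SpanLengthFeature(end - begin))
  }
}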

Kinds of Featurizers

WordFeaturizer

Use for features over individual words, as in part-of-speech tagging; it is also commonly used as a base featurizer in other systems.

SpanFeaturizer

Use for features over spans, as in the [SemiCRF] and the [SpanModel] parser.

SplitSpanFeaturizer

Use for the [SpanModel] parser. It produces features for a (begin, split, end) triple in a sentence.
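
By analogy with the SpanFeaturizer trait above, its anchoring adds a split point to the feature query. Here is a sketch of what that interface looks like; the actual Epic trait may differ in detail:

trait SplitSpanFeaturizer[Word] extends SpanFeaturizer[Word] {
  def anchor(words: IndexedSeq[Word]): SplitAnchoring

  trait SplitAnchoring extends Anchoring {
    // Features for the span (begin, end) when split at position `split`.
    def featuresForSplit(begin: Int, split: Int, end: Int): Array[Feature]
  }
}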