
EnsembleSVM aims to be a valuable framework for algorithm prototyping. In this section we use short code examples to demonstrate how EnsembleSVM allows complex data analysis workflows to be defined and implemented with ease.

An important element of the library is its Pipeline concept, which models a highly generic data analysis workflow. In brief, a Pipeline processes a generic input a into a generic output b; neither the dimensions nor the types of a and b need to match.

The pipeline scheme is efficient: calling a pipeline involves one virtual lookup (regardless of its length) and a number of direct calls equal to its length. The length of a pipeline is the number of BasicBlocks it concatenates.
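
To make this cost model concrete, the following sketch (our own illustration, not the library's actual source) shows how a concatenated pipeline can expose a single virtual entry point while its stages are chained through direct, statically bound calls:

#include <utility>

// Abstract interface, mirroring the Pipeline template discussed below;
// the only virtual dispatch happens at this entry point.
template <typename Signature> struct Pipe;

template <typename Res, typename Arg>
struct Pipe<Res(Arg)> {
    virtual Res operator()(Arg&& input) const = 0;
    virtual ~Pipe() {}
};

// Concatenation of two concrete stages (First: Arg -> Mid, Second: Mid -> Res).
// Because First and Second are concrete types rather than pointers to the
// abstract base, the inner calls are direct.
template <typename Res, typename Mid, typename Arg,
          typename First, typename Second>
struct Chain : public Pipe<Res(Arg)> {
    typedef Arg argument_type;
    typedef Res result_type;
    First first;
    Second second;
    Chain(First f, Second s)
        : first(std::move(f)), second(std::move(s)) {}
    Res operator()(Arg&& input) const override {
        return second(first(std::move(input))); // two direct calls
    }
};

Only the call through the Pipe<Res(Arg)> base is virtual; everything inside Chain is resolved statically, so a pipeline of length n costs one virtual lookup plus n direct calls.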

EnsembleSVM uses pipelines as a flexible framework to define data analysis workflows. In particular, the pipeline scheme is very useful to define and implement aggregation schemes.
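
For example (a hypothetical stage of our own devising, not taken from the library), a majority-vote aggregation step fits naturally into this mold: it maps the base models' decision values to a single predicted label:

#include <vector>

// Hypothetical aggregation stage: majority voting over the signs of the
// base models' decision values, yielding a single binary prediction.
struct MajorityVote {
    typedef std::vector<double> argument_type;
    typedef int result_type;
    int operator()(std::vector<double>&& decisions) const {
        int votes = 0;
        for (double d : decisions) votes += (d > 0.0) ? 1 : -1;
        return (votes >= 0) ? 1 : -1;
    }
};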

Implementation

In C++ terminology, a Pipeline is a single-argument functor whose interface is defined by the following template:

template <typename Res, typename Arg>
struct Pipeline<Res(Arg)> {
    Res operator()(Arg&& input) const;
};
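
Note that the argument is taken by rvalue reference: a pipeline is handed ownership of its input rather than copying it. A call therefore looks as follows (illustrative snippet, where pipe stands for some Pipeline<double(std::vector<double>)> instance):

std::vector<double> features{1.0, 2.0, 3.0};
double score = pipe(std::move(features)); // the pipeline consumes its input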

A key aspect of pipelines is that they can be concatenated with ease. Their design is inspired by the decorator pattern, though it is considerably more involved: concatenating pipelines necessitates creating a new type, because the signature of the resulting functor may be entirely new. Consider the following (simplified) code fragment:

Pipeline<double(std::vector<double>)> bar = ...;
Pipeline<int(double)> foo = ...;
Pipeline<int(std::vector<double>)> quux = foo(bar);
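
The resulting quux is itself an ordinary pipeline and can be called directly; continuing the simplified fragment above:

std::vector<double> input{0.5, 1.5};
int label = quux(std::move(input)); // applies bar first, then foo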

In the remainder of this page, we will go into more detail on how to construct, concatenate and use pipelines within our framework. In brief, we provide the following functionality for pipelines:

  • appropriate typedefs, including argument_type and result_type
  • a set of basic pipeline elements, which offer a lot of flexibility to define common machine learning data processing schemes
  • helpful macros to implement your own pipelines and/or make new concatenations
  • automatic serialization and deserialization, transparent to the user
  • factories to facilitate concatenation, since the full concrete types can become extremely long (see the sketch below)
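
To illustrate how the typedefs and factories work together (a hypothetical sketch, building on the Chain illustration from earlier rather than the library's actual factories): given the argument_type and result_type typedefs on each stage, a factory function can deduce the rapidly growing concatenated type on the user's behalf:

#include <utility>

// Hypothetical factory: deduces the concatenated type from the stages'
// argument_type/result_type typedefs instead of making the user spell it out.
template <typename First, typename Second>
Chain<typename Second::result_type,  // Res of the concatenation
      typename First::result_type,   // intermediate type Mid
      typename First::argument_type, // Arg of the concatenation
      First, Second>
make_chain(First f, Second s) {
    return Chain<typename Second::result_type,
                 typename First::result_type,
                 typename First::argument_type,
                 First, Second>(std::move(f), std::move(s));
}

Nesting such calls, e.g. make_chain(make_chain(scale, sum), threshold) for three hypothetical stages, produces types like Chain<..., Chain<...>, ...>, which is exactly why spelling them out by hand quickly becomes impractical.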

Due to their templated and potentially nested nature, serialization and deserialization of pipelines are nontrivial, since C++ lacks dynamic typing. To solve this problem, we make a distinction between building blocks for pipelines and 'complete' pipelines that require serialization and deserialization.

We do not expect users to derive directly from the templated Pipeline class. Instead, it has two major derived types, with distinct features and requirements:

  1. BasicBlock: a templated functor which implements some elementary operation. These can be concatenated and used to form entirely new processing schemes.

  2. MultistagePipe: a non-templated functor, built from a sequence of Pipeline objects (typically BasicBlocks). MultistagePipe objects have fully automated serialization and deserialization capabilities.
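
The division of labour can be pictured roughly as follows (an illustrative sketch of the idea, reusing the Pipe interface from the earlier sketch, not the actual class definitions): a MultistagePipe-like wrapper fixes the outer signature and hides the concrete, fully templated chain behind the abstract base, which is what allows (de)serialization code to treat all instances uniformly:

#include <memory>
#include <string>
#include <utility>
#include <vector>

// Illustrative only: a wrapper with a fixed, non-templated signature.
// The concrete chain type is erased behind the abstract base, so the wrapper
// needs no template arguments; a stored identifier lets deserialization
// code reconstruct the right concrete chain.
struct SketchMultistagePipe {
    typedef std::vector<double> argument_type;
    typedef double result_type;

    std::unique_ptr<Pipe<result_type(argument_type)>> chain; // erased stages
    std::string id; // identifies the concrete chain for deserialization

    result_type operator()(argument_type&& input) const {
        return (*chain)(std::move(input)); // the single virtual dispatch
    }
};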

Inner workings of the Pipeline framework

This section serves as a guide to pipeline/core.hpp.

The internal implementation involves a fair amount of template metaprogramming and makes heavy use of SFINAE and duck typing. Users are shielded from this inner complexity as much as possible, so we will only briefly discuss how things work and why.
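
As one representative flavour of those techniques (a generic C++11 detection idiom, not code lifted from core.hpp): SFINAE allows the implementation to probe at compile time whether a type quacks like a pipeline, for instance whether it exposes a nested result_type:

#include <type_traits>

// has_result_type<T>::value is true iff T defines a nested result_type.
// If the substitution of U::result_type fails, the first overload is
// silently discarded and the variadic fallback is picked instead.
template <typename T>
struct has_result_type {
    template <typename U>
    static std::true_type test(typename U::result_type*);
    template <typename U>
    static std::false_type test(...);
    static const bool value = decltype(test<T>(nullptr))::value;
};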