
Data Pipeline

Reasons for re-working the design:

 * More seamlessly communicate with R, by sharing data structures
 * Better handle data that has a mix of categorical and real-valued variables
 * Incorporate area plots more elegantly
 * Make the pipeline more modular, so that it can be extended using plugins
 * Code clean-up: take advantage of GTK+'s GObject system for defining data objects

Main differences:

 * Splitting the pipeline after the world data stage into separate branches for area data and real-valued data
 * Brush colors and glyphs become actual variables

Old pipeline

 * Part of the datad structure (''n x p'')
   * raw data
   * tform data: data after variable transformations; simply copied when no transformations have taken place.  This data is fed to the world transformation, but it is also often retrieved in order to present data values to the user (e.g., labels in identification).
   * world data: on [-PRECISION, +PRECISION] (see the scaling sketch after this list)
   * Note: datad also contains metadata
     * by record (color, glyph type and size, shadowed, excluded, sampled, ...)
     * by variable (type, name before and after transformation, transformation and its parameters, limits)
     * by cell (missingness) 
 * Part of the splotd structure (''n x 2'')
   * planar coordinates: on [-PRECISION, +PRECISION]
   * screen coordinates: pixel positions within the plot window
   * Note: view scaling (pan and zoom) is incorporated in the transformation of planar coordinates to screen coordinates.
 * Note: This pipeline was designed for point plots. Heike's barchart/histogram code includes its own pipeline for the stages following the world transformation.
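
As a concrete illustration, the tform -> world step above amounts to rescaling each variable onto [-PRECISION, +PRECISION] using its limits. A minimal sketch, assuming a PRECISION constant and per-variable limits; the function name and exact formula are illustrative, not ggobi's actual code:

```c
#define PRECISION 32768   /* assumed half-range of world coordinates */

/* Map one transformed value onto [-PRECISION, +PRECISION] given the
   variable's limits (assumes lim_max > lim_min); lim_min maps to
   -PRECISION and lim_max to +PRECISION. */
static long
tform_to_world (double t, double lim_min, double lim_max)
{
    double unit = (t - lim_min) / (lim_max - lim_min);  /* scale to [0, 1] */
    return (long) ((2.0 * unit - 1.0) * PRECISION);     /* then to [-P, +P] */
}
```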

Changes in the proposed new pipeline

 * each stage is to be a subclass of GObject, described in a [http://www.5z.com/jirka/gob.html gob] file.
 * it should incorporate categorical data and support area plots
 * we will use signals to communicate between stages
   * when one stage's data changes, it will emit a signal
   * each stage will listen for signals emitted by the preceding stage, and sometimes those emitted by the data stage when data attributes change
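
A minimal sketch of this signal wiring, assuming each stage is a GObject with a ''changed'' signal registered on its class (all names here are illustrative):

```c
#include <glib-object.h>

/* When the upstream stage announces a change, recompute this stage's
   data and propagate by emitting our own "changed" signal. */
static void
upstream_changed_cb (GObject *upstream, gpointer user_data)
{
    GObject *self = G_OBJECT (user_data);
    /* ... recompute this stage's data from upstream ... */
    g_signal_emit_by_name (self, "changed");
}

/* Connect two adjacent pipeline stages. */
void
pipeline_link (GObject *upstream, GObject *downstream)
{
    g_signal_connect (upstream, "changed",
                      G_CALLBACK (upstream_changed_cb), downstream);
}
```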

New pipeline

 * data object (data.gob) (''n x p'')
   * also contains row-wise attributes, which will be set and accessed via {set/get}Attribute(name, value) functions (see the sketch after this list)
 * filter or subset stage: This stage will capture the function currently served by the rows_in_plot vector, which holds the indices of the currently displayed points.  (The number of displayed records can be reduced by the sampling tool, or by excluding shadowed points.)  If missings are hidden, this stage could filter them out as well.
 * imputation stage
 * transform stage: This stage will eventually be composed of sub-stages, each describing a transformation of one or more variables (e.g., log, inverse, permute)
 * ''freeze'' stage: Since the record attributes (color, glyph, etc) reside in the data object, all displays of the same data automatically share the same attributes.  This stage would allow us to create a local copy of the attributes so that displays of the same data could show different record attributes.
 * world stage: as before
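
A minimal sketch of the proposed {set/get}Attribute interface, storing row-wise attribute vectors in a hash table keyed by name. The GGobiData struct and function names are assumptions, not the actual data.gob interface:

```c
#include <glib.h>

typedef struct {
    int         nrows;
    GHashTable *attributes;   /* attribute name -> per-row value array */
} GGobiData;

GGobiData *
ggobi_data_new (int nrows)
{
    GGobiData *d = g_new0 (GGobiData, 1);
    d->nrows = nrows;
    d->attributes = g_hash_table_new (g_str_hash, g_str_equal);
    return d;
}

void
ggobi_data_set_attribute (GGobiData *d, const char *name, gpointer values)
{
    g_hash_table_insert (d->attributes, g_strdup (name), values);
}

gpointer
ggobi_data_get_attribute (GGobiData *d, const char *name)
{
    return g_hash_table_lookup (d->attributes, name);
}
```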

Note: At this point, the pipeline splits into separate branches, one for point plots and one for area plots.

 * point plot stages
   * jitter stage: Add a small amount of random noise to categorical or discrete variables in order to display them more effectively in point plots (see the sketch after this list)
   * planar stage
   * screen stage
 * area plot stages
   * binning stage: applied to continuous variables
   * categorical stage
   * screen stage
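
A sketch of what the jitter stage might do to one column, using GLib's random number support; the uniform-noise choice and the function name are assumptions:

```c
#include <glib.h>

/* Add uniform noise on (-amount, +amount) to each value of a
   categorical or discrete column so that tied points spread out.
   Create rng once with g_rand_new () and reuse it across columns. */
void
jitter_column (double *x, int n, double amount, GRand *rng)
{
    for (int i = 0; i < n; i++)
        x[i] += amount * (g_rand_double (rng) * 2.0 - 1.0);
}
```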


View scaling: Do we want to represent view scaling as a distinct stage in the pipeline? As spelled out above, it is now incorporated in the planar -> screen transformation.
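
For concreteness, the current planar -> screen step can be read as a zoom-and-pan affine map followed by conversion to pixel coordinates. A sketch under assumed names (PRECISION as the planar half-range; the y axis flipped to match screen conventions):

```c
#define PRECISION 32768   /* assumed half-range of planar coordinates */

typedef struct { double x, y; } PlanarCoord;
typedef struct { int    x, y; } ScreenCoord;

/* Apply zoom/pan in planar space, then map onto a width x height window. */
static ScreenCoord
planar_to_screen (PlanarCoord p, double zoom, double pan_x, double pan_y,
                  int width, int height)
{
    ScreenCoord s;
    double px = (p.x * zoom + pan_x) / PRECISION;   /* roughly [-1, 1] */
    double py = (p.y * zoom + pan_y) / PRECISION;
    s.x = (int) ((px + 1.0) * 0.5 * width);
    s.y = (int) ((1.0 - py) * 0.5 * height);
    return s;
}
```

If view scaling became a distinct stage, zoom and pan would live in that stage object, and changing them would emit the same ''changed'' signal as any other stage.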

Virtual variables: We have noted that many row-wise vectors exist temporarily in ggobi (or will in the redesigned ggobi) but have no permanent life in the data structure. Do we want some of the stages to be able to generate and manage new variables?

 * Attribute vectors: We could think of color or glyph as data vectors, and indeed a clusterid vector exists that we sometimes turn into a variable and append to the raw data array and feed through the pipeline.
 * Projection vectors: The planar variables are arrived at in a variety of ways, from tour projections to ''spreading'' variables generated by the ASH or textured dot plot algorithms.  It could be convenient to treat them as data vectors for some purposes.
 * Binning vectors: (These are analogous to projection vectors.) Every binned variable has a corresponding 'cut' vector of integers representing current bin ids.
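
A sketch of how a ''cut'' vector might be computed for one binned variable; the function name and the equal-width binning are assumptions:

```c
#include <math.h>

/* Fill cut[i] with the bin id of x[i] under equal-width binning of
   [min, max] into nbins bins (assumes max > min); out-of-range values
   are clamped into the first or last bin. */
void
compute_cuts (const double *x, int n, double min, double max,
              int nbins, int *cut)
{
    double width = (max - min) / nbins;
    for (int i = 0; i < n; i++) {
        int b = (int) floor ((x[i] - min) / width);
        if (b < 0)      b = 0;
        if (b >= nbins) b = nbins - 1;
        cut[i] = b;
    }
}
```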

Decorative records: We sometimes add points and line segments to the data, usually to make a model manifest in the ggobi displays. The edges in morsecodes.xml and the model in prim7.xml did not require the addition of new point records, but other enhancements sometimes do. These records are a kind of separate dataset that we run through the same pipeline as the data itself, yet we have no way of distinguishing them except by using categorical variable levels. Do we want something more powerful, or is the existing framework adequate?