
Dataflow

wukong-dataflow lets you describe your dataflow as simply as you would at a whiteboard with your peers, yet expressively enough that a robot can carry it out.

Here's a system you could use to predict the weather using Twitter:

+--------+             +--------+     +--------+    /user \    +-------+      =====
| twtr   |   /user \   | geo    |     | iden-  |   / tweet \   |       |-()->[elsrc]
| stream +->( tweet )->|  2     |-()->|  tify  |->(  weathr )->| split |      =====
|        |   \geo  /   | weather|     | topics |   \ topic /   |       |-()->[hbase]
+--------+             +-Y--^---+     +--------+    \geo  /    +-------+      =====
                         |  |
                       +-v--+---+
                       [ extrnl ]
                       [__api___]
  • The Twitter streaming API source generates events, each consisting of a user, a tweet, and (sometimes) a geolocation.
  • Decorate each event with the local weather by pinging an external API, so that we can train and validate our predictions.
  • Apply a topic extractor to label each record with relevant terms.
  • Split the composite record into database-ready models (user, tweet, weather-topic pairing, etc.)
  • Store tweets into Elasticsearch for full-text lookup, the rest into HBase for fast indexed retrieval.
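
At heart, the middle of that flow is a series of per-record decorations. Here is a minimal plain-Ruby sketch of those stages -- no wukong API implied; the sample tweets and the weather lookup are stand-ins for the real Twitter stream and external API:

    # Plain-Ruby sketch of the decorate-and-label stages; the data is made up.
    fake_tweets = [
      { user: "pat", text: "umbrella weather again", geo: [30.27, -97.74] },
      { user: "sam", text: "sunshine and tacos",     geo: nil },
    ]

    lookup_weather = ->(geo)  { geo ? "rain" : nil }   # stand-in for the external API
    extract_topics = ->(text) { text.scan(/\w+/) & %w[umbrella sunshine] }

    records = fake_tweets
      .map { |t| t.merge(weather: lookup_weather.(t[:geo])) }    # decorate with weather
      .map { |t| t.merge(topics:  extract_topics.(t[:text])) }   # identify topics
    records.each { |r| puts r }  # the split stage would route these to Elasticsearch / HBase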

Notice the following wonderful things:

  • Human Scalable. Each stage does one simple, predictable thing. You can test it in isolation, deploy it independently, use any toolset or language, and reuse components across projects. Hand the topic identifier to a PhD, the geo API widget to an intern.
  • Machine Scalable. Nowhere do we reference the number of machines, their characteristics, or even the transport. You can implement this dataflow using unix pipes and tees on your desktop or using Flume on multiple machines. What's more, you can scale out any portion of the flow across machines -- use tiny little VMs to handle network-bound stages, big burly machines to handle CPU-bound stages.
  • Resilient. The embedding transport handles failures as a unified concern, freeing your stages to be narcissistic, brittle, single-minded robots: each expects the world to be just so and breaks loudly if anything goes wrong, so it can do its one thing well.
  • High throughput even with variable latency. As long as there's enough average capacity, the framework can accommodate slow responses from the weather API or bursts in stream traffic.

There are many more illustrative pictures in this design-phase PDF.

Stage

A wukong dataflow is an assembly of stages -- action stages (which produce, consume and transform data) and product stages (which represent the data) -- into a dataflow graph that can then be implemented by an underlying transport.

Action Stages

A typical dataflow consists of action stages -- sources, sinks and transforms.

  • source -- produces data: an HTTP listener; the stock market ticker; a synthetic random-number generator; a stream of webserver log files.
  • sink -- consumes data: a database; a 'rolled' file named for the current day and hour; a debugging console in your terminal window; a remote websockets client.
  • transform -- consumes data and produces data: a weblog -> JSON parser; a router (accepts multiple input streams, writes to multiple output streams); something useless (accepts stock market data, emits the lyrics to various Monty Python songs); an aggregator (counts or averages incoming events over 10-second intervals, emits the much smaller aggregate metrics). Note that a transform is thus both a source and a sink -- the sketch below shows all three kinds.
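
A tiny plain-Ruby illustration of the three kinds -- this is not wukong's API, just the concepts expressed as callables:

    source  = Enumerator.new { |y| ["GET /a 200", "GET /b 404"].each { |line| y << line } }
    parse   = ->(line) { verb, path, code = line.split; { verb: verb, path: path, code: code.to_i } }
    console = ->(record) { puts record.inspect }   # a debugging-console sink

    source.map(&parse).each(&console)              # the transform sits between source and sink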

Action stages have the following attributes and methods:

  • name -- unique handle for the stage
  • owner -- dataflow containing this stage
  • inputs -- a set of named products it consumes
  • outputs -- a set of named products it produces
  • notify -- accepts control path events; most importantly,
    • setup: open files, initiate outbound connections, etc.
    • stop: stop accepting new data, close files, etc.
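
Sketched as a plain Ruby class (wukong's real class layout may differ):

    class ActionStage
      attr_reader :name, :owner, :inputs, :outputs

      def initialize(name, owner:, inputs: [], outputs: [])
        @name, @owner, @inputs, @outputs = name, owner, inputs, outputs
      end

      # control-path events arrive here; data-path records do not
      def notify(event)
        case event
        when :setup then puts "#{name}: opening files and outbound connections"
        when :stop  then puts "#{name}: refusing new data, closing files"
        end
      end
    end

    ActionStage.new(:geo_to_weather, owner: :weather_flow).notify(:setup)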

Product Stages

The data streams that flow among stages are represented as products. More pedantically, a product is a set of events and states that fulfill a common contract.

  • name -- unique handle for the stage
  • owner -- dataflow containing this stage
  • input -- action stage that produces this product
  • outputs -- action stages that consume this product
  • schema -- a description of the events' data structure, as either a structured schema or the name of a factory that can consume the raw event and produce active objects.
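
For instance, a product whose schema is a factory (the names here are illustrative only):

    require 'json'

    Product = Struct.new(:name, :owner, :input, :outputs, :schema)

    raw_tweets = Product.new(:raw_tweets, :weather_flow, :twitter_stream,
                             [:geo_to_weather], ->(raw) { JSON.parse(raw) })
    raw_tweets.schema.call('{"user":"pat","text":"hi"}')  # => {"user"=>"pat", "text"=>"hi"}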

Event

See concepts.md

Contracts

A contract is a guarantee about a product's events. For example, the inputs to reducers in a wukong-mapred dataflow are guaranteed to arrive in partitioned, sorted groups.

You can declare such contracts as requirements on your inputs or announcements about your outputs. The framework will warn you if a required contract isn't announced (note that it does not check the actual dataflow).
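
A hypothetical sketch of that check; `announces` and `requires` are invented names, not wukong's API:

    def check_contracts(announced, required)
      missing = required - announced
      warn "required contracts never announced: #{missing.inspect}" unless missing.empty?
    end

    check_contracts([:partitioned, :sorted_groups], [:partitioned, :sorted_groups])  # quiet
    check_contracts([:partitioned],                 [:partitioned, :sorted_groups])  # warns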

Flows

You can group stages together into flows (subgraphs); think of them as 'subroutines' for your dataflow. A flow has a set of named inputs, a set of named outputs, and internal stages. A flow is itself a transform, and the dataflow as a whole is a special type of flow.

You can specify the inputs and outputs of a flow either as placeholders or as concrete products, just as you can specify the parameters of a method as required variables or with defaults. As is true everywhere, though, wukong can provide any equivalent product bearing the same contract -- for example, capturing and replaying output from a file; mocking a database as a null sink; imposing different filenames in production and development modes. Wukong-mapred files don't care whether their input comes from stdin via the command line, stdin via hadoop streaming, or directly from hadoop via JRuby.
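
As a sketch -- the Flow class here is illustrative, not wukong's implementation -- a placeholder input binds to whatever concrete product is handy:

    class Flow
      def initialize(inputs:, outputs:, &body)
        @inputs, @outputs, @body = inputs, outputs, body
      end

      # bind placeholder inputs to concrete products at run time
      def run(bindings)
        @body.call(*@inputs.map { |name| bindings.fetch(name) })
      end
    end

    word_count = Flow.new(inputs: [:lines], outputs: [:counts]) do |lines|
      lines.flat_map(&:split).tally
    end

    word_count.run(lines: ["a b", "b c"])  # an in-memory stand-in; in development
                                           # you might bind :lines to a replay file or a mock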

Chains

A chain is a special kind of flow -- one in which each element feeds into the next. The typical unix command line, using | pipes between actions and > into a final named product, exemplifies a chain.
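
The `>` snippet under 'Slots' below suggests the shape; here is an illustrative chain built with operator overloading (again, not wukong's actual classes):

    class Chain
      def initialize(&block)
        @fn = block
      end

      def >(other)                   # feed this stage's output into the next
        Chain.new { |input| other.call(@fn.call(input)) }
      end

      def call(input)
        @fn.call(input)
      end
    end

    split = Chain.new { |lines| lines.flat_map(&:split) }
    tally = Chain.new { |words| words.tally }

    (split > tally).call(["a b", "b c"])  # => {"a"=>1, "b"=>2, "c"=>1}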

Edges

It's somewhat unusual to treat both actions and products as nodes, but it's key to wukong's ability to make your graph visible and your life simple. Typically, actions are nodes and products are an afterthought -- you could read the diagram above as if the products were just labels on an edge.

Edges on the graph are either

  • data path edges
  • notification path (aka control path) edges

These often coincide: for an HTTP source that accepts POST requests, remote clients connect to the listener and push data over that same connection. By contrast, the Twitter streaming API asks you to connect to their servers (an outbound control path link); they then push data over the long-running connection thus established (an inbound data path link).
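
In a sketch (StringIO stands in for the real long-running socket):

    require 'stringio'

    class StreamingSource
      # control path: *we* dial out to establish the connection...
      def notify(event)
        @conn = StringIO.new("tweet one\ntweet two\n") if event == :setup
      end

      # data path: ...but once established, records flow inbound to us
      def each(&block)
        @conn.each_line(&block)
      end
    end

    src = StreamingSource.new
    src.notify(:setup)
    src.each { |tweet| print tweet }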

In wukong, dataflow edges are boring.

Not all edges are legal

We've quietly introduced some constraints in the above -- it's time to list them explicitly.

  • In principle, you can only wire actions to products and products to actions. Wukong lets you describe your graph as if one action feeds into the next, but it inserts an anonymous product representing the prior action's output into the actual dataflow graph.

  • Products cannot have more than one inbound edge. You can combine multiple data streams into one output stream with an action (merge, join, etc), but the product it creates still obeys the one-input-edge rule.

  • Data flow edges may not form cycles; there must always be an identifiable 'upstream' and 'downstream'.

  • There are exactly as many products as there are (outputs) + (external inputs).
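
Those rules are easy to sketch as a tiny graph builder (illustrative only): wiring action-to-action really creates an anonymous product in between, and each product admits exactly one writer.

    require 'set'

    class Graph
      def initialize
        @writer  = {}                               # product => its single producing action
        @readers = Hash.new { |h, k| h[k] = Set.new }
      end

      def wire(action_a, action_b)
        product = :"#{action_a}_out"                # the anonymous product
        write(action_a, product)
        @readers[product] << action_b
      end

      def write(action, product)
        if @writer.key?(product) && @writer[product] != action
          raise "#{product} already has a writer: #{@writer[product]}"
        end
        @writer[product] = action
      end
    end

    g = Graph.new
    g.wire(:parse, :count)   # really parse -> :parse_out -> count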

Control path edges

Notifications

Action stage notifications:

  • setup: at setup time, a stage knows its inputs
  • close: the stage must stop accepting new data, and must de-allocate any products (close files or connections, release buffers).

Slots

  • action -> action -- `input > map(&:split) > flatten > re

Transforms

Ping External Endpoint

Multiple Retry

Drive an event

  • reset its connection
  • change parameters progressively -- e.g. a connection timeout, or the geographic range on a search.
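
A sketch of that retry-and-adjust loop; the endpoint call is faked and the names are illustrative:

    # fake external call: fails half the time, so retries matter
    def call_external_api(event, timeout:)
      raise IOError, "timed out after #{timeout}s" if rand < 0.5
      event.merge(weather: "sunny")
    end

    def ping_with_retries(event, max_tries: 3)
      timeout = 1.0
      max_tries.times do
        begin
          return call_external_api(event, timeout: timeout)
        rescue IOError
          timeout *= 2     # change parameters progressively between tries
        end
      end
      nil                  # give up; pass the event through undecorated
    end

    ping_with_retries({ user: "pat", text: "hi" })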