Skip to content

dataflow processing primitives

Philip (flip) Kromer edited this page May 9, 2012 · 1 revision

wukong/dataflow -- Dataflow Processing Primitives

Concrete Sources / Sinks

Sources have a continuous(?) flag: keep reading on end of stream, or finish? (or, maybe they always are reading, and the EOS is an event that might or might not trigger a shutdown)

  • stdin / stdout / stderr
  • file (filesystem, s3, hdfs) - tail/read; write
    • source: poll for file pattern
    • sink: roll output filename
  • hanuman log (hierarchical)
  • database
    • source: must specify query
    • sink: must specify [ key / identifying fields ; update|create ; payload fields ]
  • http request
  • http stream
  • socket / rpc
  • jabber / amqp / twitter
  • syslog / syslog-ng
  • exec file

source/sink operations

  • load, source

  • store, sink

    • dump is a console sink
  • push

  • pull

    • query
  • buffer

  • throttle


Streaming Stages

  • split

    • 1-to-all: each record goes to all outputs
    • 1-to-n: each record goes to exactly one output, based on an ordered set of filters
  • union

    • in a stream context, produces a merged stream containing all incoming records with no guarantees of order or locality.
  • process -- pass a set of inputs through a flow, receiving at a set of outputs. Note that this can be an external process, another flow, whatever

  • foreach

    • flatten

    • subgroup, subsort

    • decorate (hash or fragment-replicate join) -- hang data from a hash on each record

    • delay

  • filter

    • sample
  • limit, window


Structural Stages

  • partition

  • sort

    • distinct
  • group, cogroup --

    • join is sugar for cogroup-and-flatten
    • cross

Coordination Stages

  • counters
  • locks
  • registers
  • barrier
  • announce/discover
  • depend/invoke
  • schedule
  • idempotency control
  • stop/start/sleep
  • trigger/close
  • hooks
    • can hook in post-hoc, inline or teeing
      • mini-mode
      • monitoring/logging
      • dark launch
    • middleware

Meta Stages

  • simulate, explain, describe

    • audit
  • register sign something up to be a block

    • compose -- register a composite graph as a component
  • locality declarative advice -- influence where your data lives. See mesos' resource offers mechanism

  • bind ties machine/resource groups to flows

  • priority prioritized execution

  • guarantees

    • end-to-end -- ensures acknowledgement by destination
    • next-hop -- guarantees successful handoff to next stage, but nothing more
    • best effort -- fire and forget

Workflow Stages

  • invoke uses the same rules as normal Rake execution: If the task has run already, it won’t run again.
  • execute runs the task no matter what.

Filesystem Stages

  • directory

  • file

  • link

  • Thor defines: say, ask, yes?, no?, add_file, remove_file, copy_file, template, directory, inside, run, inject_into_file.

    • we instead treat eg. a file as a model, and perform actions on it (create, delete, etc.)
  • rule

    In rake, you can say

    rule ".html" => ".markdown" do |t| sh "markdown #{t.source} > #{t.name}" end

    Only runs if the target is missing or the source is newer.


Generating files

  • remote_file -- Fetch remote file
  • template -- Generate file from an .erb template

Running scripts

  • Execute command

  • script

    • execute runs a command
    • script runs its code block
  • schedule task

  • spawn graph to run in parallel with executor

    • supervised?
    • run as sub-process (so, dies when parent does) or independently (keeps running)

Message passing

  • Broadcast
  • Listen
  • Notify/ping
  • Request/Response

Mediums

  • Stream:

    • unix pipes
    • named pipes
    • socket
    • http streaming
    • websockets
    • syslogNG
    • Flume
    • file read/tail/append/write
  • Dispatch:

    • UDP broadcast
    • jabber
    • http post/put
  • Pull:

    • http get
    • db query
  • Listen:

    • http listen
    • poll remote resource

Job flows

invoke

executes a job only once.

class InvokeExample < Wukong::Task::Base
  desc "one", "Prints 1 2 3"
  def one
    puts 1
    invoke :two
    invoke :three
  end

  desc "two", "Prints 2 3"
  def two
    puts 2
    invoke :three
  end

  desc "three", "Prints 3"
  def three
    puts 3
  end
end
Clone this wiki locally