workflow

wukong/workflow -- expressive workflow assembly

Workflow trichotomy:

'makefile'-style -- reverse dependency graph
- Know your endpoint but not your beginning
- strong idempotency
- Doesn't like to be dependent on data
- describe dependencies of products
- triggering a product triggers backwards on graph until products are grounded
'script'-style -- forward-running workflow graph
- set of steps
- triggered in order
- Know your beginning but not your endpoint [may have many choices]
- Know your direction/velocity
Resource invocation -- imperative actions
- assemble resources
- trigger actions on those resources
- for example, a 'git repo' resource: you can ensure its existence (by cloning), set to a specific branch or commit, delete it, pull, fetch, push, merge.
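
The 'makefile'-style traversal described above can be sketched in a few lines of plain Ruby. This is only an illustration: the `Product` class, its `trigger` and `grounded?` methods, and the product names are hypothetical, not part of Wukong's API.

```ruby
# Hypothetical sketch of a reverse dependency graph; not Wukong's API.
class Product
  attr_reader :name, :dependencies

  def initialize(name, dependencies = [], &builder)
    @name = name
    @dependencies = dependencies
    @builder = builder
    @built = false
  end

  # A product with no dependencies is "grounded".
  def grounded?
    dependencies.empty?
  end

  # Triggering walks backwards through the graph, building every
  # dependency before the product itself. Triggering an already-built
  # product is a no-op, which is the idempotency this style relies on.
  def trigger(log = [])
    return log if @built
    dependencies.each { |dep| dep.trigger(log) }
    @builder&.call
    @built = true
    log << name
  end
end

RAW    = Product.new(:raw_tweets)
PARSED = Product.new(:parsed_tweets, [RAW])
REPORT = Product.new(:report, [PARSED])

# Only the endpoint is named; the starting point is discovered.
BUILD_ORDER = REPORT.trigger
```

Here `BUILD_ORDER` lists the grounded product first and the requested endpoint last; triggering `REPORT` a second time builds nothing.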

For any defined abstraction layer:

Only important that the contract is adhered to
No implication that there are lower level abstraction layers
May show a forward-looking vision of elegant lower level abstractions

(Having a graph will help you express parallel execution)

can refer to a job by its intrinsic info
can refer to a job not yet defined

example workflow

    chain :twitter_parse do
      wukong_rb 'parse_api.rb'
      pig       'uniq_and_unsplice.pig'
    end

    Wukong.workflow(:launch) do
      task :aim do
        #...
      end
      task :enter do
      end
      task :commit do
        # ...
      end
    end

    Wukong.workflow(:recall) do
      task :smash_with_rock do
        #...
      end
      task :reprogram do
        # ...
      end
    end

Workflow

Wukong workflows work somewhat differently from what you may be familiar with in Rake and similar tools.

In Wukong, a stage corresponds to a product; you can then act on that product.

Consider first compiling a C program:

to build the executable, run `cc -o cake eggs.o milk.o flour.o sugar.o -I./include -L./lib`
to build files like '{file}.o', run `cc -c -o {file}.o {file}.c -I./include`

In this case, you define the steps, implying the products.
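
As a rough illustration, the two rules above can be written as a lookup from a product name to the command that builds it. The `command_for` method and the `OBJECTS` constant are hypothetical names introduced for this sketch; the commands are only returned as strings, never executed.

```ruby
# Hypothetical rule table mirroring the two build rules above.
OBJECTS = %w[eggs.o milk.o flour.o sugar.o].freeze

# Returns the command that would produce the given product.
def command_for(target)
  case target
  when 'cake'
    "cc -o cake #{OBJECTS.join(' ')} -I./include -L./lib"
  when /\A(?<file>.+)\.o\z/
    # The '{file}.o' pattern rule: the named capture plays the role of {file}.
    file = Regexp.last_match(:file)
    "cc -c -o #{file}.o #{file}.c -I./include"
  else
    raise ArgumentError, "no rule for #{target}"
  end
end

command_for('eggs.o') # => "cc -c -o eggs.o eggs.c -I./include"
```

Each rule names a step, and the product it yields is implied by the pattern that matched.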

Something Rake can't do (but we should be able to): define a dependency that runs last.

Run

A run is the event that ensues when you invoke a workflow. Invoking the bake_pie workflow at 01:20:55 on Jan 30, 2012 results in the bake_pie-20120130012055 run.
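
A minimal sketch of how such a run name could be derived from the workflow name and start time. The helper name `run_id_for` and the choice of UTC are assumptions made for illustration.

```ruby
# Hypothetical helper: builds a run name from a workflow name and start time.
# Using UTC is an assumption; the text above does not name a time zone.
def run_id_for(workflow_name, started_at = Time.now.utc)
  "#{workflow_name}-#{started_at.strftime('%Y%m%d%H%M%S')}"
end

run_id_for(:bake_pie, Time.utc(2012, 1, 30, 1, 20, 55))
# => "bake_pie-20120130012055"
```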

Stages

Stage

A stage is a data process having

one input, an array of length one called inputs. (later: multiple inputs, named inputs)
one output, called output (later: multiple outputs, named outputs)
(later) an error channel named :error.

Any stage can be invoked by name; only that stage is executed.

Chain

A chain runs a sequence of stages, one after the other, in order. A chain is itself a stage; it has an array of sub-stages (called steps) that it will execute in order.

The input to the chain becomes the input to the first stage, and the output of the last stage becomes the output of the chain.

You can of course invoke any stage within a chain directly.
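
The stage and chain contract described above might look like this in plain Ruby. The `Stage` and `Chain` classes here are hypothetical stand-ins written for illustration, not Wukong's own classes.

```ruby
# Hypothetical classes; names are illustrative, not Wukong's API.
class Stage
  attr_reader :name

  def initialize(name, &block)
    @name = name
    @block = block
  end

  # inputs is an array of length one; the return value is the output.
  def call(inputs)
    @block.call(inputs.first)
  end
end

# A chain is itself a stage whose steps run in order.
class Chain < Stage
  attr_reader :steps

  def initialize(name, steps)
    super(name)
    @steps = steps
  end

  # The chain's input feeds the first step; each step's output feeds
  # the next; the last step's output is the chain's output.
  def call(inputs)
    steps.reduce(inputs.first) { |value, step| step.call([value]) }
  end

  # Look up a single step by name so it can be invoked directly.
  def [](step_name)
    steps.find { |step| step.name == step_name }
  end
end

PARSE = Stage.new(:parse) { |line| line.split(',') }
COUNT = Stage.new(:count) { |fields| fields.length }
TWITTER_PARSE = Chain.new(:twitter_parse, [PARSE, COUNT])

TWITTER_PARSE.call(['a,b,c'])        # runs both steps, => 3
TWITTER_PARSE[:parse].call(['a,b'])  # runs only :parse, => ["a", "b"]
```

Because `Chain` subclasses `Stage`, a chain can be used as a step inside another chain without special handling.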

ShellProcess (?name?)

A shell_process invokes the swineherd runner.

hash of config variables
?ordered? inputs
one output, named :output, and an error channel named :error

Input and Output

By default, a stage’s inputs are specified by the outputs of its dependencies.

File name templates

Output asset names

The output asset names are constructed from the stage’s metadata. There is a small set of pathname templates (in fact, only one):

Development mode output pathname template


somehow: %{user}, %{run_id}, %{session}, %{run_index}, %{prod|dev|test}

(?implement a template that you think works, those are some possible ingredients we’ll codify &/or fix?)

(later) Automated mode output pathname template (used when deployment class is prod and test): /%{project_path}/%{run_id}/%{transformed_stage_name}-%{deployment_class} (just implement something sensible, we’ll figure out the details)

somehow: %{user}, %{session}, %{run_index}, %{prod|dev|test}, %{timestamp}

project_path: A container for runs for the same purpose/project
session: A temporally close connected set of runs
run_index: An auto-incremented counter for the runs
deployment_class: The type of deployment instantiation. These may be used for more than one granularity of sets of runs.
run_id: The time the run started and some other information to uniquely identify this specific invocation of the workflow. (?complete as you find natural?)
timestamp: timestamp of run. everything in this invocation will have the same timestamp.
user: username; ENV['USER'] by default
sources: basenames of job inputs, minus extension, non-\w replaced with '_', joined by '-', max 50 chars.
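
One possible reading of the `sources` rule and the automated-mode template, sketched with Ruby's built-in `format` and its `%{name}` references. The helper names `sources_slug` and `output_path` and the sample metadata values are assumptions; how `transformed_stage_name` is derived is left open, as in the text above.

```ruby
# Hypothetical helpers; the names are assumptions for illustration.

# sources: basenames of inputs, minus extension, with non-\w characters
# replaced by '_', joined by '-', truncated to 50 characters.
def sources_slug(input_paths)
  input_paths
    .map { |path| File.basename(path, '.*').gsub(/\W/, '_') }
    .join('-')[0, 50]
end

# The automated-mode template quoted above, expanded by Kernel#format.
AUTOMATED_TEMPLATE =
  '/%{project_path}/%{run_id}/%{transformed_stage_name}-%{deployment_class}'

# metadata is a hash with one symbol key per template field;
# a missing field raises KeyError.
def output_path(metadata)
  format(AUTOMATED_TEMPLATE, metadata)
end

sources_slug(['/data/tweets-2012.json', 'users.tsv'])
# => "tweets_2012-users"
```

Since Ruby's format strings already use the `%{name}` syntax, the templates in this section can be expanded without a separate templating layer.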

Explicit asset names

Normally, one should not rename inputs and outputs. However, there are some (hopefully rare) cases where they may be renamed. Example cases include:

You can override the default input name to adapt to external processes:

(show how)
(make sure I can still inject an explicit name at execution time)

You can also inject an explicit name:

(show how)

Dependencies

...

Execution

Configuration

Commandline args

handled by configliere: nukes launch --launch_code=GLG20
TODO: configliere needs context-specific config vars, so I only get information about the launch action in the nukes job when I run nukes launch --help

versioning of clobbered files

when files are generated or removed, relocate to a timestamped location
- a file /path/to/file.txt is relocated to ~/.wukong/backups/path/to/file.txt.wukong-20110102120011 where 20110102120011 is the job timestamp
- accepts a max_size param
- raises if it can't write to directory -- must explicitly say --safe_file_ops=false
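
A minimal sketch of the relocation step, assuming the backup root shown above. The method names and the `backup_root:` keyword are hypothetical; the `max_size` limit and the `--safe_file_ops` switch are not modelled here.

```ruby
require 'fileutils'

# Hypothetical helpers; names and signatures are assumptions.

# Maps /path/to/file.txt to <backup_root>/path/to/file.txt.wukong-<timestamp>.
def backup_path_for(path, timestamp, backup_root: File.expand_path('~/.wukong/backups'))
  relative = File.expand_path(path).sub(%r{\A/}, '')
  File.join(backup_root, "#{relative}.wukong-#{timestamp}")
end

# Moves the file out of the way before it is overwritten or removed.
# FileUtils raises if the backup directory cannot be created or written.
def relocate_before_clobber(path, timestamp, backup_root: File.expand_path('~/.wukong/backups'))
  destination = backup_path_for(path, timestamp, backup_root: backup_root)
  FileUtils.mkdir_p(File.dirname(destination))
  FileUtils.mv(path, destination)
  destination
end
```

Every file touched in one job shares the job timestamp, so the backups from a single run can be found together.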

Actions

each action

the default action is call
all stages respond to the nothing action and, like ze goggles, do nothing.
clobber -- run, but clear all dependencies
undo --
clean --
create --
update -- applies the given *note the difference
delete
invoke --
run --

Standard Products

Utility and Filesystem stages

The primitives correspond heavily with Rake and Chef. However, they extend them in many ways, don't cover all their functionality in many ways, and are incompatible in several ways.

directory, symlink
template -- fill in a file with variables supplied at runtime
remote_file --
git_repo
script
- with specializations like hadoop_job, r_script
remote_request -- call to an external product over the network
- http_request -- a type of remote request
