dataflow scratchpad for proposed syntax

wukong/dataflow -- Scratchpad for Proposed Syntax

Data flows

you can have a consumer connect to a provider, or vice versa
- producer binds to a port, consumers connect to it: pub/sub
- consumers open a port, producer connects to many: megaphone
you can bring the provider on line first, and the consumers later, or vice versa.

You can run this from the commandline:

wukong count_followers.rb users.json followers_histogram.tsv

It will run in local mode, effectively doing

cat users.json | {the map block} | sort | {the reduce block} > followers_histogram.tsv

You can instead run it in Hadoop mode, and it will launch the job across a distributed Hadoop cluster

wukong --run=hadoop count_followers.rb users.json followers_histogram.tsv

Data Formats (Serialization / Deserialization)

tsv/csv
json
xml
avro
apache_log
flat
regexp
Tagged Netstrings
ZeroMQ Property Language
gz/bz2/zip/snappy

Data Packets

Data consists of

record
schema
metadata

Syntax A

read('/foo/bar') # source( FileSource.new('/foo/bar') ) writes('/foo/bar') # sink( FileSink.new('/foo/bar') )

... | file('/foo/bar') # this we know is a source file('/foo/bar') | ... # this we know is a sink file('/foo/bar') # don't know; maybe we can guess later

Here is an example Wukong script, count_followers.rb:

from :json

mapper do |user|
  year_month = Time.parse(user[:created_at]).strftime("%Y%M")
  emit [ user[:followers_count], year_month ]
end   

reducer do 
  start{ @count = 0 }

  each do |followers_count, year_month|
    @count += 1
  end

  finally{ emit [*@group_key, @count] }
end

Provide feedback

Saved searches