Skip to content

dataflow scratchpad for proposed syntax

Philip (flip) Kromer edited this page May 9, 2012 · 1 revision

wukong/dataflow -- Scratchpad for Proposed Syntax

Data flows

  • you can have a consumer connect to a provider, or vice versa

    • producer binds to a port, consumers connect to it: pub/sub
    • consumers open a port, producer connects to many: megaphone
  • you can bring the provider on line first, and the consumers later, or vice versa.

You can run this from the commandline:

wukong count_followers.rb users.json followers_histogram.tsv

It will run in local mode, effectively doing

cat users.json | {the map block} | sort | {the reduce block} > followers_histogram.tsv

You can instead run it in Hadoop mode, and it will launch the job across a distributed Hadoop cluster

wukong --run=hadoop count_followers.rb users.json followers_histogram.tsv

Data Formats (Serialization / Deserialization)

Data Packets

Data consists of

  • record
  • schema
  • metadata

Syntax A

read('/foo/bar') # source( FileSource.new('/foo/bar') ) writes('/foo/bar') # sink( FileSink.new('/foo/bar') )

... | file('/foo/bar') # this we know is a source file('/foo/bar') | ... # this we know is a sink file('/foo/bar') # don't know; maybe we can guess later

Here is an example Wukong script, count_followers.rb:

from :json

mapper do |user|
  year_month = Time.parse(user[:created_at]).strftime("%Y%M")
  emit [ user[:followers_count], year_month ]
end   

reducer do 
  start{ @count = 0 }

  each do |followers_count, year_month|
    @count += 1
  end

  finally{ emit [*@group_key, @count] }
end