This repository has been archived by the owner on Jan 6, 2022. It is now read-only.

Data Pipeline Configuration and Datscript Proposal #17

Open
melaniecebula opened this issue Jan 9, 2015 · 16 comments

@melaniecebula
Contributor

Data Pipeline Configuration and Datscript Proposal


Goal

Create a data pipeline configuration that makes sense. This involves:

Pipeline: datscript --> hackfile parser --> hackfile --> gasket

Datscript


Keywords

Command-Types:

run: runs following commands serially
pipe: pipes following commands together
fork: runs following commands in parallel, next command-type waits for these commands to finish
background: similar to fork, but next command-type does not wait for these commands to finish
map: multiway-pipe from one to many; pipes first command to rest of commands
reduce: multiway-pipe from many to one; pipes rest of commands to first command

Other Keywords:

pipeline: keyword for distinguishing a pipeline from other command-types

Datscript Syntax


A command-type {run, pipe, fork, background, map, reduce} is followed by its args in either of two formats:

Format 1:

{command-type} {arg1}
  {arg2}
  {arg3}
  ....

Format 2:

{command-type}
  {arg1}
  {arg2}
  {arg3}
  ....

pipeline {pipeline-name} followed by either of the previous command-type formats:

pipeline {pipeline-name}
    {command-type}
      {arg1}
      {arg2}
      {arg3}
      ....  
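The indentation rule above can be captured with a small parser: an unindented line starts a new group (with an optional inline first arg), and indented lines attach to the most recent group as args. A minimal single-level sketch in JavaScript — this is not the actual hackfile-parser, and `parse` is a hypothetical name; it also ignores the pipeline keyword and nested indentation:

```javascript
const COMMAND_TYPES = new Set(['run', 'pipe', 'fork', 'background', 'map', 'reduce']);

// Parse a flat datscript snippet into [{ type, args }] groups.
// An unindented line starts a new group; an inline argument on the
// command-type line becomes the first arg; indented lines are appended
// to the current group's args.
function parse(source) {
  const groups = [];
  for (const raw of source.split('\n')) {
    if (!raw.trim()) continue;
    const indented = /^\s/.test(raw);
    const line = raw.trim();
    if (!indented) {
      const [type, ...rest] = line.split(/\s+/);
      if (!COMMAND_TYPES.has(type)) throw new Error('unknown command-type: ' + type);
      groups.push({ type, args: rest.length ? [rest.join(' ')] : [] });
    } else {
      if (!groups.length) throw new Error('indented line with no command-type');
      groups[groups.length - 1].args.push(line);
    }
  }
  return groups;
}
```

Under this sketch, Format 1 (`run bar` plus an indented `baz`) and Format 2 (`run` with both args indented) parse to the same group, which matches the equivalence the examples below describe.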

Commands in detail


Run Command:

run will run each command serially; that is, it will wait for the previous command to finish before starting the next command.

The following all result in the same behavior, since the run command is serial:

Example 1:

run bar
run baz

Example 2:

run
  bar
  baz

Example 3 (not best-practice):

run bar
  baz

Pipe Command:

pipe will pipe each command together; that is, it takes the first command and pipes its output to the next command, and so on until the end, where the final output goes to std.out. pipe with only one supplied command is undefined.

Example 1: prints "A" to std.out

pipe
  echo a
  transform-to-uppercase
  cat

Example 2: prints "A" to std.out

pipe echo a
  transform-to-uppercase
  cat

Example 3: INVALID because both transform-to-uppercase and cat need input (and since these are separate groupings, these lines are NOT piped together)

pipe echo a
pipe transform-to-uppercase
pipe cat

Example 4: prints "A" to std.out, prints "B" to std.out

pipe
  echo a
  transform-to-uppercase
  cat
pipe
  echo b
  transform-to-uppercase
  cat

Fork Command:

fork will run each command in parallel in the background. The next command-type will wait for these commands to finish. If there is no next command-type, gasket will implicitly wait for these commands to finish before exiting. Forked commands are not guaranteed to execute in the order you supply them.

Example 1 (best-practice):

  • will print a and b to std.out (in either order)
  • after completing those commands, will print baz to std.out
fork
  echo a
  echo b
run echo baz

Example 2: Same output as Example 1 (not best-practice)

fork echo a
  echo b
run echo baz

Example 3: Will print a and b to std.out (in either order), before exiting.

fork
  echo a
  echo b

Background Command

background will run each command in parallel in the background. The next command-type will NOT wait for these commands to finish. If there is no next command-type, gasket will NOT wait for these commands to finish before exiting. Background commands are not guaranteed to execute in the order you supply them.

Example 1 (best-practice):

  • will print a, b, and baz to std.out (in either order)
background
  echo a
  echo b
run echo baz

Example 2: Same output as Example 1 (not best-practice)

background echo a
  echo b
run echo baz

Example 3: Starts a node server; run echo a does not wait for run node server.js to finish. After completing the last command (in this case, run echo a), gasket will NOT wait for background commands (run node server.js) to finish, but will properly terminate them.

background
  run node server.js
run echo a

Map Command

map is a multiway-pipe from one to many. That is, it pipes the first command to the rest of the provided commands. The rest of the provided commands are treated as fork commands. Therefore, the "map" operation pipes the first command to the rest of the provided commands in parallel (and therefore no order is guaranteed). map with only one supplied command is undefined.

Example 1 (best-practice): In either order:

  • pipes data.json to dat import
  • pipes data.json to cat
map curl http://data.com/data.json
  dat import
  cat

Example 2: Same output as Example 1

map
  curl http://data.com/data.json
  dat import
  cat

Reduce Command

reduce is a multiway-pipe from many to one. That is, it pipes the rest of the commands to the first command. The rest of the provided commands are treated as fork commands. Therefore, the "reduce" operation pipes each of the provided commands to the first command in parallel (and therefore no order is guaranteed). reduce with only one supplied command is undefined.

Example 1 (best-practice): In either order:

  • pipes papers to dat import
  • pipes taxonomy to dat import
reduce dat import
  papers
  taxonomy

Example 2: Same output as Example 1

reduce
  dat import
  papers
  taxonomy

Defining and Executing a Pipeline

The pipeline keyword distinguishes a pipeline from the other command-types. Pipelines are a way of defining groups of command-types that can be treated as data (a command) to be run by any command-type.

Example 1: An import-data pipeline is defined. It imports 1, 2, 3 in parallel before printing "done importing" to std.out. After converting from datscript, the pipeline can be run from the command line with gasket run import-data

pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing

Example 2: Same output as Example 1, but run from within the datscript file.

run import-data
pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing
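One way to realize "pipelines as data" is a name-to-commands table: when a command-type's argument matches a defined pipeline name, it expands to that pipeline's commands instead of being executed as a shell command. A minimal sketch — `definePipeline` and `resolve` are hypothetical names, not gasket's API:

```javascript
// Registry of named pipelines: pipeline name -> list of commands.
const pipelines = {};

// `pipeline {name}` definitions populate the registry.
function definePipeline(name, commands) {
  pipelines[name] = commands;
}

// Resolve a command-type argument: a known pipeline name expands to
// its commands; anything else stays a plain command to execute.
function resolve(arg) {
  return pipelines[arg] || [arg];
}
```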

You cannot nest pipeline definitions (they should always be at the shallowest layer), but you can nest as many command-types within a pipeline as you like.

Example 3: Nested command-types in a pipeline. Will print a, then print b, then print C

pipeline baz
  run
    echo a
    echo b
    pipe
      echo c
      transform-to-upper-case
      cat

Example 4: INVALID: Pipelines can only be defined at the shallowest layer.

pipeline foo
  run echo a
  pipeline bar

//TODO: Lots of tricky cases to think about here.
Example 5: Executing non-run command-types on a pipeline. In this example, we define a pipeline foo, which has baz and bat defined (without a command-type provided). Then we map bar onto the pipeline foo (so we pipe bar into baz and also into bat, in parallel). One problem here is that this pipeline foo definition might be invalid syntax.

map bar
  foo

pipeline foo
  baz
  bat

Misc

This issue is still a WIP. A lot of this concerns datscript directly, but it will ultimately shape gasket (so I think it belongs here).

@melaniecebula
Contributor Author

@Karissa brought up a good point. Is there any distinction between a single pipe command and a run command? Thoughts @mafintosh @maxogden ?

@max-mapper
Contributor

maybe a pipe with a single command should cause a warning/error that says something like warning on line 7: pipe should only be used with multiple commands, otherwise use run etc

@okdistribute

it sounds like in those cases we should just only use run in the documentation. if we don't have any examples using pipe on a single line it'll do most of the work for us

@maxogden that isn't a bad idea, could be nice to have a --verbose option that outputs stuff like this. but that lies more in feature request territory

@melaniecebula
Contributor Author

I can edit the documentation to say that one-line pipe commands are not officially supported (and then edit the examples that used them)

@mafintosh
Contributor

+1 for removing one-line pipes from docs

@okdistribute

@melaniecebula yeah, that seems like that could go in the detailed documentation about the 'pipe' command

@melaniecebula
Contributor Author

Okay, I think that'll clean up some of the confusion for one-line map and one-line reduce commands as well.

@melaniecebula
Contributor Author

made changes: Added notes about undefined behavior for pipe, map, and reduce when only supplied with one command. Removed one-line pipe/map/reduce examples

@max-mapper
Contributor

inspiration: https://github.com/toml-lang/toml

@max-mapper
Contributor

I thought about the pipeline foo: syntax some more, and I kind of think we should drop the : and just have it be pipeline foo

Reasoning is that it's the only 'special' syntax we have, and on the call yesterday we came up with it because it adds a second namespace of commands which makes things more futureproof.

But I think it's a little too complex for a first version, and you can get most problems by being wise about reserving keywords in the design of your DSL.

Relevant IRC:

(screenshot: IRC log, 2015-01-10)

@melaniecebula
Contributor Author

That makes sense to me! I agree.

@melaniecebula
Contributor Author

I've updated the issue to reflect dropping the ":", but keeping "pipeline" as a keyword.

@max-mapper
Contributor

excellent, in the interest of simplicity I think we should try and keep any 'special' syntax out of the first version of hackfiles (this includes argument placeholders like $1 for now). So @melaniecebula if you wanna take a stab at forking mafintosh/hackfile-parser that would probably be a good place to start

@melaniecebula
Contributor Author

I agree. I think it's something that gasket can handle instead (the pipeline keyword and details like that). Sounds good! I plan on messing around with it after I get some lunch!

@okdistribute

A ++1 to not having special syntax stuff in hack files

@max-mapper
Contributor

@melaniecebula I think @mafintosh and I discussed it and figured 'fork' and 'background' would be implemented the same, using npm-execspawn (similar to child_process.spawn). The actual specifics of the implementation we didn't discuss though. Might need some sort of process cluster/state module (using e.g. require('run-parallel') or something)
