Skip to content

Latest commit

 

History

History
616 lines (522 loc) · 23.4 KB

LANGUAGE.md

File metadata and controls

616 lines (522 loc) · 23.4 KB

Reflow language overview

Reflow defines a domain specific language (DSL) for expressing workflow computation. The language allows the user to easily and concisely express workflows that compose tools that exchange data via files. The Reflow language is:

  • functional: it emphasizes expressions and composition
  • lazily evaluated: expressions are evaluated only as needed
  • referentially transparent: expressions may be replaced with the value they produce
  • modular: code is packaged in reusable, parameterized modules
  • implicitly parallel: computations sequenced only when there are data dependencies
  • type safe: programs are checked for type correctness, catching common errors before programs are run

This combination of traits means that users can write straightforward programs that perform the smallest amount of computation needed for the desired result while computation is fully parallelized.

Because Reflow is referentially transparent, the runtime can freely memoize the results of sub expressions, reusing them within a program or multiple programs.

Reflow takes many syntactic hints from Go, and the grammar is simple, compact, and regular. It shares many of Go's identifiers, but also simplifies value constructors to be more appropriate for an applicative language like Reflow.

The full grammar is documented in package syntax.

Basic expressions

Every Reflow file (files ending with ".rf") is a module. If a module contains a toplevel identifier called Main, it may be run by the reflow run command. A module is a set of declarations, binding expressions to identifiers. The following module implements the classic "hello, world!", in Reflow:

val Main = "hello, world!"

When we invoke reflow run, Reflow evaluates the expression bound to Main, and prints the resulting value:

% reflow run hello.rf
"hello, world!"

Reflow is type safe: before evaluating the expression, the module was checked by the type checker, and all expressions without an explicit type declaration were assigned a type. We can examine these with reflow doc: in this case, Main was inferred to be of type string, as expected.

% reflow doc hello.rf
Declarations

val Main string

We can also explicitly assign types to identifiers. For example, if we modify our program to be

val Main int = "hello, world!"

We are (wrongly) declaring that Main should be of type int. This will cause Reflow's type checker to complain.

% reflow run hello.rf
hello.rf:1:31: cannot use value (type string) as type int

Syntactic overview

The following is an overview of Reflow's syntactic features:

strings (type string)
UTF-8-encoded Unicode strings; examples: "hello, world!", `raw strings"!@#$`
integers (type int)
Integers are arbitrary precision; examples: 1, 2, 31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679.
floats (type float)
Floats are arbitrary precision; examples: 1.0, 2.23, 3e10, 0.5e-8, 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679.
booleans (type bool)
true, false
files (type file)
Files are blobs of bytes; can be imported from external URLs: file("s3://grail-marius/test"); or from a local file: file("/path/to/file").
directories (type dir)
Directories are dictionaries mapping paths (string) to files; can be imported from external URLs: dir("s3://grail-marius/testdir/"); or from a local file: dir("/path/to/dir/").
tuples (type (t1, t2, t3, ..))
Tuples are an ordered, fixed-size list of heterogeneously typed elements; examples: (1, "foo", "bar") (type (int, string, string)), ("foo", (1,2,3), 3) (type (string, (int, int, int), int)).
lists (type [t])
Lists are variable length collections of homogenously typed elements; examples: [1, 2, 3, 4] (type [int]), ["a", "b", "c"] (type [string]).
maps (type [k:v])
Maps are a mapping of keys to values; examples: ["one": 1, "two": 2] (type [string: int]), [1: 10, 2: 20] (type [int: int]).
records (type {f1 t1, f2 t2, f3 t3}
Records store an unordered collection of typed fields; examples: {a: 123, b: "hello world"} (type {a int, b string}).
sum types (type #T1(t1) | #T2(t2) | #T3(t3))
Sum types (a.k.a. algebraic data types, variant types, unions) express multiple disjoint possibilities for a value. For example, you might want to express the idea of "nil or some integer value", which you could encode as #Nil | #SomeInt(int). Another use case might be expression of "file or directory", which you could encode as #File(file) | #Dir(dir). As you may have noticed from the #Nil variant above, variants do not require elements, so you can use sum types to encode "enumerations", e.g. #Yes | #No | #Maybe. Sum types are also polymorphic, which means that you can use variants anywhere they structurally fit:
type YesNo #Yes | #No
type YesNoMaybe #Yes | #No | #Maybe
func parseDecision(s string) YesNo =
	if s == "yes" {
		#Yes
	} else {
		#No
	}
}
func printDecision(d YesNoMaybe) string =
	switch d {
	case #Yes:
		"yes"
	case #No:
		"no"
	case #Maybe:
		"maybe"
	}
Main := printDecision(parseDecision("nope")) // prints "no"
Here, we pass a #Yes | #No to a function expecting a #Yes | #No | #Maybe. Because every variant of #Yes | #No is compatible with #Yes | #No | #Maybe, this is allowed.
assignment
Values can be assigned identifiers with val; example: val ident = 123 (the identifier ident has type int). Assignments can specify a type annotation: val ident int = 123 and can also perform pattern matching (see below), and a syntax shortcut is provided for simple assignments where neither is required: ident := 123.
conditionals
Conditionals compute a branch depending on a condition; example: if x < 0 { -x } else if x >= 0 && x < 2 { x } else { x - 2 }
switch expressions
Switch expressions can be used to try to match multiple patterns. They evaluate to the value of the matching case's expression.
val someList [string] = ...
switch someList {
case []:
	"the list is empty"
case [s]:
	"the list has exactly one element: " + s
case [s, _, ..._]:
	"the list has more than one element, and the first one is: " + s
}
The cases of a switch expression must be exhaustive:
val someList [string] = ...
switch someList {
case []:
	"the list is empty"
}
// ERROR: ...because any non-empty list will not match a case.
Switches are particularly useful when used with sum types:
pi := 3.14159
type square {length float}
type rectangle {length float, width float}
type circle {radius float}
type shape #Point | #Custom(float) | #Square(square) | #Rectangle(rectangle) | #Circle(circle)
func computeArea(s shape) float =
	switch s {
	case #Point:
		0.0
	case #Custom(a):
		a
	case #Square(s):
		s.length * s.length
	case #Rectangle(r):
		r.length * r.width
	case #Circle(c):
		pi * c.radius * c.radius
	}
computeArea(#Circle({radius: 2.0}))                // 12.56636
computeArea(#Point)                                // 0.0
computeArea(#Rectangle({length: 3.0, width: 4.0})) // 12.0
computeArea(#Custom(3.0))                          // 3.0
functions (type func(a1, a2) r)
Functions abstract code over a list of parameters, they are lexically scoped; examples: func(x int, y int) => x*y (type func(x, y int) int. It is very common to declare a function and assign it to an identifier; Reflow provides syntax sugar for this:
func abs(x int) = if x < 0 { -x } else { x }
(type func(int) int — Reflow infers the returned type for us).
blocks
Blocks are a list of declarations and an expression, example:
{
	a := 1*2
	b := "foo"
	c := "bar"
	(a, b, c)
}
(type (int, string, string)) Blocks are useful to combine with functions, e.g.:
func absMul(x, y int) = {
	x := abs(x)
	y := abs(y)
	x*y
}
Blocks (and toplevel declarations) have a "strong update" property: identifier names may be re-bound; the latest lexical binding applies at the usage site.
execs
An exec runs a shell script inside of a Docker container. Execs can refer to data (e.g., files, directories, strings) interpolated from its environment (expressions delimited by {{ and }}); they can return a number of files and directories. The following exec runs echo hello world inside of the ubuntu image, placing its output in a file. It reserves 1 CPU and 100 MiB of memory. While an exec is not limited to the resources it reserves, it is not recommended for an exec to use significantly more resources than it reserves.
exec(image := "ubuntu", cpu := 1, mem := 100*MiB) (out file) {"
	echo hello world >{{out}}
"}
(This exec has type file.) Note that it is possible to allocate fractional cpu (with a mimimum cpu of 0.1) and that the units of disk and memory are bytes. In order to make allocation disk and memory easier, the following constants are provided:

KiB (210 bytes)
MiB (220 bytes)
GiB (230 bytes)
TiB (240 bytes)

Execs can return multiple files or directories. The following defines a function that produces a pair of files from two input files to bedtools:

val image = "..."
// Bins returns a BED resources for a given bin size and a given
// genome with the given bin size.
func Bins(alignmentGenome, genomeSizes file, binSize int) (bed, nuc file) =
	exec(image, mem := GiB, cpu := 1) (bed, nuc file) {"
		bedtools makewindows -g {{genomeSizes}} -w {{binSize}} > {{bed}}
		bedtools nuc -fi {{alignmentGenome}} -bed {{bed}} > {{nuc}}
	"}
Execs provide a shortcut syntax: exec(image, ..) is syntax sugar for exec(image := image, ..).
pattern matching
Reflow supports pattern matching in assignments and list comprehensions (see below). Pattern matching syntax is complementary to the syntax for constructing literals, so for example, the list [1,2,3] can be pattern matched as val [a, b, c] = [1, 2, 3], resulting in a := 1; b := 2; c := 3. Elements can be ignored with _ and patterns can be nested. For example,
val tup = (1, {r: ("a", 1)}, [1, 2, 3])
val (_, {r: (a, one)}, [first, _, third]) = tup
results in a := "a"; one := 1; first := 1; third := 3. When matching lists, your pattern must be the same length as the list. For example,
val [a, b] = ["a", "b"] // ok
val [a, b] = ["a", "b", "c"] // will fail with an error
val [a, b] = ["a"] // will fail with an error
Use ...x to match the remainder of the list. For example,
val lst = ["a", "b", "c", "d", "e"]
val [a, b, ...cde] = lst
results in a := "a"; b := "b"; cde := ["c", "d", "e"]. If you want to ignore the tail of the list, use ... (no identifier),
val lst = ["a", "b", "c", "d", "e"]
val [a, b, ...] = lst
Use pattern matching to extract values from variants:
type Message string
type Excuse string
type YesNo #Yes(Message) | #No(Excuse)
val decision YesNo = #No("just because")
val #No(reason) = decision // reason == "just because"
val #Yes(excuse) = decision // ERROR: cannot match tag #Yes with a variant with tag #No
comprehensions
Comprehensions provide list, map, and dir processing functionality. A comprehension ranges over a list, map or dir and applies an expression to each entry, returning a list of the results. Comprehensions also perform inline pattern matching. The following example will apply absMul to each pair of integers:
val integers = [(1, 2), (-4, 3), (1, 1)]
val multiplied = [absMul(x, y) | (x, y) <- integers]
Each pair of integers are multiplied together, producing [2, 12, 1].

Dir comprehensions yield (path, file) pairs. For example, after running:

d := dir("/home/notes")
result := [ strings.Join([path, "has", strings.FromInt(len(f)), "bytes"], " ") | (path, f) <- d, if strings.HasSuffix(path, ".txt") ]
result contains a list of strings based on the files in /home/notes with the ".txt" extension, such as:
["foo.txt has 725 bytes", "bar.txt has 89 bytes", "zoo.txt has 7061 bytes"]

Comprehensions can range over multiple lists (or maps or dirs), computing the cross product of these. For example

val integers = [1,2,3,4]
val chars = ["a", "b", "c"]
val cross = [(i, c) | i <- integers, c <- chars]
computes the cross product [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"), (2, "c"), (3, "a"), (3, "b"), (3, "c"), (4, "a"), (4, "b"), (4, "c")].

Finally, comprehensions can apply filters. As an example, the following computes the cross product only for even integers:

val evenCross = [(i, c) | i <- integers, if i%2 == 0, c <- chars]
resulting in [(2, "a"), (2, "b"), (2, "c"), (4, "a"), (4, "b"), (4, "c")].

Operators

Sum +

The sum operator can be used with list, dir, map, int, float, string types. When summing a map or a dir, if the same key exists in the LHS and RHS, the RHS is picked. e.g.:

    a := ["a": 2, "b": 1]
    b := ["a": 3]
    Main := a + b

Main is ["a", 3, b: "1"].

Other arithmetic operators -, *, /

Subtract, multiply and division operators can be used with int and float types.

modulo %

Modulo operator is valid only for int types.

Negation -

Negation operation is valid for int and float types.

Comparison

  • ==, != Equality/inequality operators can be used with list, map, tuple, record, dir, string, int and float types. When comparing files, the digests of the files are compared. When comparing dirs, the names and digests of the files in the directories are compared.
  • <, >, <=, >= Comparison operators can be used with int, float and string types.

Bitwise shift <<, >>

Bit shift operators can be used with int type.

Logical ||, &&, !

Logical operators can be used with boolean expressions.

Builtins

Reflow provides a number of builtin functions. They provide special semantics or typing that cannot be expressed in core Reflow.

Data structure length: len

Builtin len evaluates to the length of the underlying data structure:

  • len of a file evaluates to the size of the file in bytes;
  • len of a directory returns the number of entries in the directory;
  • len of a list returns the number of entries in the list;
  • len of a map returns the number of entries in the map;
  • len of a string returns the length of a string.

Numeric coercion: int, float

Builtins int and float coerces floating point numbers to integers and vice versa.

Tupling and untupling: zip, unzip

Builtin zip takes two lists and returns a new list of tuples consisting of elements from each. Builtin unzip is zip's inverse: given a list of tuples, it returns two lists of elements, each from a different tuple position.

Flattening lists: flatten

Builtin flatten concatenates all the elements in a list of lists.

Map coercion: map

Builtin map converts its argument to a map. Given a list of tuples, a map is constructed by taking the first element of the tuple as a key and the second element as values. Given a directory, a map is constructed using the directory's paths as keys with its files as values.

List coercion: list

Builtin list converts its argument to a list. Given a map, list returns a list of tuples, where the first element of the tuple are the map's keys and the second its values. Given a directory, list returns a list of tuples of paths and files.

Reduce/fold a list: reduce,fold

Builtin reduce reduces a list given a function by repeatedly calling the function with one element of the list at a time, left to right.

`reduce(func(type, type) type, [type]) type`

reduce panics if the list is empty.

a := reduce(func(i, j int) => i + j, [1, 2, 3, 4])
b := reduce(func(i, j {a int}) => if i.a > j.a { {a: i.a} } else { {a: j.a} }, [{a: 2}, {a:7}, {a: 1}])

Builtin fold left folds the list with an initial value and a supplied function. It repeatedly calls the function with one element of the list at a time and uses the result as the accumulated value of the next function call.

`fold(func(typeA, typeB) typeA, list [typeB], init typeA) typeA`

typeA and typeB can be the same. Fold returns the initial value if the list is empty.

a := fold(func(i, j int) => i + j, [1, 2, 3], 0)
b := fold(func(i, j {b int}) => {b: i.b + j.b}, [{b:1}, {b:1}], {b:0})
c := fold(func(i {b int}, j int) => {b: i.b + j}, [1, 2, 3], {b:0})
d := fold(func(i, j int) => if i >= j { i } else { j }, [], 0)

Runtime errors: panic, error

Builtins panic and error are used to indicate a runtime error; panic halts execution with a (string) message and error includes an error code (int) along with the message.

if foo == 0 {
    error(123, "foo should not equal 0")	
} else {
    "some value"	
}
panic("should never reach here")

Sequencing: ~>

The expression

e1 ~> e2

forces expression e1 to be evaluated before e2. Since Reflow has lazy evaluation semantics (see Evaluation Semantics below), evaluation is usually sequenced only when there is a data dependency between the two expressions e1 and e2. The ~> operator allows the user to override this behavior by introducing a new control dependency. This can be useful when there are external dependencies between e1 and e2 that are not witnessed by their data dependencies.

Trace debugging: trace

Builtin trace forces its argument, pretty-prints it to standard error along with the source-code position of the trace expression; the value is then returned. Trace is useful for debugging: for example, the following prints the size of a file as it is evaluated.

val f = exec(image, cpu, mem) (out file) {"
	some_complicated_command >{{out}}
"}

val g = {
	val input = trace(f)
	exec(...) {" {{ input}} "}
}

Ranging: range

Builtin range produces a list of integers in the provided range (exclusive). For example, the following expressions are equivalent:

range(0, 5) == [0, 1, 2, 3, 4]
range(0, 1) == [0]
range(0, 0) == []

Modules

Every Reflow (".rf") file is a module. Modules are reusable components that can be parameterized and then instantiated by other modules. Once instantiated, a module is a value like any other, and may be passed around (even to instantiate yet other modules).

Toplevel value bindings within a module are exported if they begin with a capital letter, otherwise they remain private to the module.

A Reflow module must declare any parameters before regular declarations commence. Parameters may have default values, and, if they do, they may be omitted when instantiated.

// samples specifies the sample name to be processed.
param sample string
// assay specifies the assay to apply to the sample.
param assay string
// mapq specifies the mapq filter threshold to use.
param mapq = 60

// Declarations begin here.

Params may be grouped together; the above is equivalent to:

param (
	// samples specifies the sample name to be processed.
	sample string
	// assay specifies the assay to apply to the sample.
	assay string
	// mapq specifies the mapq filter threshold to use.
	mapq = 60
)

// Declarations begin here.

Parameters may then be used like any other value in the module. Params can also depend on each other. For example:

param (
	sample string
	filename = sample + ".zip"
)

We use make to instantiate a module. Make takes the module's path as its first argument, and then a set of declarations to specify arguments. For example:

val processor = make("./processor.rf", sample := "SAMPLE_123", assay := "assay1")

As with records and execs, make implements a syntactic shortcut when the identifier name matches the parameter name, e.g.:

val sample = ...
val assay = ...
val processor = make("./processor.rf", sample, assay)

Reflow provides a number of system modules; they begin with $/. They are: $/test, $/dirs, $/files, $/regexp, $/strings, and $/path. Reflow module documentation may be inspected with the command reflow doc module.

If a module defines the identifier Main, it can be invoked by reflow run. reflow run can instantiate such modules using command line flags, and they can be queried by reflow run module -help, for example:

% cat main.rf
param (
	a = "ok"
	b string
	c int
)

val Main = (a, b, c)

% reflow run main.rf -help
usage of main.rf:
  -a string
    	 (default "ok")
  -b string
    	(required)
  -c int
    	(required)
% reflow run main.rf -b hello -c 123
("ok", "hello", 123)
% reflow run main.rf -b hello -c 123 -a notok
("notok", "hello", 123)
%

Evaluation semantics

Reflow evaluates its programs similar to dataflow languages: evaluation is sequenced only where there exists a data dependency. This means that, for example, in a block like:

val result = {
	x := exec(...) ...
	y := exec(...) ...
	z := exec(...) ...
	(x, y, z)
}

x, y, and z are evaluated in parallel. Similarly, list comprehensions are also evaluated in parallel.

Reflow programs are also evaluated by need, that is to say, lazily. This means that an expression is never evaluated unless it is required by the results sought by the toplevel invocation. Lazy evaluation holds everywhere in Reflow: for example, if you compute a list of expensive results, but access only the first item, the remaining ones are never computed. See the wikipedia page for more details on lazy evaluation.

The combination of these — dataflow parallelism and lazy evaluation means that most Reflow programs can be written in a "sequential" manner, and Reflow is able to parallelize the program to the extent allowable by data dependencies, skipping needless computation altogether.

Sometimes, it's useful to force ordering, for example because there exists a dependency that is invisible to Reflow. Reflow thus provides an escape hatch, the ~> operator, that forces sequencing of two computations that otherwise would not be. The following example forces a copy to occur before invoking an exec:

val files = make("$/files")
val copy = files.Copy(file("s3://grail-marius/source"), "s3://grail-marius/destination")

val Main = copy ~> exec(image := "ubuntu") (out file) {"
	# do something that assumes s3://grail-marius/destination exists
"}