orchestrator

Package orchestrator provides orchestration and supervision of data pipelines.

These pipelines are made up of inputs and processes; an input is a long running function which listens to events (such as database triggers, or kafka topics, or webhooks, or anything really) and a process is a job (such as a docker container, or webhook dispatcher, or kubernetes job, or anything that runs once) that does something with that event.

By building a system of inputs, processes, and persistence layers it becomes easy to build sophisticated data pipelines.

For instance, consider the ecommerce analytics pipeline where:

A user places an order
A backend microservice of some sort places that order into a kafka topic for back-office processing
An analytics consumer slurps these messages into a datawarehouse somewhere
A write to this datawarehouse is picked up by an input
That input triggers a process which runs a series of validations and enrichement processes
The process then writes the enriched, validated to a different table in the warehouse
That write triggers a different input, which listens for enriched data
That input triggers a process which does some kind of final cleansing for gold standard reporting

In effect, this gives us:

[svc] -> (topic() -> [consumer] -> {data warehouse} -> [input -> [encrichment process]] -> {data warehouse} -> [input -> [reporting process]] -> {reporting system}

Or, if you like, a way of building a lightweight medallion architecture

When should you use this package?

This package is useful for building customised data pipeline orchestrators, and for building customised components (where off the shelf components, such as Databricks, are fiddly or unable to be customised to the same level).

This package is also useful for running pipelines cheaply, or locally- it requires no outside service (unless you write that into your own service), and doesn't need complicated masters/worker configurations or anything else really

When should you not use this package?

This package will not give you the same things that off the shelf tools, such as databricks, will give you. There's no easy way to see DAGs, no simple API for updating configuration (unless you write your own).

This package wont do a lot of what you might need; it exists to serve as the engine of a pipeline tool; you must build the rest yourself.

Sub Packages

example

Readme created from Go doc with goreadme

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github/workflows		.github/workflows
example		example
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
config.go		config.go
config_test.go		config_test.go
dag.go		dag.go
dag_test.go		dag_test.go
doc.go		doc.go
event.go		event.go
event_test.go		event_test.go
go.mod		go.mod
go.sum		go.sum
input.go		input.go
operation.go		operation.go
operation_test.go		operation_test.go
process.go		process.go

License

dapper-data/dapper-orchestrator

Folders and files

Latest commit

History

Repository files navigation

orchestrator

When should you use this package?

When should you not use this package?

Sub Packages

About

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Languages