Dynamic Task/ParallelTask/Pipeline #101

Open
leo-schick opened this issue Apr 11, 2023 · 0 comments

leo-schick commented Apr 11, 2023

Currently, the data pipeline DAG is fixed at compilation time and supports only limited dynamic behavior. For example, the task ParallelReadFile reads a set of files whose number is unknown at compilation time.

I would like to have similar dynamics in other areas as well:

Dynamic nodes

The following dynamic nodes could be implemented:

Dynamic tasks

An option to pass the Task a Python function that is executed at pipeline runtime and returns a list of commands to execute in order.
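A minimal sketch of what this could look like. `SketchTask`, `resolve_commands` and `export_commands` are hypothetical stand-ins written for this issue, not the project's actual classes, and commands are plain strings to keep the example self-contained. The property name `is_dynamic_commands` is borrowed from the referenced commit.

```python
from typing import Callable, List, Union

Commands = Union[List[str], Callable[[], List[str]]]


class SketchTask:
    """Hypothetical stand-in for a Task that accepts static or dynamic commands."""

    def __init__(self, id: str, commands: Commands):
        self.id = id
        self._commands = commands

    @property
    def is_dynamic_commands(self) -> bool:
        # True when the commands are produced by a function at runtime
        return callable(self._commands)

    def resolve_commands(self) -> List[str]:
        # Call the function only now, i.e. at pipeline runtime
        if self.is_dynamic_commands:
            return list(self._commands())
        return list(self._commands)

    def run(self) -> List[str]:
        executed = []
        for command in self.resolve_commands():
            # A real implementation would execute the command here
            executed.append(command)
        return executed


def export_commands() -> List[str]:
    # Assumption: the table names would come from a database lookup;
    # they are hard-coded here for illustration
    tables = ["customers", "orders"]
    return [f"export {table}" for table in tables]


task = SketchTask("export_tables", export_commands)
```

Calling `task.run()` invokes `export_commands` and then processes the returned commands in list order.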

Dynamic parallel tasks

An option to pass the ParallelTask a Python function that is executed at pipeline runtime and returns a list of commands / command chains to be executed in parallel.
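As a rough illustration, assuming a command chain is just a list of strings and using a thread pool in place of the real pipeline executor: `run_chain`, `run_dynamic_parallel` and `table_chains` are hypothetical names, not part of the project.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

CommandChain = List[str]


def run_chain(chain: CommandChain) -> List[str]:
    # Commands inside one chain run sequentially, in order.
    # A real implementation would execute each command here.
    return [f"done: {command}" for command in chain]


def run_dynamic_parallel(
    chains_fn: Callable[[], List[CommandChain]], max_workers: int = 4
) -> List[List[str]]:
    # The function is called at runtime to discover the chains
    chains = chains_fn()
    # Each chain is submitted as one unit of parallel work
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_chain, chains))


def table_chains() -> List[CommandChain]:
    # Assumption: hard-coded table names stand in for a runtime lookup
    return [[f"truncate {t}", f"load {t}"] for t in ["customers", "orders"]]


results = run_dynamic_parallel(table_chains)
```

The chains run concurrently, while the commands within each chain keep their order.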

Dynamic pipeline

An option to define a DynamicPipeline whose nodes are defined within a Python function that is executed at pipeline runtime.
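One possible shape, sketched under the assumption that nodes can be reduced to ids with upstream dependencies. `SketchDynamicPipeline`, `add` and `define_nodes` are hypothetical; the ordering uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+) rather than the project's own scheduler.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Iterable, List, Set


class SketchDynamicPipeline:
    """Hypothetical pipeline whose nodes are defined by a function at runtime."""

    def __init__(self, id: str, define_nodes: Callable[["SketchDynamicPipeline"], None]):
        self.id = id
        self._define_nodes = define_nodes
        self.nodes: Dict[str, Set[str]] = {}  # node id -> upstream node ids

    def add(self, node_id: str, upstreams: Iterable[str] = ()) -> None:
        self.nodes[node_id] = set(upstreams)

    def run(self) -> List[str]:
        # Build the node graph only now, at pipeline runtime
        self.nodes = {}
        self._define_nodes(self)
        # Return an execution order that respects the upstream dependencies
        return list(TopologicalSorter(self.nodes).static_order())


def define_nodes(pipeline: SketchDynamicPipeline) -> None:
    # Assumption: the table list would be discovered at runtime
    tables = ["customers", "orders"]
    pipeline.add("create_schema")
    for table in tables:
        pipeline.add(f"load_{table}", upstreams=["create_schema"])
    pipeline.add("finalize", upstreams=[f"load_{t}" for t in tables])


pipeline = SketchDynamicPipeline("load_all_tables", define_nodes)
order = pipeline.run()
```

Before `run()` is called, `pipeline.nodes` is empty, so nothing in the definition depends on the tables being known up front.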

Implement UI awareness

The dynamic node objects (Task/ParallelTask/Pipeline) must be defined so that the Python function that defines the actual commands/tasks/nodes is not run when interacting with the UI.
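A small sketch of the intended behavior, assuming the UI only needs a placeholder for dynamic content. `LazyCommands`, `describe_for_ui` and `resolve` are hypothetical names used for illustration only.

```python
from typing import Callable, List


class LazyCommands:
    """Hypothetical wrapper that defers a commands function until runtime."""

    def __init__(self, commands_fn: Callable[[], List[str]]):
        self._commands_fn = commands_fn

    def describe_for_ui(self) -> str:
        # The UI path never calls the function; it shows a placeholder instead
        return "<commands resolved at runtime>"

    def resolve(self) -> List[str]:
        # Only the execution path calls the function
        return list(self._commands_fn())


calls: List[str] = []


def list_commands() -> List[str]:
    calls.append("called")  # records each invocation for the example
    return ["export customers"]


node = LazyCommands(list_commands)
label = node.describe_for_ui()
```

Rendering the label leaves `calls` empty; only `node.resolve()` triggers the function.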

Implement node cost handling

These dynamic nodes should be defined so that they register sub-nodes on the dynamic node object. The pipeline execution should then retrieve the node cost from the database when the node has been executed in the past. E.g. a dynamic node could represent an export of a database table. By defining the sub-nodes, the pipeline execution can run the nodes with the highest node cost first to reduce overall execution time.
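The ordering idea can be sketched as follows. A plain dict stands in for the cost table in the database, and giving never-run nodes a default cost of 0 is an assumption made for this example, not a decision stated in this issue. `order_by_cost` is a hypothetical helper.

```python
from typing import Dict, List


def order_by_cost(
    sub_node_ids: List[str],
    historical_cost: Dict[str, float],
    default_cost: float = 0.0,
) -> List[str]:
    # Most expensive sub-nodes first; nodes without history use default_cost.
    # sorted() is stable, so ties keep their original relative order.
    return sorted(
        sub_node_ids,
        key=lambda node_id: historical_cost.get(node_id, default_cost),
        reverse=True,
    )


# Assumption: costs from earlier runs, e.g. run time in seconds
historical_cost = {"export_customers": 5.0, "export_orders": 60.0}
sub_nodes = ["export_customers", "export_orders", "export_new_table"]

ordered = order_by_cost(sub_nodes, historical_cost)
```

Here `export_orders` is scheduled first because it has the highest recorded cost, and `export_new_table`, which has no history, ends up last.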

Example use cases

  • Performing actions against tables in a database (e.g. exporting a table to a data lake). We don't know at compilation time which tables exist in the database.
  • Performing actions against a data lake / lakehouse per table on disk (e.g. connecting the table to our database engine). We don't know at compilation time which tables exist in the data lake / lakehouse.
leo-schick added a commit that referenced this issue Nov 24, 2023
* add callable for Task arg. commands #101

* code review suggestions

* make is_dynamic_commands a getter-property