
[Feature Request]: Bit-shift operator (>>) for function dependencies in workflow deployment #3305

npsables opened this issue Mar 20, 2023 · 0 comments

npsables commented Mar 20, 2023

Feature Type

  • Adding new functionality to MLRun

Problem Description

I currently have training working with MLRun in HCI. I know you've made everything as atomic as possible, which provides a lot of control, but it makes the CI/CD setup very redundant. For example:

First we have to define the functions, each in its own file:

import mlrun

def create_hadoop_table(context: mlrun.MLClientCtx):
    ...

def etl_something(context: mlrun.MLClientCtx):
    ...

def store_1_to_fs(context: mlrun.MLClientCtx):
    ...

def store_2_to_fs(context: mlrun.MLClientCtx):
    ...

# 4 functions for 4 files

then create the workflow in another file, where the function names don't even refer to any of the functions above!

import mlrun
from kfp import dsl

@dsl.pipeline(name="process-vector")
def pipeline():
    create_hadoop_tables_function = mlrun.run_function(
        function="create_missing_hadoop_tables"
    )

    etl_vector = mlrun.run_function(
        function="etl_from_hadoop_file"
    ).after(create_hadoop_tables_function)

    mlrun.run_function(
        function="store_1_to_fs"
    ).after(etl_vector)

    mlrun.run_function(
        function="store_2_to_fs"
    ).after(etl_vector)

then make a series of set_function and set_workflow calls whose names refer back to the pipeline above:

project = mlrun.get_or_create_project(name="common")

project.set_function(
    name="create_missing_hadoop_tables",
    kind="job",
    func="<path-to-create_hadoop_table-file>",  # ---> this is the reference
    handler="create_hadoop_table",
    image=MLRUN_IMAGE
)

project.set_function(
    name="etl_from_hadoop_file",
    kind="job",
    func="<path-to-etl_something-file>",
    handler="etl_something",
    image=MLRUN_IMAGE
)

project.set_function(
    name="store_1_to_fs",
    kind="job",
    func="<path-to-store_1_to_fs-file>",
    handler="store_1_to_fs",
    image=MLRUN_IMAGE
)

project.set_function(
    name="store_2_to_fs",
    kind="job",
    func="<path-to-store_2_to_fs-file>",
    handler="store_2_to_fs",
    image=MLRUN_IMAGE
)

# then declare the workflow
project.set_workflow(
    name='vector-processing-workflow',
    workflow_path="<path-to-pipeline-file>",
    handler="pipeline"
)

project.save()

As you can see, everything keeps repeating. Referencing by name (a string) gives no clue as to whether the functions in the pipeline are connected to each other or not.

Feature Description

We could build a DAG tree (yes, like Apache Airflow). It's elegant and more readable. I believe things would get easier this way:

First 4 files for 4 functions:

import mlrun

def create_hadoop_table(context: mlrun.MLClientCtx):
    ...

def etl_something(context: mlrun.MLClientCtx):
    ...

def store_1_to_fs(context: mlrun.MLClientCtx):
    ...

def store_2_to_fs(context: mlrun.MLClientCtx):
    ...

# 4 functions for 4 files

then one file to deploy:

project = mlrun.get_or_create_project(name="common")

f1 = project.set_function(
    name="create_missing_hadoop_tables",
    kind="job",
    func="<path-to-create_hadoop_table-file>",  # ---> this is the reference
    handler="create_hadoop_table",
    image=MLRUN_IMAGE
)

f2 = project.set_function(
    name="etl_from_hadoop_file",
    kind="job",
    func="<path-to-etl_something-file>",
    handler="etl_something",
    image=MLRUN_IMAGE
)

f3 = project.set_function(
    name="store_1_to_fs",
    kind="job",
    func="<path-to-store_1_to_fs-file>",
    handler="store_1_to_fs",
    image=MLRUN_IMAGE
)

f4 = project.set_function(
    name="store_2_to_fs",
    kind="job",
    func="<path-to-store_2_to_fs-file>",
    handler="store_2_to_fs",
    image=MLRUN_IMAGE
)

# then declare workflow 
f1 >> f2 >> [f3, f4]  
# this will render the workflow

project.save()
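
For reference, here is a rough sketch of how the operator could be implemented. DagNode and render() are hypothetical names I made up for illustration; nothing below exists in MLRun today:

# Hypothetical sketch -- DagNode and render() do not exist in MLRun.
# The idea: set_function returns (or is wrapped in) a node that records
# edges via >>, and render() turns the recorded graph into the
# run_function(...).after(...) calls the pipeline already uses.
import mlrun

class DagNode:
    def __init__(self, name):
        self.name = name
        self.downstream = []  # nodes that must run after this one

    def __rshift__(self, other):
        # supports both `a >> b` and `a >> [b, c]`
        targets = other if isinstance(other, list) else [other]
        self.downstream.extend(targets)
        return other  # returning `other` lets `a >> b >> c` chain

def render(root):
    # meant to be called inside a @dsl.pipeline function, so the
    # steps it creates become the workflow
    nodes, stack = set(), [root]
    while stack:  # collect every node reachable from the root
        node = stack.pop()
        if node not in nodes:
            nodes.add(node)
            stack.extend(node.downstream)

    # one pipeline step per node, then wire the recorded edges
    steps = {node: mlrun.run_function(function=node.name) for node in nodes}
    for node in nodes:
        for child in node.downstream:
            steps[child].after(steps[node])

Chaining onward from a list (e.g. [f3, f4] >> f5) would additionally need __rrshift__, which is how Airflow handles the same case.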

Alternative Solutions

The Apache Airflow repo should be a good example.
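
For comparison, this is how the same graph reads in Airflow (assuming a recent 2.x release; EmptyOperator needs 2.3+ and the schedule argument 2.4+):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# the same four steps as above, wired with >> instead of .after()
with DAG(dag_id="process_vector", start_date=datetime(2023, 3, 20), schedule=None) as dag:
    create_tables = EmptyOperator(task_id="create_missing_hadoop_tables")
    etl = EmptyOperator(task_id="etl_from_hadoop_file")
    store_1 = EmptyOperator(task_id="store_1_to_fs")
    store_2 = EmptyOperator(task_id="store_2_to_fs")

    create_tables >> etl >> [store_1, store_2]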

Additional Context

Do you think this is a good idea? I can open a PR.
