Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can radical.entk add support for conditional scheduling/execution of tasks/pipelines #632

Open
GKNB opened this issue Jan 17, 2023 · 3 comments

Comments

@GKNB
Copy link

GKNB commented Jan 17, 2023

One common workflow pattern in ML is, we have multiple works, and each work consists of three stages: data generation, training, and data analysis. It is natural to submit each work as a different pipeline so that they can run asynchronously. However, in adaptive learning, different works could have dependencies. For example, the data generation stage of work_2 might depend on data generation or training stage of work_1, while work_1 and work_2 are not completely serial (i.e., data generation stage of work_2 does not depend on data analysis stage of work_1).

My current solution to this problem is, I first create n pipelines, where n equals the number of works. Next, for all pipelines except the first one, I insert a "monitoring stage" which includes a "monitoring task" at the beginning. This task monitors whether the required stages in the last pipeline have finished. If those required stages are finished, this task will quit so that the pipeline can continue to do its actual work.

However, this implementation has a problem of resource wasting. In my current implementation, we need n-1 "monitoring task", and the one in pipeline_k is a simple bash script that looks for a "signal file" created by pipeline_{k-1}, if the pipeline_{k-1} has generated all required data for pipeline_k. This monitoring task, like all other tasks, will consume resources, and each monitoring task will need one cpu core. As the number of pipelines increases, it will consume more resources. A more serious problem is that it might introduce deadlock. This is because entk does not guarantee the order of launching different pipelines. In the extreme case where the number of pipelines exceeds the number of cores, we might be in a scenario where all resources are used by monitoring tasks in pipeline index 2 to n, so that the first real work in pipeline_1 can not start, suggesting that we have a deadlock.

If we are allowed to have conditional scheduling/execution of tasks/pipelines, then we don't need to introduce monitoring tasks anymore. For example, we can do something similar to CUDA stream like task.post_exec_create_event(event_name) which creates an event when a task is finished, and pipeline.wait_event(event_name) which means a pipeline will wait for an event to show up in order to continue to the next state, then we can also solve this issue.

@andre-merzky
Copy link
Member

I agree that the described approach (waiting tasks which block the pipeline until some synchronization event occurs) is inelegant, inefficient and wasteful. But currently RE does indeed not provide any mechanism to express the required functionality.

Implementing this functionality in RE is, however, quite quite possible but also quite difficult. At the moment RE has exactly one point of dependency resolution: all tasks of stage_n of a specific pipeline have to be completed before any task of stage_{n+1} of the same pipeline will be eligible to run. Expanding that to stage dependencies across pipelines would require significant changes to

  • the API (how to express that dependency)
  • internal data structures (how to communicate that information to the wf_processor component)
  • RE wf orchestrator (to enact the whole thing)

Is this missing functionality blocking any use case at the moment?
If not, is there a use case upcoming in the near of far future which relies on this functionality?

@mtitov
Copy link
Contributor

mtitov commented Jan 20, 2023

As we discussed during today's sprint: another option is to consider method stage.post_exec (link) - extend its functionality to trigger certain actions (including starting pipelines)

@mturilli mturilli changed the title Can radical.entk adds support for conditional scheduling/execution of tasks/pipelines Can radical.entk add support for conditional scheduling/execution of tasks/pipelines Jan 23, 2023
@mturilli
Copy link
Contributor

@GKNB was this discussion good enough? Can we close this ticket and, if you will need to use the proposed method with stage.post_exec you will open a new ticket?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants