
Possibility to use the output of a task in a workflow? #21

Open
chrstn-hntschl opened this issue May 19, 2016 · 1 comment

Comments

@chrstn-hntschl

Hi,

I am using sciluigi in classification experiments where I would like to have one task per model trainer. The number of model trainers is determined by a list of categories/labels for which models need to be trained, which is defined in a YAML dataset descriptor file. I would like to be able either to pass a subset of categories as a parameter (easy) or, if no categories are given, to load the dataset descriptor from YAML and extract the full list of categories from it (since this is a lengthy process due to some verification, DatasetProvider is a task in itself, which validates the descriptor and stores a pickled version).
I.e. in my workflow() routine I have something like:

```python
class MyWorkflow(sl.WorkflowTask):

    dataset_path = luigi.Parameter(description="path to the dataset descriptor file")
    categories = TupleParameter(default=(), description="tuple with all category labels for which model files should be trained")

    def workflow(self):
        if not self.categories:
            # FIXME: load categories from dataset_path (using a
            # DatasetProviderTask) and set self.categories accordingly...
            ...
        for c in self.categories:
            ...
            model_trainer = self.new_task('model_trainer_' + c,
                                          ModelTrainer,
                                          trainer_params=...)
```

Any idea on how to solve this?
Many thanks in advance!

@samuell
Member

samuell commented Nov 8, 2016

Hi @chrstn-hntschl!

I don't know how I have managed to miss your issue 😕 ...

Did you solve this?

There is an inherent problem in Luigi: scheduling and running the workflow happen separately, and (as far as I know) you can't really access parameter values during the scheduling phase of the workflow, only during the running phase.

Thus, you can't easily set up the workflow differently based on parameter values; instead you have to rely on information that can be read in by normal Python code during scheduling (in your workflow() method).
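To make that concrete, here is a small sketch (hypothetical, not from the sciluigi API) of resolving the category list with plain Python before any tasks are built, assuming a YAML descriptor with a top-level `categories` list and PyYAML available:

```python
import yaml  # PyYAML, assumed installed

def resolve_categories(categories, dataset_path):
    """Return the explicitly given categories, or fall back to the descriptor.

    Plain Python, so it can run during workflow() scheduling, before any
    Luigi task executes.
    """
    if categories:
        return tuple(categories)
    with open(dataset_path) as fh:
        descriptor = yaml.safe_load(fh)
    return tuple(descriptor["categories"])
```

In workflow() one could then do `self.categories = resolve_categories(self.categories, self.dataset_path)` before the loop that creates the ModelTrainer tasks. Note this skips the validation that the DatasetProvider task performs, so it only works if a plain read of the descriptor is acceptable at scheduling time.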

Luigi has had functionality for dynamic dependencies for some time now, but it only lets you specify upstream tasks dynamically, not downstream tasks, which is what I think is most often needed.
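The dynamic-dependency pattern can be illustrated with a toy example (this is a sketch of the mechanism only, not Luigi's actual scheduler): a task's run() is a generator that yields dependencies it only discovers at runtime, and the driver completes each yielded dependency before resuming the generator:

```python
class Task:
    """Minimal stand-in for a workflow task."""
    def __init__(self, name):
        self.name = name
        self.done = False

    def run(self):
        self.done = True
        yield from ()  # no dynamic dependencies by default

class Aggregate(Task):
    """Task whose dependencies are only known when it runs."""
    def __init__(self, subtask_names):
        super().__init__("aggregate")
        self.subtask_names = subtask_names
        self.children = []

    def run(self):
        # Dependencies discovered at runtime: yield them to the driver.
        for name in self.subtask_names:
            child = Task(name)
            self.children.append(child)
            yield child
        self.done = True

def run_task(task):
    """Drive a task, recursively completing each yielded dependency."""
    for dep in task.run():
        run_task(dep)
```

The key point is that the yielded tasks are *upstream* of the yielding task (they must finish before it can), which matches Luigi's limitation: you cannot use this mechanism to spawn new downstream work.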

This constraint of Luigi's scheduling model is what made us start experimenting with a workflow engine based on the dataflow paradigm instead, SciPipe, where scheduling and execution happen concurrently all the time, which allows these kinds of things.

It is still a bit crude and not yet used in production, but it has quite a few tests and example workflows, and it is the tool we plan to use for our upcoming computational projects in the near future.

Hope these pointers are of any help!
