Document how to create a Workflow Run Crate file #148

Open
kinow opened this issue Feb 23, 2023 · 11 comments

Comments

@kinow
Member

kinow commented Feb 23, 2023

Hi!

I am creating the Autosubmit RO-Crate using ro-crate-py, using COMPSs as reference. It calls add_workflow, add(Person), and other functions that appear in the ro-crate-py README. However, inspecting the JSON file I see no mention of CreateAction.

I noticed this after I tried to validate the rocrate.zip file created with Autosubmit using runcrate.

$ python consume_crate.py /home/kinow/autosubmit/a000/rocrate.zip 
Traceback (most recent call last):
  File "/home/kinow/Development/python/workspace/runcrate/tools/consume_crate/consume_crate.py", line 83, in <module>
    main(parser.parse_args())
  File "/home/kinow/Development/python/workspace/runcrate/tools/consume_crate/consume_crate.py", line 65, in main
    assert len(actions[wf.id]) == 1
KeyError: 'tmp/tmpkqjm8m6e/a000/workflow.yml'
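
The failing line suggests that consume_crate builds a mapping from workflow id to the actions that reference it, so a crate without any CreateAction has no entry for the workflow. A minimal sketch of that kind of check over the raw metadata, using only the standard library; the function name `actions_by_instrument` and the sample graph are illustrative assumptions, not the actual consume_crate code:

```python
from collections import defaultdict


def actions_by_instrument(graph):
    """Group CreateAction entities by the @id of their instrument."""
    grouped = defaultdict(list)
    for entity in graph:
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "CreateAction" not in types:
            continue
        instrument = entity.get("instrument")
        if isinstance(instrument, dict) and "@id" in instrument:
            grouped[instrument["@id"]].append(entity)
    # Return a plain dict so that a missing workflow id raises KeyError
    return dict(grouped)


# Hypothetical graph: a workflow entity, but no CreateAction that uses it
graph = [
    {"@id": "workflow.yml",
     "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"]},
]
actions = actions_by_instrument(graph)
print("workflow.yml" in actions)  # False, so actions["workflow.yml"] would raise KeyError
```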

Is there an easy way to use ro-crate-py to produce an RO-Crate that conforms to Workflow Run Crate profile, and that can be validated with the consume_crate script? cc @simleo

Thanks!
Bruno

@simleo
Collaborator

simleo commented Feb 24, 2023

rocrate.model does not contain a representation of all possible RO-Crate types. It's something we considered at some point but then abandoned; see #89 for details. Person is one of the few exceptions: it has been there since the early days of development (before I started contributing). Other entities that have a representation in Python are ComputationalWorkflow and some others that appear in known profiles. In general, though, to add an entity like CreateAction, which does not have a representation in the model, you need to add a ContextEntity and specify its @type:

from rocrate.model.contextentity import ContextEntity
...
action = crate.add(ContextEntity(crate, properties={
    "@type": "CreateAction",
    "name": "Execution of foo.cwl",
}))
workflow = crate.add_workflow(...)
action["instrument"] = workflow
...
crate.root_dataset["mentions"] = [action]
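
For comparison, the entities this is meant to produce can be sketched with plain dictionaries, which is roughly what writing the CreateAction by hand amounts to. This is a standard-library sketch, not ro-crate-py output: the helper `make_create_action`, the `foo.cwl` id and the timestamps are illustrative assumptions.

```python
import json
import uuid


def make_create_action(workflow_id, name, start_time=None, end_time=None):
    """Build a CreateAction entity that points at the workflow via `instrument`."""
    action = {
        "@id": f"#{uuid.uuid4()}",
        "@type": "CreateAction",
        "name": name,
        "instrument": {"@id": workflow_id},
    }
    # startTime and endTime are optional schema.org Action properties
    if start_time:
        action["startTime"] = start_time
    if end_time:
        action["endTime"] = end_time
    return action


root_dataset = {"@id": "./", "@type": "Dataset", "mainEntity": {"@id": "foo.cwl"}}
action = make_create_action(
    "foo.cwl", "Execution of foo.cwl", "2023-02-17T15:42:31", "2023-02-17T15:43:45"
)
# The root dataset references the action through `mentions`
root_dataset["mentions"] = [{"@id": action["@id"]}]
print(json.dumps([root_dataset, action], indent=2))
```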

Take a look at the runcrate code, it generates a very detailed workflow run ro-crate so it has many examples.

@kinow
Member Author

kinow commented Feb 24, 2023

> Take a look at the runcrate code, it generates a very detailed workflow run ro-crate so it has many examples.

Will do. Thank you @simleo !

@kinow
Member Author

kinow commented Mar 8, 2023

@rsirvent has kindly shared his most up to date code, and I noticed he was manually writing the WRROC CreateAction and properties. I then had a look at streamflow and noticed they were doing the same.

I implemented a similar approach in the Autosubmit merge request, and after that got the consume_crate utility to run fine:

(venv) (autosubmit4) kinow@ranma:~/Development/python/workspace/runcrate/tools/consume_crate$ python consume_crate.py ~/autosubmit/a000/rocrate.zip 
action #d88221a0-ede7-4dad-a478-618d9f53c88e
  instrument: workflow.yml (['File', 'SoftwareSourceCode', 'ComputationalWorkflow'])
  started: 2023-02-17T15:42:31
  ended: 2023-02-17T15:43:45
  inputs:
  outputs:

I think inputs and outputs are empty because Autosubmit workflows are not structured with inputs and outputs. There are inputs in the configuration, but they are not fixed (i.e. users provide YAML files that can contain pretty much anything, plus some pre-defined settings used by Autosubmit), and of course it produces outputs, but they are not controlled in the workflow configuration, nor tracked by the WMS (i.e. tasks may write several files, some temporary, others important for the workflow run, but these are not maintained by Autosubmit).

I can see the log files in my JSON metadata, as well as the workflow graph plot PDF file. So I think I am done with this initial version of RO-Crate support for Autosubmit, compliant with Workflow Run Crate. Since Autosubmit doesn't track the tools executed by the workflow tasks (i.e. we just execute a shell script that may execute one or more executables) I think I won't implement the Provenance Run Crate profile (@simleo you asked that in the last meeting, I believe).

As far as I can tell, ro-crate-py doesn't seem to provide a way to create RO-Crate files in Python that are compatible with specific profiles yet. That was part of my confusion/issue here. Having that, and validation, would be really excellent! As well as a way to have parts of the crate loaded from external files/resources #146

I will finish writing tests and documentation for RO-Crate in Autosubmit, and then start writing the text for the RO-Crate paper 🤓 🥳

Thanks for the help!
Bruno

@simleo
Collaborator

simleo commented Mar 8, 2023

Looking at https://autosubmit.readthedocs.io/en/master/userguide/defining_workflows/index.html, it seems that the workflow refers to shell scripts for the various steps, and from what you've said I guess these scripts can contain arbitrary code, and they are the ones that actually know about input and output files. So the WMS does not know about inputs and outputs, and therefore cannot copy them to the RO-Crate. This means that the RO-Crate would not be much informative except for the log files, workflow diagram and timestamps. Moreover, the absence of object and / or result indicates that the workflow took no inputs and / or outputs, which is not the case.

I guess the inclusion of inputs and outputs could be made to work through some sort of convention. For instance, suppose that users write their scripts so that all input files (and directories) are taken from under a top-level IN_DIR directory, and all output files (and directories) are written under a top-level OUT_DIR directory: autosubmit could allow users following this convention to specify the IN_DIR and OUT_DIR paths, then it would know where to look for inputs and outputs. So users who want a Workflow Run Crate for their runs could follow the convention and pass the relevant values for the input and output paths.
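
A minimal sketch of how a WMS could apply such a convention, assuming it is simply given the two directory paths. The function names and the demo tree are hypothetical and are not part of Autosubmit:

```python
from pathlib import Path
import tempfile


def list_files(top_dir):
    """Return all regular files under top_dir as sorted, relative POSIX paths."""
    top = Path(top_dir)
    return sorted(p.relative_to(top).as_posix() for p in top.rglob("*") if p.is_file())


def collect_inputs_outputs(in_dir, out_dir):
    """Map the two convention directories to the action's object and result lists."""
    return {"object": list_files(in_dir), "result": list_files(out_dir)}


# Demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    in_dir, out_dir = Path(tmp, "IN_DIR"), Path(tmp, "OUT_DIR")
    (in_dir / "grid").mkdir(parents=True)
    out_dir.mkdir()
    (in_dir / "grid" / "abc.nc").write_text("input")
    (out_dir / "result.nc").write_text("output")
    found = collect_inputs_outputs(in_dir, out_dir)

print(found)  # {'object': ['grid/abc.nc'], 'result': ['result.nc']}
```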

@kinow
Member Author

kinow commented Mar 8, 2023

Hi @simleo

> Looking at https://autosubmit.readthedocs.io/en/master/userguide/defining_workflows/index.html, it seems that the workflow refers to shell scripts for the various steps, and from what you've said I guess these scripts can contain arbitrary code, and they are the ones that actually know about input and output files.

Correct.

> So the WMS does not know about inputs and outputs, and therefore cannot copy them to the RO-Crate.

Correct too. I believe this is not exclusive to Autosubmit. With ecFlow or Cylc, you would start a workflow and the tasks could access a NFS partition to fetch data, or maybe a remote service like ECMWF Mars, NOAA, some FTP server, etc. The output may be stored locally, or the workflow may not produce anything (e.g. call a web service posting some data, i.e. not stored locally).

> This means that the RO-Crate would not be much informative except for the log files, workflow diagram and timestamps.

Exactly.

> Moreover, the absence of object and / or result indicates that the workflow took no inputs and / or outputs, which is not the case.

Yes. I hadn't thought that far, and it sounds wrong to me too now.

> I guess the inclusion of inputs and outputs could be made to work through some sort of convention.

That's interesting. I will think about it, and talk with other engineers that work on Autosubmit to check if they have other ideas too.

Thanks!

@kinow
Member Author

kinow commented Mar 17, 2023

I've updated the AS merge request with a check box to implement the inputs and outputs. I will use a list of globs in the Autosubmit config file conf/rocrate.yml (currently used for the license and list of authors, metadata that we do not have in Autosubmit's configuration model), like:

- inputs:
  - 'proj/PROJECT_FOLDER/inputs/namelist1.nml'
  - 'proj/PROJECT_FOLDER/inputs/**/*.xml'
  - ...
- outputs:
  - '/scratch/project_12345/MODEL/ABC/SV/1/200101*\.*.nc'
  - ...
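
Expanding such a list could look roughly like the sketch below, which uses the standard `glob` module with `recursive=True` so that `**` matches nested directories. The helper name `expand_patterns`, the choice to resolve relative patterns against a base directory, and the demo files are assumptions, not Autosubmit code:

```python
import glob
import os
import tempfile


def expand_patterns(patterns, base_dir):
    """Expand glob patterns into a sorted, de-duplicated list of existing files."""
    matches = set()
    for pattern in patterns:
        # Relative patterns are resolved against base_dir; absolute ones are kept
        full = pattern if os.path.isabs(pattern) else os.path.join(base_dir, pattern)
        for path in glob.glob(full, recursive=True):
            if os.path.isfile(path):
                matches.add(path)
    return sorted(matches)


# Patterns mirroring the input list above; the directory tree is a throwaway demo
input_patterns = [
    "proj/PROJECT_FOLDER/inputs/namelist1.nml",
    "proj/PROJECT_FOLDER/inputs/**/*.xml",
]

with tempfile.TemporaryDirectory() as tmp:
    inputs_dir = os.path.join(tmp, "proj", "PROJECT_FOLDER", "inputs")
    os.makedirs(os.path.join(inputs_dir, "ocean"))
    for name in ("namelist1.nml", "notes.txt", os.path.join("ocean", "grid.xml")):
        with open(os.path.join(inputs_dir, name), "w") as handle:
            handle.write("demo")
    matched = [
        os.path.relpath(path, tmp).replace(os.sep, "/")
        for path in expand_patterns(input_patterns, tmp)
    ]

print(matched)  # notes.txt is not listed because no pattern matches it
```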

Maybe I will replace the simple string by an object/map to allow users to choose the mime-type or add more info if needed (can't recall what goes in the inputs/outputs schema).

But this convention suggested by @simleo looks like the simplest solution for WMS's that do not have the feature to track inputs and outputs, like Cylc or ecFlow too. Probably worth adding it to the RO-Crate site/docs, and maybe to ro-crate-py too, so other maintainers of similar WMS's know of this limitation & workaround.

I also pinged my group leader to ask about a public workflow to use as a test and upload to Zenodo/WorkflowHub.eu/etc. 👍

@kinow
Member Author

kinow commented Mar 20, 2023

@simleo, I started working on the inputs and outputs today, but got stuck. Could you confirm whether I have to use the Bioschemas FormalParameter type for all the inputs/outputs? And are there any good examples of how to manually craft a Bioschemas parameter without CWL? I only found CWL files with these FormalParameters, but I got the impression that the name/description/type were being retrieved from the CWL file (ditto for Galaxy, I think?). I'm not sure how to use it for Autosubmit.

Looking at PyCOMPSS, there are three workflows. But none displays inputs nor outputs in the WorkflowHub.eu UI. I opened the one that was most recently updated, with version 2 created 23rd Jan 2023.

It looks like that workflow actually has inputs attached to the ComputationalWorkflow, similarly to what I am trying to do, but for some reason these are not displayed in WorkflowHub.eu? Is the WH displaying only inputs and outputs compatible with CWL/bio* types, perhaps?


Thanks!
Bruno

@simleo
Collaborator

simleo commented Mar 21, 2023

AFAIK, WorkflowHub does not read workflow inputs and outputs from the RO-Crate, but only from the workflow file, for languages it knows (e.g., CWL). Note that many workflows are not uploaded to WorkflowHub as RO-Crates at all (but you can download them as RO-Crates because WorkflowHub generates one for you), so in many cases there would be no RO-Crate to read anyway. The COMPSs workflow you're looking at was uploaded as an RO-Crate, but it lists actual files in the workflow's input and output, rather than formal parameters, which is incorrect. I discussed this with @rsirvent, who said that COMPSs has no info on the formal parameters, so the current version of the workflow run crate does not list them at all (they are not required), and the actual files are listed in the action's object and result, as they should.

So you can also avoid listing formal parameters altogether, and only list actual files in the action's object and result. Perhaps you can consider formal parameters for future versions.
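
In metadata terms, this means File entities referenced from the action's object and result, with no FormalParameter involved. A minimal sketch with plain dictionaries; the helper `attach_files`, the ids and the file paths are hypothetical:

```python
def attach_files(action, root_dataset, input_paths, output_paths):
    """Link files to the action and return the File entities to add to the graph."""
    action["object"] = [{"@id": path} for path in input_paths]
    action["result"] = [{"@id": path} for path in output_paths]
    all_paths = [*input_paths, *output_paths]
    # Data entities are also referenced from the root dataset via hasPart
    parts = root_dataset.setdefault("hasPart", [])
    parts.extend({"@id": path} for path in all_paths)
    return [{"@id": path, "@type": "File"} for path in all_paths]


action = {"@id": "#run-1", "@type": "CreateAction", "instrument": {"@id": "workflow.yml"}}
root_dataset = {"@id": "./", "@type": "Dataset"}
files = attach_files(
    action, root_dataset, ["inputs/namelist1.nml"], ["outputs/result.nc"]
)
print(action["object"], action["result"])
```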

@kinow
Member Author

kinow commented Mar 21, 2023

I will have to read a bit more about formal parameters and take another look at workflowhub/compss/more ro-crate files. But I think I got the right direction to follow here. Thank you @simleo !

@kinow
Member Author

kinow commented Apr 2, 2023

Hi @simleo

Spent some time reading about FAIRDom, Seek, WorkflowHub, BioSchemas, and the formal parameter. It was quite a journey reviewing terms and how things are connected.

> AFAIK, WorkflowHub does not read workflow inputs and outputs from the RO-Crate, but only from the workflow file, for languages it knows (e.g., CWL).

I think you are right. I can see Seek seems to have some code executed only for CWL.

Seems like CWL is the workflow class that enables more features in Seek/WorkflowHub.

> Note that many workflows are not uploaded to WorkflowHub as RO-Crates at all (but you can download them as RO-Crates because WorkflowHub generates one for you), so in many cases there would be no RO-Crate to read anyway.

Noted, I wasn't aware of that. Thanks!

> The COMPSs workflow you're looking at was uploaded as an RO-Crate, but it lists actual files in the workflow's input and output, rather than formal parameters, which is incorrect. I discussed this with @rsirvent, who said that COMPSs has no info on the formal parameters, so the current version of the workflow run crate does not list them at all (they are not required), and the actual files are listed in the action's object and result, as they should.

I think I understand this part now.

> So you can also avoid listing formal parameters altogether,

I believe I will have to do just that. As in the COMPSs case, Autosubmit does not have enough information to create Formal Parameters as defined in the BioSchemas spec — and even if I look at the Autosubmit runtime/saved data, all I can get are probably file and parameter names, without content type, default values, and most of what's available for Formal Parameters. I think it wouldn't make much sense to use that with Autosubmit.

> and only list actual files in the action's object and result.

👍

> Perhaps you can consider formal parameters for future versions.

Definitely. If one day CWL has better support for the kind of workflows produced with Autosubmit/Cylc/ecFlow, I would then investigate integrating CWL into Autosubmit. Then the workflow class used for a tool such as WorkflowHub would be, I think, either Autosubmit but handled in WorkflowHub as CWL (I think it does that for Galaxy, more or less, but there's a galaxy2cwl that's used, I think), or CWL directly.

But that's a little further in the future, I think.

I'm trying to implement the Workflow Run Crate profile, and I think the object & result of the workflow for inputs/outputs are fine - https://www.researchobject.org/workflow-run-crate/requirements & ResearchObject/workflow-run-crate#16. So I will start looking at how to add the inputs and outputs in Autosubmit, using COMPSs as a reference.

My initial idea is to use the same JSON file I am using for the "patch" applied to the Autosubmit configuration, but with something like

{
  "@graph": { ... patch goes here, with license, author, etc },
  "inputs": [
    { "name": "model_input/abc.nc",  "encodingFormat": "application/netcdf", "valueRequired": true, "description": "Input file for the grid data..."},
    { "glob": "extra_input/**/*.tmp", "encodingFormat": "application/text", "valueRequired": false, "description": "Auxiliary, optional files"},
    ...
  ],
  "outputs": [ ... ]
}

The first entry of inputs has a name, and is listing a single file. This could instead be part of the @graph and added normally as with other contextual entities, or here (in case workflow devs prefer to keep inputs together in the patch file).

The second entry has a format that would be useful for inputs & outputs of workflow managers that do not have formal parameters, allowing the WMS to iterate the .inputs and .outputs values, and checking if there's a .name or .glob value. For name just add to RO-Crate, for .glob iterate the results and add to the RO-Crate.
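
The name/glob dispatch described above could be sketched as follows. This is an illustration under assumptions, not the Autosubmit implementation: the helper `resolve_entries` is hypothetical, globs are resolved against a base directory, and only `encodingFormat` and `description` are copied to each File entity (`valueRequired` is a FormalParameter property, so it is left out here).

```python
import glob
import os
import tempfile

COPIED_KEYS = ("encodingFormat", "description")


def resolve_entries(entries, base_dir):
    """Turn patch entries into File entities; `name` is one file, `glob` may be many."""
    files = []
    for entry in entries:
        extra = {key: entry[key] for key in COPIED_KEYS if key in entry}
        if "name" in entry:
            paths = [entry["name"]]
        elif "glob" in entry:
            hits = glob.glob(os.path.join(base_dir, entry["glob"]), recursive=True)
            paths = sorted(
                os.path.relpath(hit, base_dir).replace(os.sep, "/")
                for hit in hits
                if os.path.isfile(hit)
            )
        else:
            raise ValueError("entry needs either a 'name' or a 'glob' key")
        files.extend({"@id": path, "@type": "File", **extra} for path in paths)
    return files


entries = [
    {"name": "model_input/abc.nc", "encodingFormat": "application/netcdf"},
    {"glob": "extra_input/**/*.tmp", "description": "Auxiliary, optional files"},
]

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "extra_input", "x"))
    for rel in ("extra_input/b.tmp", "extra_input/x/a.tmp"):
        with open(os.path.join(tmp, *rel.split("/")), "w") as handle:
            handle.write("demo")
    resolved = resolve_entries(entries, tmp)

print([entity["@id"] for entity in resolved])
```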

I think that way I will have everything needed to create a basic workflow run crate, with inputs & outputs in a similar way as implemented in COMPSs 🤞

Cheers
Bruno
p.s. TIL there's no official IANA media type for netcdf, yet, only a provisional entry - Unidata/netcdf#42
Thank you!

@simleo
Collaborator

simleo commented Apr 6, 2023

> { "name": "model_input/abc.nc",  "encodingFormat": "application/netcdf", "valueRequired": true, "description": "Input file for the grid data..."}

Note that valueRequired applies to instances of FormalParameter, not File.
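
One way to read this: if `valueRequired` is kept at all, it belongs on a separate FormalParameter entity, which the File refers to through `exampleOfWork`. A hedged sketch with plain dictionaries; the helper `split_entry` and the `#grid-input` id are hypothetical, and the parameter would additionally need to be listed in the workflow's `input`:

```python
def split_entry(entry, parameter_id):
    """Split a patch entry into a File entity and a FormalParameter entity."""
    file_entity = {
        "@id": entry["name"],
        "@type": "File",
        # The file is a concrete instance of the formal parameter
        "exampleOfWork": {"@id": parameter_id},
    }
    for key in ("encodingFormat", "description"):
        if key in entry:
            file_entity[key] = entry[key]
    parameter = {
        "@id": parameter_id,
        "@type": "FormalParameter",
        "additionalType": "File",
        "name": parameter_id.lstrip("#"),
    }
    # valueRequired describes the parameter, not the file
    if "valueRequired" in entry:
        parameter["valueRequired"] = entry["valueRequired"]
    return file_entity, parameter


entry = {
    "name": "model_input/abc.nc",
    "encodingFormat": "application/netcdf",
    "valueRequired": True,
}
file_entity, parameter = split_entry(entry, "#grid-input")
print(file_entity)
print(parameter)
```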
