
provenance #54

Open
mdrio opened this issue Feb 17, 2021 · 3 comments

mdrio commented Feb 17, 2021

Is your feature request related to a problem? Please describe.
Hi, I need to track the provenance of artifacts produced by workflows.

Describe the solution you'd like
The workflow report contains only information about the outputs; it would be great to also associate the related CWL file and its inputs. Is there an easy way to obtain that?

Thanks

@mdrio mdrio added the Feature label Feb 17, 2021

michael-kotliar commented Mar 15, 2021

Hi @mdrio, sorry for the late reply. All the information about the workflow execution is stored in the Airflow metadata database, which you can access from the Airflow UI or directly. We use XCom to store the location of the JSON file with the outputs of each step. Also, if you add the following section to your airflow.cfg file

[cwl]
keep_tmp_data = true

the system won't delete any temporary data between the steps.
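As a quick sanity check, the `[cwl]` setting above can be read back with Python's standard `configparser` (a generic sketch for verifying your config file, not CWL-Airflow's own loader):

```python
import configparser

# Parse an airflow.cfg-style file and read back the [cwl] section shown above.
cfg = configparser.ConfigParser()
cfg.read_string("""
[cwl]
keep_tmp_data = true
""")

# getboolean handles true/false, yes/no, 1/0 spellings uniformly.
keep_tmp = cfg.getboolean("cwl", "keep_tmp_data", fallback=False)
print(keep_tmp)  # True when the option is set as above
```

In a real deployment you would point `cfg.read()` at the actual airflow.cfg path instead of using `read_string`.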
In the context of Airflow, the workflow is identified by the DAG ID: we assume that for each new workflow you will add a separate DAG. Each specific workflow execution is a DagRun, and for each DagRun you can get the parameters it was triggered with; those parameters are available through the Airflow UI as the configuration for the DagRun. We also report workflow execution statistics, such as the time spent on each step and the disk usage for temporary and output files. Let me know if you need any additional information about it.
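To tie an execution back to its workflow and inputs, one could assemble a small provenance record from the pieces described above. This is purely illustrative: the field names and the `provenance_record` helper are assumptions, not CWL-Airflow API.

```python
# Illustrative provenance record combining the DAG ID (one per CWL workflow),
# the DagRun ID (one per execution), the conf the run was triggered with
# (the workflow inputs), and the collected outputs (e.g. from the per-step
# JSON files referenced via XCom).
def provenance_record(dag_id, run_id, conf, outputs):
    return {
        "workflow": dag_id,    # identifies the CWL workflow (DAG)
        "execution": run_id,   # identifies the specific run (DagRun)
        "inputs": conf,        # trigger parameters (DagRun configuration)
        "outputs": outputs,    # produced artifacts
    }

record = provenance_record(
    "my_cwl_workflow",                    # hypothetical DAG ID
    "manual__2021-02-17T00:00:00",        # hypothetical DagRun ID
    {"fastq": "sample.fastq"},            # hypothetical inputs
    {"bam": "sample.bam"},                # hypothetical outputs
)
```

The actual values would come from the Airflow metadata database or the UI, as described above.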


mdrio commented May 11, 2021

Hi @michael-kotliar, thanks for the reply. How does the statistics gathering work? I see that a connection has to be created in order to receive the data, but which service or API is expected to be called? Is the workflow report also included in the data POSTed?

@michael-kotliar

Hi @mdrio,
We POST all collected statistics as part of the progress report. The POST is triggered from the Task's or DAG's on_success/on_failure callbacks, so you don't need to have the CWL-Airflow API running.

The endpoints are defined as

CONN_ID = "process_report"
ROUTES = {
    "progress": "airflow/progress",
    "results":  "airflow/results",
    "status":   "airflow/status"
}

where the process_report connection should be created in Airflow.
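For illustration, the routes above combine with the connection's base URL roughly like this (the base URL used here is an assumption; in practice it comes from the host configured on the process_report connection in Airflow):

```python
# Route table from the snippet above.
ROUTES = {
    "progress": "airflow/progress",
    "results":  "airflow/results",
    "status":   "airflow/status",
}

def report_url(base, kind):
    """Join the connection's base URL with one of the report routes."""
    return f"{base.rstrip('/')}/{ROUTES[kind]}"

print(report_url("http://localhost:8080/", "progress"))
# http://localhost:8080/airflow/progress
```

Your receiving service just needs to accept POSTed JSON at these paths; no CWL-Airflow component has to be listening.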

Please see more details here:
https://cwl-airflow.readthedocs.io/en/latest/readme/how_to_use.html#posting-pipeline-execution-progress-statistics-and-results

Let me know if it helps.
