Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Critical bugs with reporting results #209

Open
nsheff opened this issue Feb 17, 2024 · 3 comments
Open

Critical bugs with reporting results #209

nsheff opened this issue Feb 17, 2024 · 3 comments

Comments

@nsheff
Copy link
Member

nsheff commented Feb 17, 2024

I just found 2 critical bugs with pypiper result reporting. Actually it may just be 2 outcomes from the same critical bug.

When you report a result it doesn't overwrite the result if there's already one there. This means there's no way to update something if an incorrect result was entered. In the past, results would be reported as many times as you ran the pipeline. All of these results are recorded. Now, the pipeline only reports the first result. If you re-report that result, it's simply lost.

When I'm re-running a pipeline to correct something, I get these:

These results exist for 'demo4': seqcol_digest
These results exist for 'demo4': Time
These results exist for 'demo4': Success

And they're not updated. So, that's the first problem: reporting of results is limited to a single value now, and multiple reports is not recorded.

The second issue is related to it:

The result is not only not reported to the results.yaml file, it's also not reported to screen! So it doesn't even show up in the log file anymore. Results should be reported to screen every time a result report is requested.

Solutions

The second issue should be easy to solve, just make sure pypiper is always printing the result. I'm not sure if this should happen in pypiper or in pipestat, but somewhere it has to happen.

The first problem is harder to solve, because it's a consequence of the switch from a simple tab-delimited stats.tsv file pypiper used to use, to the pipestat file. In pipestat, results are recorded per sample, not per pipeline run, which means there's going to be issues when you re-run a sample. The best I can think of is this:

  1. In the short term, at the very least, the last reported result should be the one used, not the first reported result, as is currently happening. I guess this is as simple as forcing updates whenever you report an existing result.
  2. In the longer term, pipestat should offer the option to include a history of results, and these should be stored somehow in the file (and database). This may not actually be too hard to implement; just add a 'history' function, and when something is overwritten, just move the old values into the history in a way that is an array, rather than a single value. Then, pipestat could offer a clear history function to remove old stuff, if desired, but otherwise, repeated reports of the same result will simply add to the history.
@donaldcampbelljr
Copy link
Member

I believe the first issue is related to #201
I solved it by adding a force_overwrite flag to both report_result and report_object

@donaldcampbelljr
Copy link
Member

I switched force_overwrite to default to True and confirmed that it will overwrite the results and these interactions are saved in log.md and only the most recent is in the results.yaml file (as expected).

import pypiper

pm = pypiper.PipelineManager(name="TEST_OVERWRITING",outfolder="/home/drc/PythonProjects/pipestat/opt_dependencies",
                             pipestat_record_identifier="pypiperRecordIdentifier1", pipestat_schema="sample_output_schema.yaml", 
                             pipestat_pipeline_type="sample", pipestat_results_file="results.yaml")
print("pm made")
pm.report_result(key="number_of_things", value=100)
print("\ndone reporting a result 1st time")
pm.report_result(key="number_of_things", value=200)
print("\ndone reporting a result 2nd time")
pm.report_result(key="number_of_things", value=300)
print("\ndone reporting a result 3rd time")
pm.complete()

Log excerpt:

pm made

> `number_of_things`	100	_RES_

> `pipestat_created_time`	2024-04-04 14:16:54	_RES_

> `pipestat_modified_time`	2024-04-04 14:16:54	_RES_

done reporting a result 1st time
These results exist for 'pypiperRecordIdentifier1': number_of_things
Overwriting existing results: number_of_things

> `number_of_things`	200	_RES_

> `pipestat_modified_time`	2024-04-04 14:16:54	_RES_

done reporting a result 2nd time
These results exist for 'pypiperRecordIdentifier1': number_of_things
Overwriting existing results: number_of_things

> `number_of_things`	300	_RES_

> `pipestat_modified_time`	2024-04-04 14:16:54	_RES_

done reporting a result 3rd time

Results file:

test_pipe:
  project: {}
  sample:
    pypiperRecordIdentifier1:
      number_of_things: 300
      pipestat_created_time: '2024-04-04 14:16:54'
      pipestat_modified_time: '2024-04-04 14:16:54'

Next steps will be to add history enhancements to Pipestat via pepkit/pipestat#177

@donaldcampbelljr
Copy link
Member

Closed with 0.14.1 release and pipestats 0.9.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants