
Keep individual timeseries files #228

Merged: 7 commits from keep_intermediate_files into develop, May 4, 2021
Conversation

@nmerket (Member) commented May 3, 2021:

Fixes #182.

Pull Request Description

Moves the eagle.postprocessing.keep_intermediate_files key to postprocessing.keep_individual_timeseries and changes the behavior to keep only the timeseries parquet files. Also removes the deprecated aggregate_timeseries key, since that aggregation always happens.
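As a hedged illustration of the key migration described above (the key names come from this PR; the surrounding YAML structure and values are assumed for the example):

```yaml
# Before this PR (deprecated):
eagle:
  postprocessing:
    keep_intermediate_files: true   # kept all intermediate files
aggregate_timeseries: true          # removed; aggregation always happens now

# After this PR:
postprocessing:
  keep_individual_timeseries: true  # keeps only the timeseries parquet files
```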

Checklist

Not all items may apply.

  • Code changes (must work)
  • Tests exercising your feature/bug fix (check coverage report on CircleCI build -> Artifacts)
  • All other unit tests passing
  • Update validation for project config yaml file changes
  • Update existing documentation
  • Run a small batch run to make sure it all works (local is fine, unless an Eagle specific feature)
  • Add to the changelog_dev.rst file and propose migration text in the pull request

@nmerket nmerket requested review from asparke2 and rHorsey May 3, 2021 20:10
@rHorsey (Contributor) left a comment:


LGTM; one question that is outside the scope of this PR but still relevant to the key function at issue. Happy to approve @nmerket if this is a priority to get merged!

results_job_json_glob = f'{sim_output_dir}/results_job*.json.gz'
logger.info('Removing temporary files')
fs.rm(ts_in_dir, recursive=True)
logger.info('Removing results_job*.json.gz')
for filename in fs.glob(results_job_json_glob):
A Contributor commented:

@nmerket In the past this function would run even if there was an uncaught error in the metadata results aggregation, forcing the entire analysis to be rerun. Is it still possible for this code to be reached if that dask cluster job doesn't complete successfully?

@nmerket (Member, Author) replied:
It's possible. I didn't do anything in this PR to mitigate that possibility. In general, this should run only after the results.csv aggregation is complete. Could you provide a minimal reproducible example of this issue?
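A minimal sketch of the kind of guard discussed here: cleanup runs only if the aggregation step succeeds. The names `postprocess` and the `aggregate` callable are hypothetical, not from the buildstockbatch codebase; `fs` stands in for the fsspec-style filesystem object used in the diff above.

```python
import logging

logger = logging.getLogger(__name__)


def postprocess(fs, sim_output_dir, ts_in_dir, aggregate,
                keep_individual_timeseries=False):
    # Run the results.csv aggregation first; if it raises, stop here so the
    # intermediate files survive and the analysis can be resumed.
    try:
        aggregate()
    except Exception:
        logger.exception('Aggregation failed; keeping intermediate files')
        raise

    # Cleanup is reached only after aggregation completed successfully.
    if not keep_individual_timeseries:
        logger.info('Removing temporary timeseries files')
        fs.rm(ts_in_dir, recursive=True)
    logger.info('Removing results_job*.json.gz')
    for filename in fs.glob(f'{sim_output_dir}/results_job*.json.gz'):
        fs.rm(filename)
```

Passing the aggregation step in as a callable makes the ordering explicit and easy to test, but this is only one way to express the guard, not how the PR resolves it.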

@nmerket (Member, Author) added:
And the scope of this PR is to leave the timeseries files on disk, not the results jsons.

@nmerket nmerket merged commit 54b84bf into develop May 4, 2021
@nmerket nmerket deleted the keep_intermediate_files branch May 4, 2021 17:03
Closes issue: Switch for automatic deletion of time series parquet files (#182)