
Memory efficient results.csv creation #258

Merged
merged 7 commits into from Jan 27, 2022

Conversation

rajeee
Contributor

@rajeee rajeee commented Jan 21, 2022

Potentially fixes #253.

Pull Request Description

Instead of loading all the results_jobx.json.gz files into memory at once, use dask to load only the results for one upgrade at a time. This should prevent out-of-memory errors when dealing with large runs with many upgrades.
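The per-upgrade approach described above can be sketched roughly as follows. This is a minimal stdlib-only illustration of the memory pattern, not the PR's actual implementation (which uses dask dataframes); the directory layout and column names here are hypothetical:

```python
import csv
import gzip
import json
import tempfile
from pathlib import Path

# Hypothetical layout: one directory per upgrade, each holding
# gzipped JSON result files (results_job0.json.gz, ...).
root = Path(tempfile.mkdtemp())
for upgrade in range(2):
    d = root / f"up{upgrade:02d}"
    d.mkdir()
    for job in range(3):
        rows = [
            {"building_id": job * 10 + i, "upgrade": upgrade, "energy_kwh": 100.0 + i}
            for i in range(2)
        ]
        with gzip.open(d / f"results_job{job}.json.gz", "wt") as f:
            json.dump(rows, f)

# Memory-efficient aggregation: process one upgrade at a time instead of
# loading every results_job*.json.gz file into memory simultaneously.
for upgrade_dir in sorted(root.iterdir()):
    rows = []
    for path in sorted(upgrade_dir.glob("results_job*.json.gz")):
        with gzip.open(path, "rt") as f:
            rows.extend(json.load(f))  # only this upgrade's rows are held in memory
    with open(upgrade_dir / "results.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    print(f"{upgrade_dir.name}: wrote {len(rows)} rows")
```

The key point is the peak memory bound: at any moment only a single upgrade's results are resident, so memory scales with the largest upgrade rather than with the whole run.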

Checklist

Not all may apply

  • Code changes (must work)
  • Tests exercising your feature/bug fix (check coverage report on CircleCI build -> Artifacts)
  • All other unit tests passing
  • Update validation for project config yaml file changes
  • Update existing documentation
  • Run a small batch to make sure it all works (local is fine, unless it's an Eagle-specific feature)
  • Add to the changelog_dev.rst file and propose migration text in the pull request

@rajeee rajeee changed the title Read results.csv one at a time Memory efficient results.csv creation Jan 21, 2022
@rajeee rajeee marked this pull request as ready for review January 21, 2022 18:00
@rajeee rajeee requested a review from nmerket January 21, 2022 18:00
@rajeee
Contributor Author

rajeee commented Jan 25, 2022

Works on a small batch. Need to verify on a super-large batch that previously would have failed.

Member

@nmerket nmerket left a comment


Looks good from what I can see. I think the dask dataframe is a good choice here. Let me know how it goes with a larger dataset.

@rajeee
Contributor Author

rajeee commented Jan 27, 2022

Works in a large run with 350K buildings, 16 upgrades. With n_worker=10 for postprocessing, it took ~10 hours.

@rajeee rajeee merged commit 61bda9a into develop Jan 27, 2022
@rajeee rajeee deleted the oom_fix branch January 27, 2022 17:21
@nmerket
Member

nmerket commented Jan 27, 2022

> Works in a large run with 350K buildings, 16 upgrades. With n_worker=10 for postprocessing, it took ~10 hours.

That's a long time, but it's good that it worked. I suppose that's what matters.

Successfully merging this pull request may close these issues.

Post-processing timeout/memory error for large runs