
Memory efficient results.csv creation #258

Merged
merged 7 commits into from Jan 27, 2022

Conversation

rajeee
Contributor

@rajeee rajeee commented Jan 21, 2022

Potentially fixes #253.

Pull Request Description

Instead of loading all the results_jobx.json.gz files into memory at once, use dask to load only the results for one upgrade at a time. This should prevent out-of-memory errors when dealing with large runs with many upgrades.
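The per-upgrade approach described above can be sketched roughly as follows. This is a minimal stdlib-only illustration of the memory pattern, not the PR's actual implementation (which uses dask dataframes); the directory layout and column names here are hypothetical:

```python
import csv
import gzip
import json
import tempfile
from pathlib import Path

# Hypothetical layout: one directory per upgrade, each holding
# gzipped JSON result files (results_job0.json.gz, ...).
root = Path(tempfile.mkdtemp())
for upgrade in range(2):
    d = root / f"up{upgrade:02d}"
    d.mkdir()
    for job in range(3):
        rows = [
            {"building_id": job * 10 + i, "upgrade": upgrade, "energy_kwh": 100.0 + i}
            for i in range(2)
        ]
        with gzip.open(d / f"results_job{job}.json.gz", "wt") as f:
            json.dump(rows, f)

# Memory-efficient aggregation: process one upgrade at a time instead of
# loading every results_job*.json.gz file into memory simultaneously.
for upgrade_dir in sorted(root.iterdir()):
    rows = []
    for path in sorted(upgrade_dir.glob("results_job*.json.gz")):
        with gzip.open(path, "rt") as f:
            rows.extend(json.load(f))  # only this upgrade's rows are held in memory
    with open(upgrade_dir / "results.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    print(f"{upgrade_dir.name}: wrote {len(rows)} rows")
```

The key point is the peak memory bound: at any moment only a single upgrade's results are resident, so memory scales with the largest upgrade rather than with the whole run.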

Checklist

Not all may apply

  • Code changes (must work)
  • Tests exercising your feature/bug fix (check coverage report on CircleCI build -> Artifacts)
  • All other unit tests passing
  • Update validation for project config yaml file changes
  • Update existing documentation
  • Run a small batch to make sure it all works (local is fine, unless it's an Eagle-specific feature)
  • Add to the changelog_dev.rst file and propose migration text in the pull request

@rajeee rajeee changed the title Read results.csv one at a time Memory efficient results.csv creation Jan 21, 2022
@rajeee rajeee marked this pull request as ready for review January 21, 2022 18:00
@rajeee rajeee requested a review from nmerket January 21, 2022 18:00
@rajeee
Contributor Author

rajeee commented Jan 25, 2022

Works on a small batch. Need to verify on a super-large batch that previously would have failed.

Member

@nmerket nmerket left a comment


Looks good from what I can see. I think the dask dataframe is a good choice here. Let me know how it goes with a larger dataset.

@rajeee
Contributor Author

rajeee commented Jan 27, 2022

Works in a large run with 350K buildings, 16 upgrades. With n_worker=10 for postprocessing, it took ~10 hours.

@rajeee rajeee merged commit 61bda9a into develop Jan 27, 2022
@rajeee rajeee deleted the oom_fix branch January 27, 2022 17:21
@nmerket
Member

nmerket commented Jan 27, 2022

> Works in a large run with 350K buildings, 16 upgrades. With n_worker=10 for postprocessing, it took ~10 hours.

That's a long time, but it's good that it worked. I suppose that's what matters.

Successfully merging this pull request may close these issues.

Post-processing timeout/memory error for large runs