
memory_issue_fix #212

Merged · 10 commits · Mar 22, 2021

Conversation

@rajeee (Contributor) commented Mar 15, 2021

Fixes #210.

Pull Request Description

The v0.19 postprocessing has been failing on Eagle with issue #210 when trying to postprocess the recent ResStock runs with schedules. After a series of debugging sessions, it seems that Dask and Eagle's nodes/processes just don't work well together.
This pull request does a few things:

  1. Reduces the number of processes and threads on each node to just one. This was necessary to avoid memory issues; Dask's multiprocess/multithread memory management is probably what was causing problems.
  2. Instead of writing all of the parquet files in a single command from a giant Dask dataframe built out of a list of delayed objects, it uses a simple list of delayed functions, each of which reads a group of parquet files, combines them, and writes them out (see the sketch after this list). This was necessary to avoid the communication/memory issues we were having with the giant Dask dataframe.
  3. Exposes the postprocessing node memory size in the yaml.
  4. Adds a "keep_intermediate_files" boolean option in the yaml. Intermediate files are very useful during debugging and are sometimes also useful in use cases where the individual simulation output timeseries are required.
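For concreteness, here is a minimal sketch (not the actual Eagle code) of the approach in items 1 and 2, assuming a local Dask cluster and made-up file names:

```python
import dask
import pandas as pd
from dask.distributed import Client, LocalCluster

# Item 1: one process and one thread per worker, sidestepping Dask's
# multiprocess/multithread memory management (illustrative setup only).
client = Client(LocalCluster(n_workers=1, threads_per_worker=1))

@dask.delayed
def combine_parquet_group(parquet_paths, output_path):
    """Read one group of parquet files, concatenate them, and write a single aggregate file."""
    frames = [pd.read_parquet(path) for path in parquet_paths]
    pd.concat(frames).to_parquet(output_path)

# Item 2: a plain list of independent delayed tasks, one per output file,
# instead of one giant dask dataframe built from delayed objects.
file_groups = [  # hypothetical grouping of the per-building timeseries files
    ["bldg0001.parquet", "bldg0002.parquet"],
    ["bldg0003.parquet", "bldg0004.parquet"],
]
tasks = [combine_parquet_group(group, f"aggregated_{i}.parquet")
         for i, group in enumerate(file_groups)]
dask.compute(*tasks)
```

And a rough guess at what the new yaml options from items 3 and 4 might look like; the key names below are placeholders, and the actual ones are defined by the project's yaml schema:

```yaml
eagle:
  postprocessing:
    node_memory_mb: 85248          # placeholder: memory to request for the postprocessing node
    keep_intermediate_files: true  # placeholder: keep per-simulation timeseries for debugging
```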

Oh, I also did some benchmarking for the optimal size of the aggregated parquet files. It looks like both 1GB and 4GB (the new default) are good.

[Benchmark chart: query runtimes vs. aggregated parquet file size]

```sql
-- Query 1
SELECT "build_existing_model.state" AS state, sum("total_site_electricity_kwh") AS total_energy
FROM "test_runs"."<test_table>_timeseries"
JOIN "<test_table>_baseline"
  ON "<test_table>_timeseries"."building_id" = "<test_table>_baseline"."building_id"
GROUP BY 1 ORDER BY 1;

-- Query 2
SELECT count(*) FROM "test_runs"."<test_table>_timeseries";
```

Checklist

Not all may apply

  • Code changes (must work)
  • Tests exercising your feature/bug fix (check coverage report on CircleCI build -> Artifacts)
  • All other unit tests passing
  • Update validation for project config yaml file changes
  • Update existing documentation
  • Run a small batch run to make sure it all works (local is fine, unless an Eagle specific feature)
  • Add to the changelog_dev.rst file and propose migration text in the pull request

@rajeee requested a review from nmerket on March 15, 2021 21:58
@rajeee marked this pull request as ready for review on March 15, 2021 22:30
@nmerket (Member) left a comment:

Looks pretty good. A few comments below. Could use some documentation updates and something in the changelog.

This reminds me I need to get things set up to have multiple versions of the docs published at the same time.

buildstockbatch/eagle.py (review thread, outdated; resolved)
Comment on lines 229 to 230:

```python
except Exception:
    return pd.DataFrame(columns=[all_cols]+['building_id'])
```
@nmerket (Member):

This is a pretty broad except block. It looks like it just returns an empty dataframe with the appropriate columns if anything goes wrong. Intentionally creating these kinds of silent errors is troubling, because it means there could be major problems with reading some of the parquet files while the user doesn't know and thinks everything is fine.

@rajeee (Contributor, Author):

Yeah, agreed, it could mask errors. But a few times there was an issue reading a handful of parquet files that crashed everything, so I thought it might be a good idea to assemble whatever can be assembled. Do you think it would be okay to keep this try-except if we only wrapped the read_parquet call?

@nmerket (Member):

I figured that's why you had this here. I don't know if it functionally matters much to only wrap the read_parquet call rather than do this. My primary concern is that this will silently omit data the user is expecting. That could mess up results. Right now at least the user knows something went wrong because they don't get their results. Could we put some sort of error message in the postprocessing (or other relevant) log file saying, "building_id = x, upgrade = y failed to load parquet"? They still might miss that, but it would be documented somewhere. It would be better than what we have now, which is an obtuse error with very little to go on to find the problem.
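For illustration, a narrower wrap along those lines might look like the sketch below (hypothetical helper, with names borrowed from the snippet above; per the follow-up comment, this error handling was ultimately removed rather than merged):

```python
import logging
import pandas as pd

logger = logging.getLogger(__name__)

def read_ts_parquet(fs, filename, all_cols):
    """Read one timeseries parquet file, logging failures instead of silently dropping data."""
    try:
        with fs.open(filename, 'rb') as f:
            return pd.read_parquet(f)
    except Exception:
        # Surface the failure in the postprocessing log so the user can investigate.
        logger.error(f"Failed to read timeseries parquet: {filename}")
        return pd.DataFrame(columns=list(all_cols) + ['building_id'])
```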

@rajeee (Contributor, Author) commented Mar 16, 2021:

@nmerket I started working on the solution you proposed of logging the buildings whose timeseries we failed to read, but then realized we would have to do it in multiple places. If there is a corrupted parquet file, it would fail here first:

```python
all_ts_cols = db.from_sequence(ts_filenames, partition_size=100).map(partial(get_cols, fs)).\
```

And a corrupted parquet file (or not being able to read from disk) is something I have seen only a couple of times, when Eagle was having an issue, so it may not be worthwhile to account for it. Even if we do need to handle it, maybe not in this PR. So I have removed all such handling.

```python
npartitions = min(len(ts_filenames), npartitions)  # cannot have less than one file per partition
ts_files_in_each_partition = np.array_split(ts_filenames, npartitions)
N = len(ts_files_in_each_partition[0])
```
@nmerket (Member):

Minor thing, but usually a variable written in ALL_CAPS is understood to be a constant in Python, and CamelCase is a class. This looks to be a regular variable, which is usually lower_snake_case. It's also not clear from the variable name what this is; a more descriptive name would be helpful.

@rajeee (Contributor, Author):

Point taken :). Thanks. I have a propensity to use big N for counts of any sort. I will change the name.
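For concreteness, the rename might be as simple as the line below (hypothetical name, following the lower_snake_case convention discussed):

```python
files_per_partition = len(ts_files_in_each_partition[0])
```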

@rajeee requested a review from nmerket on March 17, 2021 17:21
@nmerket (Member) left a comment:

Looks good. Thanks for making these changes!
