
memory_issue_fix #212

Merged · 10 commits · Mar 22, 2021

Conversation

@rajeee (Contributor) commented Mar 15, 2021

Fixes #210.

Pull Request Description

The v0.19 postprocessing has been failing on Eagle with issue #210 when trying to postprocess the recent ResStock runs with schedules. After a series of debugging sessions, it seems that Dask and Eagle's nodes/processes just don't work well together.
This pull request does a few things:

  1. Reduces the number of processes and threads on each node to just one. This was necessary to avoid memory issues; Dask's multiprocess/multithread memory management is probably what was causing problems.
  2. Instead of writing all of the parquet files in a single command from a giant Dask dataframe built out of a list of delayed objects, it uses a simple list of delayed functions, each of which reads a group of parquet files, combines them, and writes them out (see the sketch after this list). This was necessary to avoid the communication/memory issues we were having with the giant Dask dataframe.
  3. Exposes the postprocessing node memory size in the yaml.
  4. Adds a "keep_intermediate_files" boolean option in the yaml. Intermediate files are very useful during debugging and are sometimes also useful in use cases where the individual simulation output timeseries are required.
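For concreteness, here is a minimal sketch (not the actual Eagle code) of the approach in items 1 and 2, assuming a local Dask cluster and made-up file names:

```python
import dask
import pandas as pd
from dask.distributed import Client, LocalCluster

# Item 1: one process and one thread per worker, sidestepping Dask's
# multiprocess/multithread memory management (illustrative setup only).
client = Client(LocalCluster(n_workers=1, threads_per_worker=1))

@dask.delayed
def combine_parquet_group(parquet_paths, output_path):
    """Read one group of parquet files, concatenate them, and write a single aggregate file."""
    frames = [pd.read_parquet(path) for path in parquet_paths]
    pd.concat(frames).to_parquet(output_path)

# Item 2: a plain list of independent delayed tasks, one per output file,
# instead of one giant dask dataframe built from delayed objects.
file_groups = [  # hypothetical grouping of the per-building timeseries files
    ["bldg0001.parquet", "bldg0002.parquet"],
    ["bldg0003.parquet", "bldg0004.parquet"],
]
tasks = [combine_parquet_group(group, f"aggregated_{i}.parquet")
         for i, group in enumerate(file_groups)]
dask.compute(*tasks)
```

And a rough guess at what the new yaml options from items 3 and 4 might look like; the key names below are placeholders, and the actual ones are defined by the project's yaml schema:

```yaml
eagle:
  postprocessing:
    node_memory_mb: 85248          # placeholder: memory to request for the postprocessing node
    keep_intermediate_files: true  # placeholder: keep per-simulation timeseries for debugging
```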

Oh, I also did some benchmarking for the optimal size of the aggregated parquet files. It looks like both 1GB and 4GB (the new default) are good.

[Benchmark chart: query runtimes vs. aggregated parquet file size]

```sql
-- Query 1
SELECT "build_existing_model.state" AS state, sum("total_site_electricity_kwh") AS total_energy
FROM "test_runs"."<test_table>_timeseries"
JOIN "<test_table>_baseline"
  ON "<test_table>_timeseries"."building_id" = "<test_table>_baseline"."building_id"
GROUP BY 1 ORDER BY 1;

-- Query 2
SELECT count(*) FROM "test_runs"."<test_table>_timeseries";
```

Checklist

Not all may apply

  • Code changes (must work)
  • Tests exercising your feature/bug fix (check coverage report on CircleCI build -> Artifacts)
  • All other unit tests passing
  • Update validation for project config yaml file changes
  • Update existing documentation
  • Run a small batch run to make sure it all works (local is fine, unless an Eagle specific feature)
  • Add to the changelog_dev.rst file and propose migration text in the pull request

@rajeee requested a review from nmerket on March 15, 2021 21:58
@rajeee marked this pull request as ready for review on March 15, 2021 22:30
@nmerket (Member) left a comment:

Looks pretty good. A few comments below. Could use some documentation updates and something in the changelog.

This reminds me I need to get things set up to have multiple versions of the docs published at the same time.

buildstockbatch/eagle.py (review thread, outdated; resolved)
Comment on lines 229 to 230:

```python
except Exception:
    return pd.DataFrame(columns=[all_cols]+['building_id'])
```
@nmerket (Member):

This is a pretty broad except block. It looks like it just returns an empty dataframe with the appropriate columns if anything goes wrong. Intentionally creating these kinds of silent errors is troubling, because it means there could be major problems with reading some of the parquet files while the user doesn't know and thinks everything is fine.

@rajeee (Contributor, Author):

Yeah, agreed, it could mask errors. But a few times there was an issue reading a handful of parquet files that crashed everything, so I thought it might be a good idea to assemble whatever can be assembled. Do you think it would be okay to keep this try-except if we only wrapped the read_parquet call?

@nmerket (Member):

I figured that's why you had this here. I don't know if it functionally matters much to only wrap the read_parquet call rather than do this. My primary concern is that this will silently omit data the user is expecting. That could mess up results. Right now at least the user knows something went wrong because they don't get their results. Could we put some sort of error message in the postprocessing (or other relevant) log file saying, "building_id = x, upgrade = y failed to load parquet"? They still might miss that, but it would be documented somewhere. It would be better than what we have now, which is an obtuse error with very little to go on to find the problem.
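For illustration, a narrower wrap along those lines might look like the sketch below (hypothetical helper, with names borrowed from the snippet above; per the follow-up comment, this error handling was ultimately removed rather than merged):

```python
import logging
import pandas as pd

logger = logging.getLogger(__name__)

def read_ts_parquet(fs, filename, all_cols):
    """Read one timeseries parquet file, logging failures instead of silently dropping data."""
    try:
        with fs.open(filename, 'rb') as f:
            return pd.read_parquet(f)
    except Exception:
        # Surface the failure in the postprocessing log so the user can investigate.
        logger.error(f"Failed to read timeseries parquet: {filename}")
        return pd.DataFrame(columns=list(all_cols) + ['building_id'])
```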

@rajeee (Contributor, Author) commented Mar 16, 2021:

@nmerket I started working on the solution you proposed of logging the buildings whose timeseries we failed to read, but then realized we would have to do it in multiple places. If there is a corrupted parquet file, it would fail here first:

```python
all_ts_cols = db.from_sequence(ts_filenames, partition_size=100).map(partial(get_cols, fs)).\
```

And a corrupted parquet file (or not being able to read from disk) is something I have seen only a couple of times, when Eagle was having an issue, so it may not be worthwhile to account for it. Even if we do need to handle it, maybe not in this PR. So I have removed all such handling.

```python
npartitions = min(len(ts_filenames), npartitions)  # cannot have less than one file per partition
ts_files_in_each_partition = np.array_split(ts_filenames, npartitions)
N = len(ts_files_in_each_partition[0])
```
@nmerket (Member):

Minor thing, but usually a variable written in ALL_CAPS is understood to be a constant in Python, and CamelCase is a class. This looks to be a regular variable, which is usually lower_snake_case. It's also not clear from the variable name what this is; a more descriptive name would be helpful.

@rajeee (Contributor, Author):

Point taken :). Thanks. I have a propensity to use big N for counts of any sort. I will change the name.
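For concreteness, the rename might be as simple as the line below (hypothetical name, following the lower_snake_case convention discussed):

```python
files_per_partition = len(ts_files_in_each_partition[0])
```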

@rajeee requested a review from nmerket on March 17, 2021 17:21
@nmerket (Member) left a comment:

Looks good. Thanks for making these changes!
