memory_issue_fix #212
Conversation
Looks pretty good. A few comments below. Could use some documentation updates and something in the changelog.
This reminds me I need to get things set up to have multiple versions of the docs published at the same time.
buildstockbatch/postprocessing.py
Outdated
except Exception:
    return pd.DataFrame(columns=[all_cols]+['building_id'])
This is a pretty broad `except` block. It looks like it just returns an empty dataframe with the appropriate columns if anything goes wrong. Intentionally creating these kinds of silent errors is troubling because it means that there could be major problems with reading some of the parquet files, and the user will not know and will think everything is fine.
Yeah, agreed it could mask errors. But I think a few times there was an issue reading a handful of parquet files that resulted in crashing everything, so I thought it might be a good idea to assemble whatever can be assembled. Do you think it would be okay to have this try-except if we only wrap the read_parquet call?
I figured that's why you had this here. I don't know if it functionally matters much to only wrap the read_parquet call rather than do this. My primary concern is that this will silently omit data the user is expecting. That could mess up results. Right now at least the user knows something went wrong because they don't get their results. Could we put some sort of error message in the postprocessing (or other relevant) log file saying, "building_id = x, upgrade = y failed to load parquet"? They still might miss that, but it would be documented somewhere. It would be better than what we have now, which is an obtuse error with very little to go on to find the problem.
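A minimal sketch of that idea, wrapping only the `pd.read_parquet` call and logging which building/upgrade failed instead of silently swallowing the error. The helper name `read_ts_parquet` and its signature are hypothetical illustrations, not functions from buildstockbatch:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def read_ts_parquet(filename, all_cols, building_id, upgrade):
    """Read one timeseries parquet file; on failure, log which
    building/upgrade was skipped and return an empty frame with the
    expected columns so downstream concatenation still works."""
    try:
        return pd.read_parquet(filename, columns=all_cols)
    except Exception:
        logger.exception(
            "building_id = %s, upgrade = %s failed to load parquet (%s)",
            building_id, upgrade, filename,
        )
        return pd.DataFrame(columns=all_cols + ['building_id'])
```

The failure is still swallowed, but it is at least recorded in the postprocessing log, which is what is being proposed here.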
@nmerket I started working on the solution you proposed of logging the buildings whose timeseries we failed to read, but then realized we would have to do it in multiple places. If there is any corrupted parquet file, it would fail here first:
all_ts_cols = db.from_sequence(ts_filenames, partition_size=100).map(partial(get_cols, fs)).\
Also, a corrupted parquet file (or being unable to read from disk) is something I have seen only a couple of times, when Eagle itself was having issues, so it may not be worthwhile to account for it. Even if we do need to handle it, maybe not in this PR. So I have removed all such handling.
buildstockbatch/postprocessing.py
Outdated
npartitions = min(len(ts_filenames), npartitions)  # cannot have less than one file per partition
ts_files_in_each_partition = np.array_split(ts_filenames, npartitions)
N = len(ts_files_in_each_partition[0])
Minor thing, but usually a variable written in `ALL_CAPS` is understood to be a constant in Python, and `CamelCase` is a class. This looks to be a regular variable, which is usually written in `lower_snake_case`. It's also not clear from the variable name what this is; a more descriptive name would be helpful.
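For illustration, the rename suggested above might look like the sketch below. The name `files_per_partition` is a hypothetical choice, not necessarily what was merged, and the file list is a stand-in:

```python
import numpy as np

ts_filenames = [f'ts_{i}.parquet' for i in range(10)]  # stand-in file list
npartitions = 4

# Cannot have less than one file per partition.
npartitions = min(len(ts_filenames), npartitions)
ts_files_in_each_partition = np.array_split(ts_filenames, npartitions)

# Descriptive lower_snake_case name instead of the bare `N`:
files_per_partition = len(ts_files_in_each_partition[0])
```

`np.array_split` (unlike `np.split`) allows uneven splits, so the first partition holds the largest per-partition file count.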
Point taken :). Thanks. I have a propensity to use a big `N` for counts of any sort. I will change the name.
Co-authored-by: Noel Merket <noel.merket@nrel.gov>
Looks good. Thanks for making these changes!
Fixes #210.
Pull Request Description
The v0.19 postprocessing has been failing on Eagle with issue #210 when trying to postprocess the recent ResStock runs with schedules. After a series of debugging sessions, it seems like Dask and Eagle's nodes/processes just don't work well together.
This pull request does a few things:
Oh, I also did some benchmarking for the optimal size of the aggregated parquet files. It looks like both 1 GB and 4 GB (the new default) work well.
Query 1:

```sql
SELECT "build_existing_model.state" AS state,
       sum("total_site_electricity_kwh") AS total_energy
FROM "test_runs"."<test_table>_timeseries"
JOIN "<test_table>"
  ON "<test_table>_timeseries"."building_id" = "<test_table>_baseline"."building_id"
GROUP BY 1 ORDER BY 1;
```

Query 2:

```sql
SELECT count(*) FROM "test_runs"."<test_table>_timeseries";
```
Checklist
Not all may apply