upload-buildstockcsv-to-s3 #365
Conversation
Minimum allowed coverage is … Generated by 🐒 cobertura-action against 621e574
@yingli-NREL I added the pull request checklist back in. What you need to do is make sure you've done each item in that list (or verified that it wasn't applicable). If you're not sure what some of those mean, @rajee or I can help.
Create a folder named "buildstock_csv" in the S3 results directory and add buildstock.csv to that folder
…ockcsv-to-s3-during-postprocessing
I apologize for taking so long to properly review this. Have you run a small batch to verify this is working? This could use some unit testing, which is kind of tricky with s3 stuff because you'll probably need to mock things. There are some existing tests I linked to that could be added to. A couple other notes below. I know it is a lot, but I've been thorough and somewhat critical, hopefully as a teaching tool. If it would help, we can meet to talk through some of this.
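For the unit-testing point, here's a minimal sketch of mocking the S3 side (assuming postprocessing imports boto3 at module level, as the snippet below suggests; the test name and assertions are hypothetical):

from unittest.mock import MagicMock, patch

from buildstockbatch import postprocessing


def test_upload_results_uploads_buildstock_csv(tmp_path):
    # Patch the module-level boto3 so the test never touches real S3.
    with patch.object(postprocessing, 'boto3') as mock_boto3:
        mock_bucket = MagicMock()
        mock_boto3.resource.return_value.Bucket.return_value = mock_bucket
        # ... call postprocessing.upload_results(...) with a small results_dir
        # set up under tmp_path, then verify the CSV upload happened, e.g.:
        # mock_bucket.upload_file.assert_any_call(str(csv_path), expected_key)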
buildstockbatch/postprocessing.py
Outdated
def upload_buildstock_csv(filepath):
    full_path = buildstock_dir.joinpath(filepath)
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(s3_bucket)
    s3_prefix_output_new = s3_prefix_output + '/' + 'buildstock_csv' + '/'
    s3key = Path(s3_prefix_output_new).joinpath(filepath).as_posix()
    bucket.upload_file(str(full_path), str(s3key))
One more thought on this: could we not have upload_buildstock_csv here? It seems to mostly be a copy of the more generic upload_file, but puts the file somewhere else. Maybe there could be a second, optional argument to upload_file where you could specify the s3 location when it's different from the default, and then just use it for buildstock.csv.
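As a rough sketch of that idea (the s3_prefix parameter is the hypothetical addition; parquet_dir, s3_bucket, and s3_prefix_output are assumed to be closure variables inside upload_results, matching the existing code):

def upload_file(filepath, s3_prefix=None):
    # s3_prefix is the hypothetical optional argument; when omitted,
    # fall back to the default output prefix.
    full_path = parquet_dir.joinpath(filepath)  # assumed closure variable
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(s3_bucket)
    if s3_prefix is None:
        s3_prefix = s3_prefix_output
    s3key = Path(s3_prefix).joinpath(filepath).as_posix()
    bucket.upload_file(str(full_path), str(s3key))

The buildstock.csv upload then becomes a one-liner like upload_file(buildstock_csv.name, s3_prefix_output + '/buildstock_csv'), though the source directory would also need handling, since buildstock.csv lives in buildstock_dir rather than parquet_dir.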
I created a local environment in my HPC account and used that environment to run a small batch. It works. The results are uploaded to Amazon S3 > Buckets > eulp > buildstock_csv_to_s3/ > test/ > test10/
buildstockbatch/postprocessing.py
Outdated
buildstock_csv = []
for file in buildstock_dir.glob('buildstock.csv'):
    buildstock_csv.append(file.relative_to(buildstock_dir))
I see why you made this a list so you could use map below, but that's not necessary. You can just pull the filename here. We should also only upload it if it exists; I think ComStock might put this file somewhere different, so it may not always be where you're looking for it.
- buildstock_csv = []
- for file in buildstock_dir.glob('buildstock.csv'):
-     buildstock_csv.append(file.relative_to(buildstock_dir))
+ buildstock_csv = buildstock_dir / 'buildstock.csv'
I haven't tested that this works. I might be missing some normalization or something.
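Combining that with the existence check mentioned above, an equally untested sketch:

buildstock_csv = buildstock_dir / 'buildstock.csv'
if not buildstock_csv.exists():
    # ComStock may write this file elsewhere, so skip rather than fail.
    logger.warning('%s not found; skipping buildstock.csv upload', buildstock_csv)
    buildstock_csv = None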
buildstockbatch/postprocessing.py
Outdated
dask.compute(map(dask.delayed(upload_file), all_files))
dask.compute(map(dask.delayed(upload_buildstock_csv), buildstock_csv))
It's best to gather all the work you want to do into one list of tasks and then send that to dask.compute once. You don't have to call dask.delayed with a map; it was just convenient in this case. See the dask.delayed documentation for more details. This change depends on buildstock_csv being a filename, not a list with just one item in it.
- dask.compute(map(dask.delayed(upload_file), all_files))
- dask.compute(map(dask.delayed(upload_buildstock_csv), buildstock_csv))
+ tasks = list(map(dask.delayed(upload_file), all_files))
+ tasks.append(dask.delayed(upload_buildstock_csv)(buildstock_csv))
+ dask.compute(tasks)
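For reference, a self-contained illustration of the pattern, with stand-in work functions rather than the PR's upload helpers:

import dask

@dask.delayed
def square(x):
    return x * x

@dask.delayed
def negate(x):
    return -x

# Gather all the work into one list, then trigger it with a single
# dask.compute call so the scheduler can run everything together.
tasks = [square(i) for i in range(4)]
tasks.append(negate(10))
results = dask.compute(*tasks)
print(results)  # (0, 1, 4, 9, -10)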
…ockcsv-to-s3-during-postprocessing
The buildstock.csv is in a different place depending on whether you're running locally or on Eagle. I can take a look at why next week.
A few more notes. I'll try to get this working early next week.
buildstockbatch/postprocessing.py
Outdated
@@ -595,6 +595,7 @@ def remove_intermediate_files(fs, results_dir, keep_individual_timeseries=False)
 def upload_results(aws_conf, output_dir, results_dir):
     logger.info("Uploading the parquet files to s3")

+    buildstock_dir = Path(results_dir).parent.joinpath('housing_characteristics')
I think this is the problem right here. It's assuming that file is in the usual place it is on Eagle. I think the sampler returns the buildstock.csv location. We should catch that and use that file. It could be a bit tricky because the sampler is called way before the postprocessing, on a different node and everything.
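One possible shape for that, with heavily hypothetical names (nothing here exists in buildstockbatch yet): persist the path the sampler returns, then read it back at postprocessing time.

import json
from pathlib import Path

def record_buildstock_csv_path(csv_path, results_dir):
    # Hypothetical: called on the sampling node, right after the sampler
    # returns the path where it wrote buildstock.csv.
    record = Path(results_dir) / 'buildstock_csv_location.json'
    record.write_text(json.dumps({'buildstock_csv': str(csv_path)}))

def lookup_buildstock_csv_path(results_dir):
    # Hypothetical: called during postprocessing, possibly on another node.
    record = Path(results_dir) / 'buildstock_csv_location.json'
    if record.exists():
        return Path(json.loads(record.read_text())['buildstock_csv'])
    return None  # fall back to the Eagle default, or skip the upload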
buildstockbatch/postprocessing.py
Outdated
buildstock_csv = ''
for file in buildstock_dir.glob('buildstock.csv'):
    buildstock_csv = file.relative_to(buildstock_dir)
Also, this is a little bizarre to use a for loop and glob to look for one file.
…ockcsv-to-s3-during-postprocessing
I confirmed it's working for both the local version and on Eagle.
Modified upload_results() in postprocessing.py to upload buildstock.csv to S3
Checklist
Not all may apply
- Update minimum_coverage in .github/workflows/ci.yml as necessary.