
upload-buildstockcsv-to-s3 #365

Merged
merged 20 commits into develop on Oct 10, 2023

Conversation

@yingli-NREL yingli-NREL commented Apr 19, 2023

Modified upload_results() in postprocessing.py to upload buildstock.csv to S3.

Checklist

Not all may apply

  • Code changes (must work)
  • Tests exercising your feature/bug fix (check coverage report on Checks -> BuildStockBatch Tests -> Artifacts)
  • Coverage has increased or at least not decreased. Update minimum_coverage in .github/workflows/ci.yml as necessary.
  • All other unit and integration tests passing
  • Update validation for project config yaml file changes
  • Update existing documentation
  • Run a small batch on Eagle to make sure it all works, if you made changes that will affect Eagle
  • Add to the changelog_dev.rst file and propose migration text in the pull request

@yingli-NREL yingli-NREL requested a review from rajeee April 19, 2023 02:55
@yingli-NREL yingli-NREL linked an issue Apr 19, 2023 that may be closed by this pull request
github-actions bot commented Apr 19, 2023

File Coverage
All files 84%
base.py 89%
eagle.py 79%
exc.py 57%
local.py 50%
postprocessing.py 84%
utils.py 96%
sampler/base.py 79%
sampler/downselect.py 33%
sampler/precomputed.py 93%
sampler/residential_quota.py 61%
test/test_docker.py 33%
test/test_validation.py 97%
workflow_generator/base.py 90%
workflow_generator/commercial.py 53%
workflow_generator/residential_hpxml.py 84%

Minimum allowed coverage is 33%

Generated by 🐒 cobertura-action against 621e574


nmerket commented Apr 21, 2023

@yingli-NREL I added the pull request checklist back in. What you need to do is make sure you've done each item in that list (or verified that it isn't applicable). If you're not sure what some of those mean, @rajeee or I can help.

@yingli-NREL yingli-NREL marked this pull request as ready for review May 2, 2023 17:46
@yingli-NREL (Author) commented:

Created a folder named "buildstock_csv" in the S3 results directory and added buildstock.csv to that folder.

@nmerket nmerket left a comment


I apologize for taking so long to properly review this. Have you run a small batch to verify this is working? This could use some unit testing, which is kind of tricky with S3 stuff because you'll probably need to mock things. There are some existing tests I linked to that could be added to. A couple of other notes are below. I know it's a lot, but I've been thorough and somewhat critical, hopefully as a teaching tool. If it would help, we can meet to talk through some of this.

Comment on lines 635 to 642

def upload_buildstock_csv(filepath):
    full_path = buildstock_dir.joinpath(filepath)
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(s3_bucket)
    s3_prefix_output_new = s3_prefix_output + '/' + 'buildstock_csv' + '/'
    s3key = Path(s3_prefix_output_new).joinpath(filepath).as_posix()
    bucket.upload_file(str(full_path), str(s3key))

When I look at the coverage report, this doesn't seem to be getting called. It's not clear to me why.

[screenshot of the coverage report]

It would make sense to add something to the existing test I linked to that checks whether the file got uploaded. That test uses mock so it doesn't actually upload files to S3, but it can verify that the upload call was made successfully.
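
As a rough illustration of what that check could look like, here is a hedged sketch of a unit test that patches boto3 so nothing actually hits S3 and then asserts that the buildstock.csv key was among the uploads. The patch target, fixture arguments, and key layout are assumptions, not the project's actual test code.

from unittest import mock

from buildstockbatch import postprocessing  # assumed import path


def test_upload_results_includes_buildstock_csv(aws_conf, output_dir, results_dir):
    # aws_conf, output_dir, and results_dir are hypothetical fixtures.
    # Patch boto3 inside postprocessing so no real S3 calls are made.
    with mock.patch.object(postprocessing, 'boto3') as mock_boto3:
        postprocessing.upload_results(aws_conf, output_dir, results_dir)
        mock_bucket = mock_boto3.resource.return_value.Bucket.return_value
        uploaded_keys = [c.args[1] for c in mock_bucket.upload_file.call_args_list]
        # The new buildstock_csv/ prefix should show up among the uploaded keys.
        assert any(key.endswith('buildstock_csv/buildstock.csv') for key in uploaded_keys)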


One more thought on this: could we avoid having upload_buildstock_csv here at all? It seems to be mostly a copy of the more generic upload_file, just putting the file somewhere else. Maybe upload_file could take a second, optional argument specifying the S3 location when it's different from the default, and then it could also be used for buildstock.csv.
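
A rough sketch of what that optional argument might look like, assuming (as in the snippet above) that upload_file is a closure inside upload_results that can see parquet_dir, s3_bucket, and s3_prefix_output; none of this has been checked against the actual code:

import boto3
from pathlib import Path


def upload_file(filepath, s3_dst_prefix=None):
    # parquet_dir, s3_bucket, and s3_prefix_output are assumed to come from
    # the enclosing upload_results() scope, as in the existing helpers.
    full_path = parquet_dir.joinpath(filepath)
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(s3_bucket)
    # Default to the normal output prefix; append the override (e.g. 'buildstock_csv') when given.
    prefix = s3_prefix_output if s3_dst_prefix is None else s3_prefix_output + '/' + s3_dst_prefix
    s3key = Path(prefix).joinpath(filepath).as_posix()
    bucket.upload_file(str(full_path), str(s3key))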

@yingli-NREL (Author) replied:

I created a local environment in my HPC account and used that environment to run a small batch. It works. The results are uploaded to Amazon S3 > Buckets > eulp > buildstock_csv_to_s3/ > test/ > test10/.

Comment on lines 611 to 613
buildstock_csv = []
for file in buildstock_dir.glob('buildstock.csv'):
    buildstock_csv.append(file.relative_to(buildstock_dir))

I see why you made this a list so you could use map below, but that's not necessary. You can just pull the filename here. We should also only upload it if it exists. I think ComStock might put this file somewhere different, so it may not always be where you're looking for it.

Suggested change
-buildstock_csv = []
-for file in buildstock_dir.glob('buildstock.csv'):
-    buildstock_csv.append(file.relative_to(buildstock_dir))
+buildstock_csv = buildstock_dir / 'buildstock.csv'

I haven't tested that this works. I might be missing some normalization or something.
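
To also honor the "only upload it if it exists" point above, here is a minimal sketch with a hypothetical helper (buildstock_dir as in the snippet above; not part of the actual code):

from pathlib import Path


def find_buildstock_csv(buildstock_dir):
    # Return the buildstock.csv path if the sampler produced one here,
    # otherwise None so the caller can skip the upload step.
    candidate = Path(buildstock_dir) / 'buildstock.csv'
    return candidate if candidate.exists() else None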

Comment on lines 644 to 645
dask.compute(map(dask.delayed(upload_file), all_files))
dask.compute(map(dask.delayed(upload_buildstock_csv), buildstock_csv))

It's best to gather all the work you want to do into one list of tasks and then send that to dask.compute once. You don't have to call dask.delayed with a map; it was just convenient in this case. See the dask.delayed documentation for more details. This change depends on buildstock_csv being a filename, not a list with just one item in it.

Suggested change
-dask.compute(map(dask.delayed(upload_file), all_files))
-dask.compute(map(dask.delayed(upload_buildstock_csv), buildstock_csv))
+tasks = list(map(dask.delayed(upload_file), all_files))
+tasks.append(dask.delayed(upload_buildstock_csv)(buildstock_csv))
+dask.compute(tasks)
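
For reference, a small self-contained illustration of that pattern with stand-in functions (not the real upload helpers):

import dask


def upload_file(path):
    # Stand-in for the real upload helper; just reports what it would upload.
    return f"uploaded {path}"


def upload_buildstock_csv(path):
    return f"uploaded {path} to buildstock_csv/"


all_files = ['up00.parquet', 'up01.parquet']
tasks = list(map(dask.delayed(upload_file), all_files))
tasks.append(dask.delayed(upload_buildstock_csv)('buildstock.csv'))
results = dask.compute(tasks)  # one compute call runs every task together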

@yingli-NREL yingli-NREL requested a review from nmerket July 21, 2023 17:33
@nmerket nmerket added this to the v2023.10.0 milestone Oct 3, 2023

nmerket commented Oct 6, 2023

The buildstock.csv is in a different place depending on whether you're running buildstock_eagle or buildstock_local. When I run buildstock_local and have the output_directory not in the default location, it fails to upload:

INFO:2023-10-06 17:06:19:buildstockbatch.postprocessing:Uploading the parquet files to s3
2023-10-06 17:06:20,315 - distributed.worker - WARNING - Compute Failed
Key:       upload_buildstock_csv-3a9eafb5-350b-4d0d-ad8d-efae94791c46
Function:  upload_buildstock_csv
args:      ('')
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

I can take a look at why next week.

@nmerket nmerket left a comment


A few more notes. I'll try to get this working early next week.

@@ -595,6 +595,7 @@ def remove_intermediate_files(fs, results_dir, keep_individual_timeseries=False)
def upload_results(aws_conf, output_dir, results_dir):
    logger.info("Uploading the parquet files to s3")

    buildstock_dir = Path(results_dir).parent.joinpath('housing_characteristics')

I think this is the problem right here. It's assuming the file is in the usual place it lives on Eagle. I think the sampler returns the buildstock.csv location; we should capture that and use that file. It could be a bit tricky because sampling is run well before postprocessing, on a different node.
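
One hypothetical way to handle that, sketched here with assumed names (this is not what the PR actually does): have the sampling step copy buildstock.csv into the results directory so postprocessing can find it without guessing.

import shutil
from pathlib import Path


def stash_buildstock_csv(sampled_csv_path, results_dir):
    # Copy the sampler's buildstock.csv next to the results so that
    # upload_results() can pick it up regardless of output_directory layout.
    dest = Path(results_dir) / 'buildstock.csv'
    shutil.copy(str(sampled_csv_path), str(dest))
    return dest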

Comment on lines 611 to 613
buildstock_csv = ''
for file in buildstock_dir.glob('buildstock.csv'):
    buildstock_csv = file.relative_to(buildstock_dir)

Also, it's a little bizarre to use a for loop and glob to look for a single file.

@nmerket nmerket left a comment


I confirmed it's working for both the local version and on Eagle.

@nmerket nmerket merged commit 6e17414 into develop Oct 10, 2023
6 checks passed
@nmerket nmerket deleted the 348-upload-buildstockcsv-to-s3-during-postprocessing branch October 10, 2023 17:46
Successfully merging this pull request may close these issues.

Upload buildstock.csv to S3 during postprocessing
4 participants