Expand historical coverage pre-2019 #295

Merged
grgmiller merged 6 commits into historical_coverage_feature from historical_coverage on May 16, 2024

Conversation

grgmiller
Collaborator

@grgmiller grgmiller commented Mar 11, 2023

Summary

This PR updates the data pipeline to allow for the creation of historical data for 2013-2018. Because EIA-930 data is not available for a complete year prior to 2019, the data outputs for those earlier years will be limited as follows:

  • Only monthly and annual-resolution data will be available (no hourly data), although we could potentially still include hourly data for the plants that report to CEMS.
  • Only generated data will be available, since interchange data from EIA-930 is not available.

Where to look

Most of the updates are in data_pipeline.py. Minor changes in other files update the allowed year ranges and let certain functions accept an argument that switches their behavior depending on whether hourly data is available.
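As a rough illustration of the year-dependent behavior described above, the sketch below derives the available output resolutions from the data year. The function name `available_resolutions` and the constant `FIRST_FULL_EIA930_YEAR` are hypothetical and are not taken from the OGE codebase; only the 2019 cutoff comes from the summary of this PR.

```python
# Hypothetical sketch: the names below are illustrative, not from the OGE codebase.
FIRST_FULL_EIA930_YEAR = 2019  # first complete year of EIA-930 data, per the summary


def available_resolutions(year: int) -> list[str]:
    """Return the output resolutions that can be produced for a data year."""
    resolutions = ["monthly", "annual"]
    if year >= FIRST_FULL_EIA930_YEAR:
        # Hourly outputs rely on a complete year of EIA-930 data.
        resolutions.insert(0, "hourly")
    return resolutions
```

A caller could then branch on `"hourly" in available_resolutions(year)` instead of repeating the year comparison in several places.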

Update details

Document in more depth the changes being made

Screenshots

A couple screenshots of the changes/data if relevant.

Testing / Validation

  • Run the entire pipeline for 2018 without errors, examine warning messages
  • Run the entire pipeline for 2017 without errors, examine warning messages
  • Run the entire pipeline for 2016 without errors, examine warning messages
  • Run the entire pipeline for 2015 without errors, examine warning messages
  • Run the entire pipeline for 2014 without errors, examine warning messages

After running the 2018 pipeline, I noticed the following warnings that are not tripped in the more recent data:

  • Negative NOx emissions detected in CEMS
  • Columns missing in the column_checks
  • "Missing" data being dropped from CEMS with non-zero data
  • SO2 emission factors appear to be missing for JF and certain boiler configs
  • Some data is getting dropped or changed during the EIA-923 allocation process
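To make the first warning concrete, here is a minimal sketch of a negative-value check. It is not the validation code used in the pipeline: the record layout, the helper name `find_negative_values`, and the column name `nox_mass_lb` are all assumptions made for illustration.

```python
# Hypothetical sketch: the record layout and column name are assumptions.
def find_negative_values(records, column):
    """Return the records whose value in `column` is negative.

    Missing (None) values are skipped rather than treated as negative.
    """
    return [
        row for row in records
        if row.get(column) is not None and row[column] < 0
    ]


cems_sample = [
    {"plant_id": 1, "nox_mass_lb": 12.5},
    {"plant_id": 2, "nox_mass_lb": -0.3},
    {"plant_id": 3, "nox_mass_lb": None},
]
negative_rows = find_negative_values(cems_sample, "nox_mass_lb")
```

Here only the record for plant 2 is flagged; the missing value for plant 3 is left for a separate missing-data check.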

Linear ticket

Closes CAR-2968, CAR-1823, CAR-4206

Concerns

Anything you'd like to point out that the reviewers should pay special attention to

Next steps / Not addressed here

The availability of certain input data prior to 2013 may be different, so that will be addressed in a future PR.

Checklist

  • Update the documentation to reflect changes made in this PR

@grgmiller
Collaborator Author

Picking this PR back up on 11/22/23 after months of inactivity. At this point, I have just merged in the most recent development branch and done a test run of the pipeline with a single year (2018) of data to see if it is working. I wanted to sync this branch with the most recent changes before we start all of our other updates so it doesn't get too far out of sync, but I will probably pick work back up on this after the 2022 data update is complete.

@rouille rouille force-pushed the historical_coverage branch 2 times, most recently from dfb981d to d411aa3 Compare May 13, 2024 19:49
@rouille rouille changed the base branch from development to historical_coverage_feature May 14, 2024 18:19
@rouille rouille self-requested a review May 14, 2024 18:19
@rouille rouille added the new feature, data cleaning, and data inputs labels May 14, 2024
@rouille
Collaborator

rouille commented May 14, 2024

This PR is now part of a larger group of PRs that aim to update the data pipeline to allow for the creation of historical data from 2005-2018. All PRs related to expanding the historical coverage will be merged into the historical_coverage_feature feature branch.

This PR allows the pipeline to run without errors for 2008 through 2018. Note that the warnings have not been investigated yet and the outputs have not been validated. This PR simply fixes errors encountered when running 2008-2018.

Next steps:

  • Remove add_subplant_id keyword in oge.data_cleaning.clean_eia923
  • Add columns in the plant_attributes file such as lat/lon, nameplate capacity
  • Handle 2005, 2006, and 2007, the years for which some of the PUDL tables retrieved in oge.data_cleaning.clean_eia923 come back empty
  • Compress output files to save disk space
  • Validate 2005 - 2018
  • Update documentation
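For the output compression step listed above, one possible approach is to write gzip-compressed CSV files with the standard library. This is only a sketch under stated assumptions: the helper names `write_compressed_csv` and `read_compressed_csv` are hypothetical, and the pipeline may end up using a different format or library.

```python
import csv
import gzip


def write_compressed_csv(rows, fieldnames, path):
    """Write a list of dict rows to a gzip-compressed CSV file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_compressed_csv(path):
    """Read a gzip-compressed CSV file back into a list of dict rows."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Note that values read back through `csv.DictReader` are strings, so numeric columns would need to be converted after loading.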

@rouille rouille marked this pull request as ready for review May 14, 2024 18:36
Collaborator Author

@grgmiller grgmiller left a comment


See comments - some changes requested.

src/oge/data_cleaning.py (resolved)
src/oge/data_cleaning.py (outdated, resolved)
src/oge/output_data.py (outdated, resolved)
src/oge/output_data.py (outdated, resolved)
src/oge/validation.py (resolved)
@grgmiller
Collaborator Author

In addition to the next steps you listed above, it looks like we will need to figure out how to deal with the download_eia923() function since it will not work with some of the early data, and some of the functions that use those raw files may need alternative file handling in those earlier years.

@grgmiller
Collaborator Author

Also, can you please test this to make sure that the results for, say, 2022 match the existing outputs? This change should theoretically not affect any of the results.
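A regression check along these lines could be sketched as follows. It assumes both sets of outputs have already been loaded as lists of dict rows that share a unique key column; the function name `find_mismatches`, the key column, and the tolerance are hypothetical choices rather than part of the OGE codebase.

```python
import math


def find_mismatches(old_rows, new_rows, key, rel_tol=1e-9):
    """Compare two lists of dict rows that share a unique `key` column.

    Returns (key_value, column, old_value, new_value) tuples for every
    difference. A row missing from one side is reported with column None.
    """
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    mismatches = []
    for k in sorted(old_by_key.keys() | new_by_key.keys()):
        old, new = old_by_key.get(k), new_by_key.get(k)
        if old is None or new is None:
            mismatches.append((k, None, old, new))
            continue
        for col in sorted(old.keys() | new.keys()):
            a, b = old.get(col), new.get(col)
            if isinstance(a, (int, float)) and isinstance(b, (int, float)):
                # Allow tiny floating-point differences between runs.
                equal = math.isclose(a, b, rel_tol=rel_tol)
            else:
                equal = a == b
            if not equal:
                mismatches.append((k, col, a, b))
    return mismatches
```

An empty result for the 2022 outputs would support the expectation that the change leaves existing results untouched.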

@rouille
Collaborator

rouille commented May 14, 2024

> In addition to the next steps you listed above, it looks like we will need to figure out how to deal with the download_eia923() function since it will not work with some of the early data, and some of the functions that use those raw files may need alternative file handling in those earlier years.

Indeed:

open-grid-emissions[~/Singularity/open-grid-emissions/src/oge] (historical_coverage) brdo$ python data_pipeline.py --year 2005
2024-05-14 15:26:48 [INFO] oge.data_pipeline:71 

Running with the following options:
  * year = 2005
  * shape_individual_plants = True
  * small = False
  * flat = False
  * skip_outputs = False

2024-05-14 15:26:48 [INFO] oge.data_pipeline:121 Running data pipeline for year 2005
2024-05-14 15:26:48 [WARNING] oge.oge.validation:32 
        ################################################################################
        The data pipeline has only been validated to work for years 2019-2022.
        Running the pipeline for 2005 may cause it to fail or may lead to poor-quality
        or anomalous results. To check on the progress of validating additional years of
        data, see: https://github.com/singularity-energy/open-grid-emissions/issues/117
        ################################################################################
        
2024-05-14 15:26:48 [INFO] oge.data_pipeline:126 1. Downloading data
2024-05-14 15:26:48 [INFO] oge.oge.download_data:126 Using nightly build version of PUDL sqlite database downloaded 2024-04-03
2024-05-14 15:26:48 [INFO] oge.oge.download_data:147 Using nightly build version of PUDL epacems parquet file downloaded 2024-04-03
2024-05-14 15:26:48 [INFO] oge.oge.download_data:45 egrid2018_data.xlsx already downloaded, skipping.
2024-05-14 15:26:48 [INFO] oge.oge.download_data:45 egrid2019_data.xlsx already downloaded, skipping.
2024-05-14 15:26:48 [INFO] oge.oge.download_data:45 egrid2020_data.xlsx already downloaded, skipping.
2024-05-14 15:26:48 [INFO] oge.oge.download_data:45 egrid2021_data.xlsx already downloaded, skipping.
2024-05-14 15:26:48 [INFO] oge.oge.download_data:45 egrid2022_data.xlsx already downloaded, skipping.
2024-05-14 15:26:48 [INFO] oge.oge.download_data:45 epa_eia_crosswalk.csv already downloaded, skipping.
2024-05-14 15:26:48 [INFO] oge.oge.download_data:45 eia8602005 already downloaded, skipping.
2024-05-14 15:26:48 [INFO] oge.oge.download_data:45 eia8602022 already downloaded, skipping.
Traceback (most recent call last):
  File "/Users/brdo/Singularity/open-grid-emissions/src/oge/data_pipeline.py", line 656, in <module>
    main(sys.argv[1:])
  File "/Users/brdo/Singularity/open-grid-emissions/src/oge/data_pipeline.py", line 147, in main
    download_data.download_raw_eia923(year)
  File "/Users/brdo/Singularity/open-grid-emissions/src/oge/download_data.py", line 302, in download_raw_eia923
    raise NotImplementedError(f"EIA-923 data is unavailable for '{year}'.")
NotImplementedError: EIA-923 data is unavailable for '2005'.
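One way the failure above might be handled is to catch the exception and let later steps fall back to alternative file handling. The sketch below is illustrative only: `download_raw_eia923` is a stand-in that mimics the error in the traceback, and both the wrapper name `try_download_raw_eia923` and the 2008 cutoff are assumptions, not the actual behavior of the OGE function.

```python
# Hypothetical sketch. `download_raw_eia923` is a stand-in that mimics the error
# in the traceback; the cutoff year below is an assumption, not the real limit.
ASSUMED_FIRST_RAW_EIA923_YEAR = 2008


def download_raw_eia923(year):
    if year < ASSUMED_FIRST_RAW_EIA923_YEAR:
        raise NotImplementedError(f"EIA-923 data is unavailable for '{year}'.")
    return f"eia923_{year}"


def try_download_raw_eia923(year):
    """Return the download result, or None when the raw files are unavailable."""
    try:
        return download_raw_eia923(year)
    except NotImplementedError as err:
        # Downstream steps that read the raw files would need a fallback here.
        print(f"Skipping raw EIA-923 download: {err}")
        return None
```

Returning None makes the missing files explicit, so each function that reads them can decide how to proceed for the early years.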

src/oge/data_cleaning.py (outdated, resolved)
@grgmiller
Collaborator Author

I added one comment with a suggested name change, otherwise this looks good to merge once we confirm that this is not modifying the 2022 outputs.

@grgmiller
Collaborator Author

@rouille Looks good to me - ready to merge!

@grgmiller grgmiller merged commit afff28a into historical_coverage_feature May 16, 2024
1 check passed
@grgmiller grgmiller deleted the historical_coverage branch May 16, 2024 18:31
Labels
data cleaning (Cleaning and standardizing data), data inputs (related to new data, downloading, or loading data), new feature (New feature or request)

2 participants