Integrate GridPath RA Toolkit hourly renewable generation profiles #3467

Open
12 of 14 tasks
zaneselvans opened this issue Mar 14, 2024 · 2 comments
Assignees
Labels
csv Issues related to working with / extracting data from CSV files epic Any issue whose primary purpose is to organize other issues into a group. gridlab Work related to open modeling input data integration funded/coordinated by GridLab gridpathratoolkit Data derived from the GridPath Resource Adequacy Toolkit new-data Requests for integration of new data.

Comments

@zaneselvans
Member

zaneselvans commented Mar 14, 2024

Tasks

  1. gridlab
    zaneselvans
  2. 7 of 7
    gridlab new-data zenodo
    zaneselvans
  3. 3 of 3
    datastore gridlab new-data zenodo
    zaneselvans
  4. 5 of 5
    gridpathratoolkit new-data zenodo
    zaneselvans
  5. 4 of 4
    csv gridlab gridpathratoolkit new-data
    zaneselvans
  6. 4 of 4
    gridlab gridpathratoolkit new-data
    zaneselvans
  7. 3 of 3
    csv gridlab gridpathratoolkit new-data
    zaneselvans
  8. 10 of 11
    gridlab gridpathratoolkit new-data
    zaneselvans
  9. 12 of 12
    gridlab gridpathratoolkit new-data
    zaneselvans
  10. 2 of 4
    gridlab gridpathratoolkit new-data
    zaneselvans

Design Considerations

Wind Profiles

  • The files inside HourlyWind_byProject.zip are all of the form 57282_capfactor.csv.
  • The integer in the filename is the EIA facility ID, and that information is not in the data, so it will need to be extracted from the filename.
  • There are no timestamps in the data, only columns for year, month, day, which are repeated 24 times, one for each hour of the day. So timestamps will need to be constructed based on those columns and the ordering of the records.
  • Need to clarify what it means for these timestamps to be "hour ending".
  • Implied timestamps in the wind data should be treated as PST.
  • Should the provided capacity factors be applied to all wind generators that are associated with each plant_id_eia?
  • There are roughly 20M records total, 166MB total decompressed as a CSV.
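A minimal sketch of the wind parsing steps listed above, using only the standard library. It assumes the columns are literally named `year`, `month` and `day`, and uses a hypothetical `cap_factor` column name for the data; the real header may differ. It also assumes "PST" means a fixed UTC-8 offset with no daylight saving, and it labels each record with the start of the hour in UTC (so the first row of a day, hour ending 1, starts at midnight PST).

```python
import csv
import io
import re
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Standard Time is a fixed UTC-8 offset, never daylight time.
PST = timezone(timedelta(hours=-8))


def plant_id_from_filename(filename: str) -> int:
    """Extract the EIA plant ID from a name like '57282_capfactor.csv'."""
    match = re.fullmatch(r"(\d+)_capfactor\.csv", filename)
    if match is None:
        raise ValueError(f"Unexpected wind profile filename: {filename}")
    return int(match.group(1))


def read_wind_profile(filename: str, text: str) -> list[dict]:
    """Build hour-starting UTC timestamps from year/month/day and row order."""
    plant_id = plant_id_from_filename(filename)
    rows_seen_per_day: dict[tuple[int, int, int], int] = {}
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        day = (int(row["year"]), int(row["month"]), int(row["day"]))
        # The Nth row of a day is hour ending N, so it starts at local hour N - 1.
        hour_start = rows_seen_per_day.get(day, 0)
        if hour_start > 23:
            raise ValueError(f"More than 24 rows for {day}")
        rows_seen_per_day[day] = hour_start + 1
        local_start = datetime(*day, hour=hour_start, tzinfo=PST)
        records.append(
            {
                "plant_id_eia": plant_id,
                "datetime_utc": local_start.astimezone(timezone.utc),
                # "cap_factor" is a hypothetical column name.
                "capacity_factor": float(row["cap_factor"]),
            }
        )
    return records
```

Storing the start of the hour rather than the hour-ending label is a design choice made here for illustration; either works as long as the convention is documented.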

Solar Profiles

  • The files inside HourlySolar_byProject.zip have names like 10437_SUN2.csv where the leading integer is the EIA facility ID, and the part after the underscore is the generator ID. They contain hourly capacity factors, with UTC timestamps that are always at 30min after the hour.
  • Since these profiles only contain capacity factors, looking up the generator capacity will be necessary to calculate the actual power generated in any hour, so we'll need to bring in other tables anyway.
  • The minimum natural PK seems like (plant_id_eia, generator_id, timestamp_utc) with a single data column of capacity_factor
  • There are ~1400 solar facilities, 7.8GB when decompressed, ~260M records. Might be a good candidate for just Parquet output, with each generator or generator-year or year-state its own row group.
  • What, if any, additional columns should be added to the tables for better ergonomics (categorical columns with low cardinality in Parquet can take up very little space).
  • Is it worth generalizing the one-off Parquet output method we're using for the EPA CEMS to also handle writing this data?
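As a rough illustration of the solar handling described above, the sketch below splits a filename into its plant and generator IDs, floors a timestamp at 30 minutes past the hour to the start of that hour, and converts a capacity factor to power. The function names are hypothetical, and the capacity value would have to come from another table (for example EIA-860), which is not shown.

```python
from datetime import datetime, timezone


def parse_solar_filename(filename: str) -> tuple[int, str]:
    """Split a name like '10437_SUN2.csv' into (plant_id_eia, generator_id)."""
    stem = filename.removesuffix(".csv")
    # Split on the first underscore only, in case a generator ID contains one.
    plant_id, generator_id = stem.split("_", 1)
    return int(plant_id), generator_id


def hour_start(ts: datetime) -> datetime:
    """Floor a timestamp such as 00:30 to the start of the hour it falls in."""
    return ts.replace(minute=0, second=0, microsecond=0)


def generation_mw(capacity_factor: float, capacity_mw: float) -> float:
    """Power in MW implied by a capacity factor and a generator capacity."""
    return capacity_factor * capacity_mw
```

For example, `parse_solar_filename("10437_SUN2.csv")` yields `(10437, "SUN2")`, and a capacity factor of 0.25 on a 20 MW generator corresponds to 5 MW.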

Overall

  • Is there anything fundamentally different about the wind and solar data? Why not combine them into a single table, with a categorical column that indicates whether it's wind or solar?
  • Alternatively, given the plant and generator IDs, the type of generator can also be looked up in EIA-860. However, the fact that the wind data only includes the plant and not the generator ID confuses things. Should we assume the same capacity factor for each generator in the plant? Or is there a 1:1 relationship between plants & generators for wind?
  • To allow uniformity of access, it would be nice if all of the wind and solar data were reported on the same basis, with a primary key of (plant_id_eia, generator_id, timestamp_utc) + a single capacity_factor data column. Given that there will be 300M+ rows, adding extra columns seems unwise, but a set of generator IDs could be selected on the basis of various attributes from the EIA860 data, and then used to query the wind or solar time series data. Would that be convenient? If so, is there a good reason not to store all of the wind and solar hourly generation profiles in the same Parquet file?
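One way to put wind on the same (plant_id_eia, generator_id, timestamp_utc) basis as solar is to repeat each plant-level series once per wind generator at that plant. The sketch below assumes the list of generator IDs has already been looked up elsewhere (for example in EIA-860); the function name and record layout are hypothetical.

```python
from datetime import datetime, timezone

# (plant_id_eia, generator_id, timestamp_utc, capacity_factor)
Record = tuple[int, str, datetime, float]


def broadcast_plant_profile(
    plant_id: int,
    plant_profile: list[tuple[datetime, float]],
    generator_ids: list[str],
) -> list[Record]:
    """Repeat one plant-level series once per wind generator at that plant."""
    return [
        (plant_id, generator_id, ts, capacity_factor)
        for generator_id in generator_ids
        for ts, capacity_factor in plant_profile
    ]
```

This multiplies the row count by the number of generators per plant, so it trades storage for a uniform key.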

Questions

  • To help organize the outputs for efficient access, in what kind of blocks are the hourly profile time series data typically accessed? All data for a given generator? All generators for a single year? All generators in a given area? Based on the fact that the data is currently split up by plant or generator, I would imagine the row-groups should at least be defined to represent generators.
    • It’s hard to say how it will be used in general, but for the MVP, I think we’ll want to be able to pull the BA-level shapes for an entire weather year. Keep in mind that we typically choose a time zone for year-long simulations and different studies might choose different time zones, so ideally, you’d be able to pull a whole weather year corresponding to the time zone you’ve selected for your model run. The BA-level shapes are in MonteCarlo_Inputs/historical_data and the timestamps (in hour ending PST) corresponding to the timeseries in each directory are in the corresponding timestamps.csv files. The project level shapes were provided for the sake of transparency, but there are approximations that were made in developing those shapes that are more reasonable in aggregate than for individual projects, so I’m not sure there’s much value in hosting them at the project level.
  • Why is wind reported by plant, but solar by generator?
    • The hourly wind cap factors were developed using plant-specific empirical power curves, which were derived based on historical monthly generation data reported to EIA. The monthly generation data from EIA is reported at the plant level, so the power curves and the resulting hourly cap factors are also at the plant level.
  • Why do the wind and solar data use different conventions for their timestamps (PST vs. UTC, "hour ending" vs. 30min)
    • This is just because NSRDB and the Wind Toolkit have different time zone and timestamp conventions and we carried them forward in these particular datasets. When we’re doing simulations, everything gets converted into the same time zone (usually PST for Western US studies) and the same timestep convention (usually hour ending). I think it’s generally good practice to post this type of data in hour ending “HE” UTC. Hour ending means the hour that ends at the timepoint listed, so HE 01:00 represents the hour between midnight and 1am. And it’s supposed to mean that the data represents the average over the course of that hour (this convention comes from the practice of reporting the integral of power generation over the course of each hour, so the total energy generated over the course of a day, for example, equals the sum of the hour ending values). Since the NSRDB data is provided as measurements at points in time, I usually take the 0:30 point measurement as an estimate of the average conditions between 0:00 and 1:00.
  • Should plant-level wind capacity factors be associated with all wind generators that are part of the referenced plant?
    • Yes.
  • Just to be sure, what hours of the day do the implied timestamps in the wind data refer to? Is the first record the capacity factor that should be associated with the hour from 12am to 1am? Or 11pm the previous day until 12am (is that what "hour ending" means?)
    • The first row is 1am hour ending, so it reflects the hour from midnight to 1am.
  • Can we get an explicit public domain dedication or CC-BY-4.0 license added to a separate LICENSE file in the GridPath RA Toolkit tarball which indicates the terms under which the wind and solar profiles can be republished and used?
    • Done!
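The request to pull a whole weather year in the modeler's chosen time zone can be sketched as a filter over UTC-stamped records. This assumes each record carries a timezone-aware UTC timestamp for the start of its hour, and that the chosen time zone is a fixed offset (such as UTC-8 for PST); `select_weather_year` is a hypothetical helper, not an existing PUDL function.

```python
from datetime import datetime, timedelta, timezone


def select_weather_year(
    records: list[tuple[datetime, float]],
    year: int,
    utc_offset_hours: int,
) -> list[tuple[datetime, float]]:
    """Keep records whose hour starts within `year` in a fixed-offset time zone."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return [
        (ts, value)
        for ts, value in records
        if ts.astimezone(local_tz).year == year
    ]
```

Because the filter is applied to the hour start, the record labeled hour ending 24 on December 31 stays with its own year rather than spilling into the next one.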

Notes from README

Appendices refer to the GridPath RA Toolkit report

Hourly Wind Profiles

HourlyWind_byProject.zip: contains hourly simulated wind capacity factor data by project between 2007 and 2014, based on wind speed data from NREL's Wind Toolkit and empirically-derived power curves. Each file corresponds to a project from EIA Form 860: [Plant ID]_capfactor.csv. Note that the hour ending or "HE" time stamp column is missing, but the 24 hours of data corresponding to each day represent HE 1 through HE 24 of that day in Pacific Standard Time. For more information about how this data was developed and used in the study, see Appendix A.4.

Hourly Solar Profiles

HourlySolar_byProject.zip: contains hourly simulated solar capacity factor data by project between 1998 and 2019, based on data from the NSRDB and NREL's SAM model. Each file corresponds to a project from EIA Form 860: [Plant ID]_[Generator ID].csv. Timestamps are in UTC. For more information about how this data was developed and used in the study, see Appendix A.5.

Weather Data

DailyWeatherData_cleaned.csv: daily weather data from 16 locations in the West between 1948 and 2021. For more information, see Appendix E of the report.

Hydro Data

MonthlyHydro_byPlant.csv: monthly hydro energy by plant from EIA Form 923/906 between 2001 and 2020, listed by EIA Plant ID and EIA Plant Name. For more information about how this data was used in the study, see Appendix A.3.

Hourly Load Profiles

HourlyLoad_FERC714_cleaned.zip: contains hourly load data between 2006 and 2020 from FERC Form 714, which was used to develop the load shapes in the Western RA Case Study. Each file corresponds to a FERC respondent. In each file, the columns are: year, month, day, hour ending (Pacific Standard Time), load (MW). This data has been cleaned for use in this study, including making manual adjustments for missing or bad data. For more information about how this data was used in the study, see Appendix A.1.

Thermal Generators

HourlyThermal_byGenerator.zip: contains hourly estimated thermal temperature derates by generator between 1998 and 2019, based on temperature data from the NSRDB and project-specific piece-wise linear derate functions. Each file corresponds to a project from EIA Form 860: [Plant ID]_[GeneratorID].csv. Timestamps are contained in timestamps.csv and are listed in hour ending, Pacific Standard Time. For more information about how this data was developed and used in the study, see Appendix A.2.

Three Levels

There are 3 different versions of the wind and solar generation profiles available in the archived data:

  • Hourly generator (solar) or plant (wind) level capacity factors derived from NSRDB / WIND Toolkit
  • Hourly capacity-weighted aggregated (generally BA or transmission zone level) capacity factors for wind and solar
  • A cleaned up version of the aggregated capacity factors, in which problematic generators/plants that couldn't be fit well have been assigned the production curve defined by the overall aggregation (scaled down to their capacity).

Eventually I think we would like to be able to run this aggregation and data repair process within PUDL so that it could be adapted to different purposes. However, at the moment for the MVP we just need the final output. We can backfill the other steps later with better understanding.

One complication is that there are a small number of wind & solar projects which are "hybrid" -- they include energy storage as well as renewable generation. They have their own separate production curves, but may not be straightforwardly combinable with the pure renewable generation. Need to ask @anamileva & @elainekhart how to treat this data in relation to the other profiles.

@zaneselvans zaneselvans added new-data Requests for integration of new data. epic Any issue whose primary purpose is to organize other issues into a group. csv Issues related to working with / extracting data from CSV files gridlab Work related to open modeling input data integration funded/coordinated by GridLab labels Mar 14, 2024
@zaneselvans zaneselvans self-assigned this Mar 17, 2024
@zaneselvans
Member Author

zaneselvans commented Mar 19, 2024

Additional Questions for @elainekhart & @anamileva

How are the BA-level renewable generation curves derived from the project (plant/generator) level curves? Are they just the capacity-weighted sums of the project-level capacity factors for all projects associated with a given BA?

  • Elaine: They are capacity-weighted averages based on a generator list that was developed for the Western US RA study. The generator list reflects all resources that were online or under construction and expected to be online by 2026, based on EIA Form 860M, published in Feb 2021. In the long run, it would be awesome to be able to automatically create a generator list for a future year by applying filters to the latest EIA 860 data. I don’t know if this is necessary for the MVP – that’s probably a question for you and Ana. Note that while the shapes happen to line up with the BAs, they don’t have to – as long as they represent aggregations of resources that are fully within a single transmission zone. Because there is so much project-specific data and we don’t want to process it over and over again, we typically build cases starting with the aggregated shapes we’ve shared to reflect “existing” wind and solar and then add additional shapes that correspond to aggregations of new or planned resources. We might add multiple aggregated wind shapes to the case that correspond to different areas of potential development within a BA so we can play around with portfolios that develop these resources to different degrees.
  • Ana: This would be nice in the longer run, but is not necessary for the MVP, as we can apply a one-time filter on our end.
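Elaine's description of the aggregated shapes as capacity-weighted averages can be written out as a small sketch. The generator names, capacities and data layout below are hypothetical; the real generator list and capacities came from the Western US RA study.

```python
def capacity_weighted_average(
    profiles: dict[str, list[float]],
    capacity_mw: dict[str, float],
) -> list[float]:
    """Aggregate per-project capacity factors into one capacity-weighted shape.

    `profiles` maps a project key to its hourly capacity factors; all series
    must be the same length and aligned on the same hours.
    """
    lengths = {len(series) for series in profiles.values()}
    if len(lengths) != 1:
        raise ValueError("All profiles must cover the same hours")
    total_capacity = sum(capacity_mw[key] for key in profiles)
    n_hours = lengths.pop()
    return [
        sum(profiles[key][hour] * capacity_mw[key] for key in profiles)
        / total_capacity
        for hour in range(n_hours)
    ]
```

With a 100 MW project at capacity factors `[0.0, 1.0]` and a 300 MW project at `[0.5, 0.5]`, the aggregate shape is `[0.375, 0.625]`.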

In aggregating the project-level wind and solar data into BA level data, how do you deal with changes in the associations between plants and BAs? These could come from changes in the BA boundaries over time, or maybe for other reasons. Is it the case that the same projects can end up in different BAs depending on what year of data you're looking at?

  • Elaine: I suppose this technically could happen, though I can’t think of any examples of this. Remember that every simulation has an associated study year, or the future year that the simulation is supposed to reflect, in terms of loads and energy infrastructure. So all that matters is that the generators are mapped to the correct zones in the study year. The primary challenge is that different simulations might use different zonal representations of the system – with more or less granularity. The zones that we used for the Western US RA study are probably as granular as you would want to make them (at least for the MVP), so changing the topology typically means aggregating these zones into bigger zones.

If it's not a simple transformation from the project-level curves to the BA level curves, then for now we should probably just use the BA level curves. Which of those curves would we need? The solar/wind or solar_syn/wind_syn data? Or both?

  • Elaine: I would pull the _syn files. These include data for years where historical hourly weather data is available, as well as some years with synthesized data to create better coverage of overlapping datasets. That should be fine as long as we document which years are synthesized (synthesized wind: 2015-2020, synthesized solar: 2020).
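One lightweight way to document the synthesized years is a lookup that downstream code can use to flag records. The year ranges below come from Elaine's reply and are treated as inclusive; the names `SYNTHESIZED_YEARS` and `is_synthesized` are hypothetical.

```python
# Inclusive year ranges taken from the reply above:
# synthesized wind 2015-2020, synthesized solar 2020.
SYNTHESIZED_YEARS: dict[str, range] = {
    "wind": range(2015, 2021),
    "solar": range(2020, 2021),
}


def is_synthesized(resource: str, year: int) -> bool:
    """Return True if the given weather year is synthesized for this resource."""
    return year in SYNTHESIZED_YEARS[resource]
```

Which time zone defines the "year" of a record near midnight on January 1 is left open here and would need to match the convention of the source files.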

Do the BA codes associated with these production curves correspond to the reported BA codes associated with the individual plants/generators which we would find in EIA860, or do they refer to the simplified / aggregated BAs that you created to deduplicate some data and consolidate many tiny BAs into a smaller number of big BAs?

  • Elaine: The projects were mapped to the zones based on the transmission topology. In some cases, the BAA listed in EIA 860 was not as granular as the transmission topology (e.g. CISO was listed, but we needed to know if it’s CIPV, CISC, etc). I used information from the Common Case and other sources to do my best to map them to our transmission zones. If we’re using EIA 930 to develop the transmission topology and constraints, then maybe this isn’t an issue and we can just use the BAAs listed in EIA 860 to map all the generators (if that’s being taken on in the MVP).

Is there an explicit mapping stored somewhere that defines these aggregations by BA code or EIA IDs?

  • Elaine: Yes, but recall they are mapped to the wind and solar aggregations, which are not necessarily BAs. The mappings we used for the Western US RA study are attached.

What are the Hybrid_Wind_* and Hybrid_Solar_* series? If we're providing the BA level production curves, should these also be made available?

  • Elaine: Great question! Hybrid projects are sometimes modeled individually because there are project-specific constraints for how the storage can be used and you lose this information (or under-constrain the systems) if you aggregate them. I defer to Ana on whether the MVP should explicitly model individual hybrid projects or try to aggregate them.
  • Ana: Yes, I think it makes sense to make the individual hybrid project time series available for the MVP production cost modeling purposes. That said, it is not absolutely critical if it's a problem for any reason.

@zaneselvans zaneselvans added the gridpathratoolkit Data derived from the GridPath Resource Adequacy Toolkit label Mar 26, 2024
@zaneselvans
Member Author

zaneselvans commented Mar 30, 2024

A couple of plots of average capacity factor by hour of day that looked a bit odd. For AZPS it seems like there's a storage component. And also a tiny bit of nighttime power consumption?

[Two images: plots of average capacity factor by hour of day]
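A check like the one behind these plots can be sketched as follows: average the capacity factor for each hour of the day, then list the hours whose mean is negative, which would indicate net consumption. The record layout (timestamp, capacity factor) is an assumption, and the hour of day is taken from whatever time zone the timestamps carry.

```python
from collections import defaultdict
from datetime import datetime, timezone


def mean_capacity_factor_by_hour(
    records: list[tuple[datetime, float]],
) -> dict[int, float]:
    """Average capacity factor for each hour of day present in the records."""
    totals: dict[int, float] = defaultdict(float)
    counts: dict[int, int] = defaultdict(int)
    for ts, capacity_factor in records:
        totals[ts.hour] += capacity_factor
        counts[ts.hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}


def negative_hours(hourly_means: dict[int, float]) -> list[int]:
    """Hours of day whose mean capacity factor is below zero."""
    return [hour for hour, mean in hourly_means.items() if mean < 0]
```

Converting the timestamps to the local time zone of the BA before grouping would make the daily shape easier to read.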

Projects
Status: Backlog