NREL ATB data non-unique in 2019 (mostly) and 2020 #3576

cmgosnell · 2024-04-17T20:47:14Z

In trying to integrate NREL ATB, I ran into an oddity that made it difficult to integrate the data from 2019 and 2020. Because of this we are not integrating these years of ATB. All of this exploration was based on this semi-cleaned asset: _core_nrelatb__transform_start.pkl.zip

There is a column called core_metric_case which is always either "Market" or "R&D". Then there is another column called core_metric_key which is a composite (semi-)primary key column containing codes representing info stored in other columns in the data. The first character of the core_metric_key is always either Ror M. We've called this first letter of the core_metric_key the mystery_code. We and other collaborators thought this corresponded to the core_metric_case. It does the ~75% of the time:

>>> _core_nrelatb__transform_start[["mystery_code", "core_metric_case"]].value_counts()
mystery_code  core_metric_case
R             R&D                 728007
M             Market              453201
R             Market              330483
M             R&D                  55677
Name: count, dtype: int64

I'll note that the core_metric_key seems to have changed structure over time - especially in 2023. Also, the mystery_code never deviated from the core_metric_case in 2023.

This would be fine if the values in the value column did not vary based on the mystery_code (we could drop fully duplicate records w/o this mystery_code or the core_metric_key). But the data does seem to value truly different values. Of the three data tables which are derived from the info in the value column, two tables have real variability in value based on the mystery_code. The records that are variable by mystery_code make up 12% of the core_nrelatb__yearly_projections_by_scenario table and 5% of the core_nrelatb__yearly_rates_projections table.

The text was updated successfully, but these errors were encountered:

cmgosnell · 2024-04-24T14:02:08Z

just to note here, when i was trying to integrate 2019 and 2020 there is data in these two columns that indicate info about the revision of the data. those columns were completely null for the more recent years, so I took this out after removing 2019 and 2020 from the integration. but if we ever go back and tackle this weird thing and integrate 2019 and 2020 we could add back in this very small normalized table

in Normalizer:

revisions: TableNormalizer = TableNormalizer(
        idx=["report_year"],
        columns=["revision_num", "update_date"],
    )

and the asset definition:

@asset
def core_nrelatb__yearly_revisions(
    _core_nrelatb__transform_start: pd.DataFrame,
) -> pd.DataFrame:
    """Transform small table including which revision the data pertains to and when it was updated."""
    return transform_normalize(_core_nrelatb__transform_start, Normalizer().revisions)

cmgosnell added new-data Requests for integration of new data. nrelatb NREL's Annual Technology Baseline data labels Apr 17, 2024

This was referenced Apr 17, 2024

Create a transform that reshapes the NREL ATB data into a well normalized core table. #3507

Closed

Transform NREL ATB #3570

Merged

zaneselvans mentioned this issue Apr 27, 2024

Nightly Build Failure 2024-04-27 #3593

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NREL ATB data non-unique in 2019 (mostly) and 2020 #3576

NREL ATB data non-unique in 2019 (mostly) and 2020 #3576

cmgosnell commented Apr 17, 2024 •

edited

cmgosnell commented Apr 24, 2024

NREL ATB data non-unique in 2019 (mostly) and 2020 #3576

NREL ATB data non-unique in 2019 (mostly) and 2020 #3576

Comments

cmgosnell commented Apr 17, 2024 • edited

cmgosnell commented Apr 24, 2024

cmgosnell commented Apr 17, 2024 •

edited