Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NREL ATB data non-unique in 2019 (mostly) and 2020 #3576

Open
cmgosnell opened this issue Apr 17, 2024 · 1 comment
Open

NREL ATB data non-unique in 2019 (mostly) and 2020 #3576

cmgosnell opened this issue Apr 17, 2024 · 1 comment
Labels
new-data Requests for integration of new data. nrelatb NREL's Annual Technology Baseline data

Comments

@cmgosnell
Copy link
Member

cmgosnell commented Apr 17, 2024

In trying to integrate NREL ATB, I ran into an oddity that made it difficult to integrate the data from 2019 and 2020. Because of this we are not integrating these years of ATB. All of this exploration was based on this semi-cleaned asset: _core_nrelatb__transform_start.pkl.zip

There is a column called core_metric_case which is always either "Market" or "R&D". Then there is another column called core_metric_key which is a composite (semi-)primary key column containing codes representing info stored in other columns in the data. The first character of the core_metric_key is always either Ror M. We've called this first letter of the core_metric_key the mystery_code. We and other collaborators thought this corresponded to the core_metric_case. It does the ~75% of the time:

>>> _core_nrelatb__transform_start[["mystery_code", "core_metric_case"]].value_counts()
mystery_code  core_metric_case
R             R&D                 728007
M             Market              453201
R             Market              330483
M             R&D                  55677
Name: count, dtype: int64

I'll note that the core_metric_key seems to have changed structure over time - especially in 2023. Also, the mystery_code never deviated from the core_metric_case in 2023.

This would be fine if the values in the value column did not vary based on the mystery_code (we could drop fully duplicate records w/o this mystery_code or the core_metric_key). But the data does seem to value truly different values. Of the three data tables which are derived from the info in the value column, two tables have real variability in value based on the mystery_code. The records that are variable by mystery_code make up 12% of the core_nrelatb__yearly_projections_by_scenario table and 5% of the core_nrelatb__yearly_rates_projections table.

image
image

@cmgosnell cmgosnell added new-data Requests for integration of new data. nrelatb NREL's Annual Technology Baseline data labels Apr 17, 2024
@cmgosnell
Copy link
Member Author

just to note here, when i was trying to integrate 2019 and 2020 there is data in these two columns that indicate info about the revision of the data. those columns were completely null for the more recent years, so I took this out after removing 2019 and 2020 from the integration. but if we ever go back and tackle this weird thing and integrate 2019 and 2020 we could add back in this very small normalized table

in Normalizer:

revisions: TableNormalizer = TableNormalizer(
        idx=["report_year"],
        columns=["revision_num", "update_date"],
    )

and the asset definition:

@asset
def core_nrelatb__yearly_revisions(
    _core_nrelatb__transform_start: pd.DataFrame,
) -> pd.DataFrame:
    """Transform small table including which revision the data pertains to and when it was updated."""
    return transform_normalize(_core_nrelatb__transform_start, Normalizer().revisions)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new-data Requests for integration of new data. nrelatb NREL's Annual Technology Baseline data
Projects
Status: New
Development

No branches or pull requests

1 participant