Add failing test cases to illustrate potentially conflicting information #315

Draft
wants to merge 1 commit into
base: main
Conversation

phackstock
Contributor

Closes #290.

@danielhuppmann, I took a look at the questions you brought up in #290 and I think we should be good.

The case that you described would be as follows. Given a model mapping:

model: m_a
native_regions: [region_A, region_B]
common_regions:
  - region_C: [region_A, region_B]

with a variable code list:

- Variable A:
    definition: Test variable to be used for computing a max aggregate
    unit: EJ/yr
    region-aggregation:
        - Variable A (max):
            method: max
- Variable A (max):
    unit: EJ/yr

and input data:

IamDataFrame(
    pd.DataFrame(
        [
            ["m_a", "s_a", "region_A", "Variable A", "EJ/yr", 1],
            ["m_a", "s_a", "region_B", "Variable A", "EJ/yr", 1],
            ["m_a", "s_a", "region_A", "Variable A (max)", "EJ/yr", 2],
            ["m_a", "s_a", "region_B", "Variable A (max)", "EJ/yr", 1],
        ],
        columns=IAMC_IDX + [2020],
    )
)

yields a pyam error for duplicate data:

E       ValueError: Duplicate rows in `data`:
E         model scenario    region          variable   unit  year
E       0   m_a      s_a  region_C  Variable A (max)  EJ/yr  2020

meaning that, as expected, both operations are attempted: the aggregation of Variable A (max) through the region-aggregation attribute in Variable A, as well as the "standard" aggregation from the entry Variable A (max).
This case is safe, though, since pyam raises an error. We could specifically protect against it, but I'd say it's fine.
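
To illustrate why the two paths collide, here is a minimal sketch in plain Python. It is not the nomenclature or pyam implementation; the `aggregate` helper and the dictionary layout are hypothetical, and the default method is assumed to be a sum.

```python
from collections import Counter

# Input rows from the example above: (region, variable) -> value for 2020.
data = {
    ("region_A", "Variable A"): 1,
    ("region_B", "Variable A"): 1,
    ("region_A", "Variable A (max)"): 2,
    ("region_B", "Variable A (max)"): 1,
}
native_regions = ["region_A", "region_B"]


def aggregate(variable, method, target_region="region_C", rename=None):
    """Hypothetical helper: aggregate one variable over the native regions."""
    value = method(data[(region, variable)] for region in native_regions)
    return (target_region, rename or variable, value)


results = [
    # Path 1: region-aggregation attribute of "Variable A" (max, renamed).
    aggregate("Variable A", max, rename="Variable A (max)"),
    # Path 2: default aggregation (assumed to be a sum) of "Variable A (max)".
    aggregate("Variable A (max)", sum),
]

# Both results carry the same (region, variable) index, which is the
# situation pyam rejects as duplicate rows.
index_counts = Counter((region, variable) for region, variable, _ in results)
duplicates = [index for index, count in index_counts.items() if count > 1]
print(duplicates)
```

Both paths write to the same region_C / Variable A (max) index but with different values, so rejecting the result outright is the safe outcome.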

There might be more cases to consider though.

Only Variable A (max)

Take the above data but remove the first two rows, the ones for Variable A. In this case we'd get the following aggregation result:

  model scenario    region          variable   unit  year  value
0   m_a      s_a  region_A  Variable A (max)  EJ/yr  2020      2
1   m_a      s_a  region_B  Variable A (max)  EJ/yr  2020      1
2   m_a      s_a  region_C  Variable A (max)  EJ/yr  2020      3

For region_C, we now get 3, which is the sum, not the max, of region_A and region_B.
This is wrong but expected, since there is no method set for the aggregation of Variable A (max). We could safeguard against that relatively easily by enforcing that the aggregation method given in the region-aggregation attribute and the one given in the "normal" variable entry must be the same. So:

- Variable A:
    definition: Test variable to be used for computing a max aggregate
    unit: EJ/yr
    region-aggregation:
        - Variable A (max):
            method: max
- Variable A (max):
    unit: EJ/yr
    method: max

in the above example. We could also make it simpler and remove the method attribute from the variable inside the region-aggregation attribute, so that the method information is taken from the main variable entry directly.
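
A rough sketch of what such a consistency check could look like, operating on the code list as a plain dictionary (roughly as it would look after parsing the YAML). The function name `find_method_conflicts` and the `DEFAULT_METHOD` constant are hypothetical and not part of nomenclature; the fallback to a sum is an assumption based on the result above.

```python
DEFAULT_METHOD = "sum"  # assumption: aggregation falls back to a sum


def find_method_conflicts(codelist):
    """Return (source, target, nested_method, own_method) for each mismatch."""
    conflicts = []
    for source, attrs in codelist.items():
        for instruction in attrs.get("region-aggregation", []):
            for target, options in instruction.items():
                if target not in codelist:
                    # Nothing to compare against if the target has no own entry.
                    continue
                nested = options.get("method", DEFAULT_METHOD)
                own = codelist[target].get("method", DEFAULT_METHOD)
                if nested != own:
                    conflicts.append((source, target, nested, own))
    return conflicts


# The original code list: "Variable A (max)" has no method of its own.
codelist = {
    "Variable A": {
        "unit": "EJ/yr",
        "region-aggregation": [{"Variable A (max)": {"method": "max"}}],
    },
    "Variable A (max)": {"unit": "EJ/yr"},
}
conflicts = find_method_conflicts(codelist)
print(conflicts)
```

With `method: max` added to the Variable A (max) entry, as in the corrected code list, the same check would report no conflicts.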

Only Variable A

This is the straightforward version of the above case but I wanted to mention it. Taking only the first two rows of data gives:

  model scenario    region          variable   unit  year  value
0   m_a      s_a  region_A        Variable A  EJ/yr  2020      1
1   m_a      s_a  region_B        Variable A  EJ/yr  2020      1
2   m_a      s_a  region_C  Variable A (max)  EJ/yr  2020      1

which is correct and what we expect.

Variable A (max) in aggregation region

The final case that I could find is this one:

IamDataFrame(
    pd.DataFrame(
        [
            ["m_a", "s_a", "region_A", "Variable A", "EJ/yr", 1],
            ["m_a", "s_a", "region_B", "Variable A", "EJ/yr", 1],
            ["m_a", "s_a", "region_C", "Variable A (max)", "EJ/yr", 2],
        ],
        columns=IAMC_IDX + [2020],
    )
)

where Variable A (max) exists, but only for the common region region_C. In this case we also don't get an error, since provided data always takes precedence over aggregated data, and we get:

  model scenario    region          variable   unit  year  value
0   m_a      s_a  region_A        Variable A  EJ/yr  2020      1
1   m_a      s_a  region_B        Variable A  EJ/yr  2020      1
2   m_a      s_a  region_C  Variable A (max)  EJ/yr  2020      2

with the warning that there is a difference between aggregated and provided data for region_C.
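
The precedence rule can be sketched as follows. This is only an illustration of the behaviour described above, not the actual implementation; `merge_with_precedence` is a hypothetical helper and the standard `warnings` module is used as a stand-in for however the real warning is reported.

```python
import warnings

# Values for 2020, keyed by (region, variable).
provided = {("region_C", "Variable A (max)"): 2}    # reported by the model
aggregated = {("region_C", "Variable A (max)"): 1}  # max of region_A, region_B


def merge_with_precedence(provided, aggregated):
    """Hypothetical helper: provided values win over aggregated ones."""
    merged = dict(aggregated)
    for key, value in provided.items():
        if key in aggregated and aggregated[key] != value:
            # Flag the mismatch but keep the provided value.
            warnings.warn(
                f"Difference between provided ({value}) and "
                f"aggregated ({aggregated[key]}) data for {key}"
            )
        merged[key] = value
    return merged


merged = merge_with_precedence(provided, aggregated)
print(merged)
```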

Summary

  1. The case you described in Potential conflicts with overlapping region-aggregation instructions #290 would raise a pyam error, and since I've never seen it in practice so far, I'd say we can ignore that case.
  2. The only other case we should maybe safeguard against is conflicting information between the variable mentioned in the region-aggregation attribute and the "original" variable entry. One way out of this could be to only allow mentioning the variable name in region-aggregation; all other information is then read from the original entry.

@danielhuppmann, looking forward to your thoughts. I think I've thought through every case but please let me know if you've spotted an error.

@phackstock phackstock self-assigned this Jan 25, 2024
Development

Successfully merging this pull request may close these issues.

Potential conflicts with overlapping region-aggregation instructions