Refactor and extend RequiredDataValidator #278

Open
phackstock opened this issue Aug 21, 2023 · 2 comments

@phackstock
Contributor

To support validation of data ranges, the RequiredDataValidator needs to be extended and renamed.
There are three use cases that need to be covered:

  1. "Simple" requirement: Some data is just required to be present, with no further constraints
  2. Required and constrained: Data needs to be present and within a certain range
  3. Constrained but not required: Data does not need to be present, but if it is, there are constraints on it

The first use case is already covered and will not require any changes.
To support the second and third use cases, the yaml file that specifies required data and the RequiredDataValidator class will be extended.

For the second use case we introduce "constraints" as a keyword that can be used like so:

required_data:
- measurand: 
    Emission|CO2:
      unit: Mt CO2/yr
  year: [2020, 2025, 2030, 2035, 2040, 2045, 2050]
  region: [World, Europe]
  constraints: 
  - year: 2020
    region: World
    upper: 46000
    lower: 42000

This would be interpreted as follows:

  1. We require a measurand called Emission|CO2 reported in the unit Mt CO2/yr
  2. We require this measurand to be present for the years 2020 to 2050 in 5 year time steps
  3. We require this measurand to be present for the regions World and Europe
  4. We place a constraint only on the value for the region World in the year 2020: it must lie between 42000 and 46000 Mt CO2/yr (42 to 46 Gt CO2/yr).
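
As a rough illustration of the four points above, here is a minimal sketch in plain Python. It is not the RequiredDataValidator implementation: the function name `validate_entry`, the dictionary form of the yaml entry, and the layout of `data` as a mapping from (measurand, unit, region, year) to a value are all assumptions made for this example.

```python
from itertools import product

# Hypothetical in-memory form of the required_data entry shown above
spec = {
    "measurand": {"Emission|CO2": {"unit": "Mt CO2/yr"}},
    "year": [2020, 2025, 2030, 2035, 2040, 2045, 2050],
    "region": ["World", "Europe"],
    "constraints": [
        {"year": 2020, "region": "World", "upper": 46000, "lower": 42000}
    ],
}


def validate_entry(spec, data):
    """Return a list of error messages; an empty list means the entry passes.

    `data` maps (measurand, unit, region, year) to a value (assumed layout).
    """
    errors = []
    # Points 1 to 3: every measurand/unit must exist for every region and year
    for (name, attrs), region, year in product(
        spec["measurand"].items(), spec["region"], spec["year"]
    ):
        key = (name, attrs["unit"], region, year)
        if key not in data:
            errors.append(f"missing: {key}")
    # Point 4: range checks apply only to the data points named in constraints
    for constraint in spec.get("constraints", []):
        lower = constraint.get("lower", float("-inf"))
        upper = constraint.get("upper", float("inf"))
        for name, attrs in spec["measurand"].items():
            key = (name, attrs["unit"], constraint["region"], constraint["year"])
            value = data.get(key)
            if value is None:
                continue  # absence is reported by the presence check, if required
            if not lower <= value <= upper:
                errors.append(f"out of range: {key} = {value}")
    return errors


# Illustrative data only: the same value for every region and year
data = {
    ("Emission|CO2", "Mt CO2/yr", region, year): 44000.0
    for region in spec["region"]
    for year in spec["year"]
}
print(validate_entry(spec, data))
```

With the illustrative data, every required point is present and the World value for 2020 lies inside the range, so the returned list is empty. Changing that one value to something outside 42000 to 46000 produces a single "out of range" message, while the unconstrained years and regions are never range-checked.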

For the third use case we introduce a new section in the file besides required_data, namely optional_data.
Its use is identical to that of required_data, with the only exception that the validation does not fail if the data is not present.
Using the above example again but this time with optional_data translates to the following logic:

optional_data:
- measurand: 
    Emission|CO2:
      unit: Mt CO2/yr
  year: [2020, 2025, 2030, 2035, 2040, 2045, 2050]
  region: [World, Europe]
  constraints: 
  - year: 2020
    region: World
    upper: 46000
    lower: 42000

  1. If Emission|CO2 is completely missing from the data, the validation passes as we're looking at optional_data.
  2. If it is there, all of the above logic applies: it needs to be reported in the specified unit for all listed years and regions, and the value for the World region in 2020 needs to be within the given range.
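
The difference between the two sections can be sketched as follows. This is only an illustration under assumptions, not the proposed class: `check_entry`, `validate`, and the layout of `data` as a mapping from (measurand, unit, region, year) to a value are hypothetical, and the sketch assumes an optional entry is skipped only when none of its measurands is reported at all.

```python
from itertools import product


def check_entry(entry, data):
    """Presence and range checks for one entry; returns error messages."""
    errors = []
    for (name, attrs), region, year in product(
        entry["measurand"].items(), entry["region"], entry["year"]
    ):
        if (name, attrs["unit"], region, year) not in data:
            errors.append(f"missing: {name} {region} {year}")
    for c in entry.get("constraints", []):
        lower = c.get("lower", float("-inf"))
        upper = c.get("upper", float("inf"))
        for name, attrs in entry["measurand"].items():
            value = data.get((name, attrs["unit"], c["region"], c["year"]))
            if value is not None and not lower <= value <= upper:
                errors.append(f"out of range: {name} {c['region']} {c['year']}")
    return errors


def validate(config, data):
    """Apply required_data strictly; skip optional_data entries that are absent."""
    errors = []
    for entry in config.get("required_data", []):
        errors += check_entry(entry, data)
    reported = {key[0] for key in data}  # names of all reported measurands
    for entry in config.get("optional_data", []):
        if reported.isdisjoint(entry["measurand"]):
            continue  # completely missing: optional, so the validation passes
        errors += check_entry(entry, data)
    return errors


# Hypothetical in-memory form of the optional_data example above
config = {
    "optional_data": [
        {
            "measurand": {"Emission|CO2": {"unit": "Mt CO2/yr"}},
            "year": [2020, 2025, 2030, 2035, 2040, 2045, 2050],
            "region": ["World", "Europe"],
            "constraints": [
                {"year": 2020, "region": "World", "upper": 46000, "lower": 42000}
            ],
        }
    ]
}

print(validate(config, {}))  # no Emission|CO2 at all: nothing to report
only_one_point = {("Emission|CO2", "Mt CO2/yr", "World", 2020): 50000.0}
print(len(validate(config, only_one_point)))  # partially reported: full checks apply
```

Moving the same entry under a "required_data" key would make the empty data set fail, which is the only behavioural difference between the two sections in this sketch.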

FYI @danielhuppmann

@phackstock phackstock self-assigned this Aug 21, 2023
@danielhuppmann
Member

I think that makes a lot of sense. Having optional data without constraints wouldn't have any effect, but it's probably better to keep the same structure between the required and optional sections rather than being perfectly "efficient".

Two other ideas/suggestions:

  • Rename the class to DataValidator (because it also has not-required components)
  • Rename the directory where such data-validation files are stored (and are tested as part of the nomenclature-validation) to data_validation

[In parallel, the directory with MetaValidator yaml files could be renamed to meta_validation?]

@phackstock
Contributor Author

Ah yes, agreed with all of the proposed renamings.
