Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implementation for fast categorize #819

Closed
wants to merge 5 commits into from

Conversation

gidden
Copy link
Member

@gidden gidden commented Feb 27, 2024

This is only a prototype for now, but I hope shows how we could implement a fast categorization in line with previous work on df.validate()

Please confirm that this PR has done the following:

  • Tests Added no user-facing api changes
  • Documentation Added no user-facing api changes
  • Name of contributors Added to AUTHORS.rst
  • Description in RELEASE_NOTES.md Added

Description of PR

This PR uses the machinery from validate() now in categorize() resulting in huge speedups

@gidden gidden marked this pull request as ready for review March 12, 2024 11:11
Copy link

codecov bot commented Mar 12, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 95.0%. Comparing base (ddbb88e) to head (fed374a).
Report is 3 commits behind head on main.

Files Patch % Lines
pyam/core.py 85.7% 1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main    #819   +/-   ##
=====================================
  Coverage   95.0%   95.0%           
=====================================
  Files         64      63    -1     
  Lines       6134    6144   +10     
=====================================
+ Hits        5828    5841   +13     
+ Misses       306     303    -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gidden
Copy link
Member Author

gidden commented Mar 12, 2024

Not clear to my why ruff CI is failing

@gidden gidden removed the request for review from glatterf42 March 12, 2024 11:11
@glatterf42
Copy link
Collaborator

The error message seems instructive to me:

pyam/core.py:1:1: I001 [*] Import block is un-sorted or un-formatted
pyam/core.py:69:29: F401 [*] `pyam.validation._apply_criteria` imported but unused
Found 2 errors.
[*] 2 fixable with the `--fix` option.

We have recently introduced ruff in this repo, which can handle several nice things at once. For example, it sorts your import blocks and cleans up unused imports. All these fixes have been applied to the main branch, so if you rebase this PR on top of that, you'll automatically get them. If you don't want to do that for whatever reason, run ruff check . --fix in your local environment (or use --diff instead of --fix first to see what will happen) and commit the changes again :)

@glatterf42
Copy link
Collaborator

If you've done that and the action is still failing, please compare your local version with the one used for the check. Per default, GHA will use the latest pypi release, so 0.3.2.

@gidden gidden changed the title prototype implementation for fast categorize with linting implementation for fast categorize Mar 13, 2024
@gidden gidden requested a review from glatterf42 March 14, 2024 07:20
Copy link
Collaborator

@glatterf42 glatterf42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me in principle, but for this to be a proof of concept, I'd like to see a test case using the new behaviour and confirming that it works as intended.

Or am I misunderstanding this? The PR this one is following up on changed the call signature of df.validate(), but this one keeps categorize() the same, isn't it?

Copy link
Collaborator

@glatterf42 glatterf42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the existing tests are enough for me :)

(But please note that I'm not actually a maintainer of this repo.)

@gidden
Copy link
Member Author

gidden commented Mar 14, 2024

Correct, all we do here is update the internals of categorize() to utilize another top-level API (validate()). I'd argue this is even more robust than depending on other internal-only functions. Maybe @danielhuppmann can quickly give a thumbs up?

@danielhuppmann
Copy link
Member

Will review tonight

@danielhuppmann
Copy link
Member

This PR does not actually implement the new signature of validate() using direct filters and upper/lower bound arguments, hence the deprecation warning (inherited from validate()) without being able to use the new signature would be probably very confusing to users. So I fully implemented this in a new branch building on the prototype by @gidden..

Closing in favor of #837

@gidden
Copy link
Member Author

gidden commented Mar 15, 2024

As an FYI, I was utilizing this prototype including the filters and upper/lower bound arguments. Code snippets that I was using are here:

in a config.yaml

  categorize:
    name: foo
    label: Medium Risk
    color: xkcd:baby blue
  criteria:
    upper_bound: 1000000
    lower_bound: 800000
    year: 2100
    region: World

functioning code

        for key, data in self.definitions.items():
            if "categorize" not in data:
                continue
            logger.info(f"Preparing categorization for {key}")

            args = data["categorize"]
            self.db.categorize(
                args["name"],
                args["label"],
                variable=data["variable"],
                **data["criteria"],
                **{k: v for k, v in args.items() if k not in ["name", "label"]},
            )

if len(idx) == 0:
# find all data that matches categorization
# TODO: if validate returned an empty index, this check would be easier
not_valid = self.validate(criteria=criteria, **kwargs)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here @danielhuppmann the validate kwargs like upper_bound and other filtering args are taken in

@danielhuppmann
Copy link
Member

Ok, right, the method works with the new signature, but it was not described in the docstring and there were no tests added in this PR - this is what I implemented in the new branch and PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants