implementation for fast categorize #819

gidden · 2024-02-27T17:06:04Z

This is only a prototype for now, but I hope shows how we could implement a fast categorization in line with previous work on df.validate()

Please confirm that this PR has done the following:

~~Tests Added~~ no user-facing api changes
~~Documentation Added~~ no user-facing api changes
Name of contributors Added to AUTHORS.rst
Description in RELEASE_NOTES.md Added

Description of PR

This PR uses the machinery from validate() now in categorize() resulting in huge speedups

codecov · 2024-03-12T11:11:15Z

Codecov Report

Attention: Patch coverage is 85.71429% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 95.0%. Comparing base (ddbb88e) to head (fed374a).
Report is 3 commits behind head on main.

Files	Patch %	Lines
pyam/core.py	85.7%	1 Missing ⚠️

Additional details and impacted files

@@          Coverage Diff          @@
##            main    #819   +/-   ##
=====================================
  Coverage   95.0%   95.0%           
=====================================
  Files         64      63    -1     
  Lines       6134    6144   +10     
=====================================
+ Hits        5828    5841   +13     
+ Misses       306     303    -3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

gidden · 2024-03-12T11:11:16Z

Not clear to my why ruff CI is failing

glatterf42 · 2024-03-12T11:32:21Z

The error message seems instructive to me:

pyam/core.py:1:1: I001 [*] Import block is un-sorted or un-formatted
pyam/core.py:69:29: F401 [*] `pyam.validation._apply_criteria` imported but unused
Found 2 errors.
[*] 2 fixable with the `--fix` option.

We have recently introduced ruff in this repo, which can handle several nice things at once. For example, it sorts your import blocks and cleans up unused imports. All these fixes have been applied to the main branch, so if you rebase this PR on top of that, you'll automatically get them. If you don't want to do that for whatever reason, run ruff check . --fix in your local environment (or use --diff instead of --fix first to see what will happen) and commit the changes again :)

glatterf42 · 2024-03-12T11:35:11Z

If you've done that and the action is still failing, please compare your local version with the one used for the check. Per default, GHA will use the latest pypi release, so 0.3.2.

glatterf42

Looks good to me in principle, but for this to be a proof of concept, I'd like to see a test case using the new behaviour and confirming that it works as intended.

Or am I misunderstanding this? The PR this one is following up on changed the call signature of df.validate(), but this one keeps categorize() the same, isn't it?

glatterf42

Then the existing tests are enough for me :)

(But please note that I'm not actually a maintainer of this repo.)

gidden · 2024-03-14T09:41:35Z

Correct, all we do here is update the internals of categorize() to utilize another top-level API (validate()). I'd argue this is even more robust than depending on other internal-only functions. Maybe @danielhuppmann can quickly give a thumbs up?

danielhuppmann · 2024-03-14T13:41:41Z

Will review tonight

danielhuppmann · 2024-03-14T19:56:26Z

This PR does not actually implement the new signature of validate() using direct filters and upper/lower bound arguments, hence the deprecation warning (inherited from validate()) without being able to use the new signature would be probably very confusing to users. So I fully implemented this in a new branch building on the prototype by @gidden..

Closing in favor of #837

gidden · 2024-03-15T06:55:30Z

As an FYI, I was utilizing this prototype including the filters and upper/lower bound arguments. Code snippets that I was using are here:

in a config.yaml

  categorize:
    name: foo
    label: Medium Risk
    color: xkcd:baby blue
  criteria:
    upper_bound: 1000000
    lower_bound: 800000
    year: 2100
    region: World

functioning code

        for key, data in self.definitions.items():
            if "categorize" not in data:
                continue
            logger.info(f"Preparing categorization for {key}")

            args = data["categorize"]
            self.db.categorize(
                args["name"],
                args["label"],
                variable=data["variable"],
                **data["criteria"],
                **{k: v for k, v in args.items() if k not in ["name", "label"]},
            )

gidden · 2024-03-15T06:56:53Z

pyam/core.py

-        if len(idx) == 0:
+        # find all data that matches categorization
+        # TODO: if validate returned an empty index, this check would be easier
+        not_valid = self.validate(criteria=criteria, **kwargs)


Here @danielhuppmann the validate kwargs like upper_bound and other filtering args are taken in

danielhuppmann · 2024-03-15T07:01:24Z

Ok, right, the method works with the new signature, but it was not described in the docstring and there were no tests added in this PR - this is what I implemented in the new branch and PR.

gidden added 4 commits February 27, 2024 18:05

prototype implementation for fast categorize with linting

9af8a36

Merge branch 'main' into fast-categorize

9833066

relint

719916f

update release notes

3429405

gidden requested review from phackstock, glatterf42 and danielhuppmann March 12, 2024 11:11

gidden marked this pull request as ready for review March 12, 2024 11:11

gidden removed the request for review from glatterf42 March 12, 2024 11:11

apply ruff

fed374a

gidden changed the title ~~prototype implementation for fast categorize with linting~~ implementation for fast categorize Mar 13, 2024

gidden requested a review from glatterf42 March 14, 2024 07:20

glatterf42 suggested changes Mar 14, 2024

View reviewed changes

glatterf42 approved these changes Mar 14, 2024

View reviewed changes

danielhuppmann closed this Mar 14, 2024

gidden commented Mar 15, 2024

View reviewed changes

gidden mentioned this pull request Mar 15, 2024

Implement explicit API for categorize() #837

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implementation for fast categorize #819

implementation for fast categorize #819

gidden commented Feb 27, 2024 •

edited

codecov bot commented Mar 12, 2024 •

edited

gidden commented Mar 12, 2024

glatterf42 commented Mar 12, 2024

glatterf42 commented Mar 12, 2024

glatterf42 left a comment •

edited

glatterf42 left a comment •

edited

gidden commented Mar 14, 2024

danielhuppmann commented Mar 14, 2024

danielhuppmann commented Mar 14, 2024

gidden commented Mar 15, 2024

gidden Mar 15, 2024

danielhuppmann commented Mar 15, 2024

implementation for fast categorize #819

implementation for fast categorize #819

Conversation

gidden commented Feb 27, 2024 • edited

Please confirm that this PR has done the following:

Description of PR

codecov bot commented Mar 12, 2024 • edited

Codecov Report

gidden commented Mar 12, 2024

glatterf42 commented Mar 12, 2024

glatterf42 commented Mar 12, 2024

glatterf42 left a comment • edited

Choose a reason for hiding this comment

glatterf42 left a comment • edited

Choose a reason for hiding this comment

gidden commented Mar 14, 2024

danielhuppmann commented Mar 14, 2024

danielhuppmann commented Mar 14, 2024

gidden commented Mar 15, 2024

gidden Mar 15, 2024

Choose a reason for hiding this comment

danielhuppmann commented Mar 15, 2024

gidden commented Feb 27, 2024 •

edited

codecov bot commented Mar 12, 2024 •

edited

glatterf42 left a comment •

edited

glatterf42 left a comment •

edited