Refactor: Remove Cython - Reduce tech debt #346

Open · wants to merge 85 commits into base: main

Conversation

Contributor

@eddiebergman eddiebergman commented Dec 18, 2023

This PR first converts everything to Python so that a type checker can be run across everything and linters can be used properly. It also includes updating everything to pytest.

Note: This PR is unreviewable. The strategy has been to minimally change tests while refactoring, to ensure that tested behaviour still passes. Some notable exceptions are where functionality was removed or the sampling strategy was changed and the expected values differ. Unfortunately, the formatter has rendered those tests unreviewable as well.

Update: All SMAC unit tests passed without code changes, so it seems mostly backwards compatible.

Some notable changes:

  • No more unbounded Normals
  • No more quantized hyperparameters
  • Sampling is 3x faster, neighbor generation 3.5x faster, and validation 2x faster (see scripts/benchmark_sampling.py)
  • More code re-use between classes
  • Pure Python: jumping to definition and viewing docstrings should work in editors
  • Defined data structures to capture the hierarchy in conditional spaces, moving the logic out of ConfigurationSpace
  • Better defined public API for ConfigurationSpace (see the sketch after this list)
  • Better support for custom hyperparameters, i.e. backed by scipy distributions (see BetaFloat/Int); should require a lot less code to do so
  • Better defined public API for hyperparameters
  • A whole load of deprecations with recommended alternatives
  • Categoricals/Ordinals support arbitrary objects, e.g. no more need to specify Category([True, False, "None"]) (now handled in the follow-up feat(Hyperparameters): Allow arbitrary objects in category ordinal #359)
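
As a rough illustration of the refactored public API mentioned above, here is a minimal sketch. The exact signatures of Float, Categorical and ConfigurationSpace.add are assumptions based on names appearing elsewhere in this PR, not a definitive reference.

from ConfigSpace import Categorical, ConfigurationSpace, Float

# Build a small space with the pure-Python implementation.
cs = ConfigurationSpace(seed=0)
cs.add(Float("learning_rate", bounds=(1e-4, 1e-1), log=True))  # `add` as used in the updated tests below
cs.add(Categorical("optimizer", ["adam", "sgd"]))

# Sampling goes through the (now roughly 3x faster) pure-Python path.
configs = cs.sample_configuration(100)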

TODO:


@eddiebergman eddiebergman changed the title from "Refactor: Update types for Cython" to "Refactor: Update types for Cython [In Progress]" on Dec 18, 2023

codecov bot commented Dec 19, 2023

Codecov Report

Attention: Patch coverage is 92.06843%, with 51 lines in your changes missing coverage. Please review.

Project coverage is 78.91%. Comparing base (93abc5b) to head (2a29120).
Report is 2 commits behind head on main.

❗ Current head 2a29120 differs from the pull request's most recent head f84cbd4. Consider uploading reports for the commit f84cbd4 to get more accurate results.

Files Patch % Lines
ConfigSpace/forbidden.py 82.35% 17 Missing and 1 partial ⚠️
ConfigSpace/conditions.py 93.04% 7 Missing and 1 partial ⚠️
ConfigSpace/configuration_space.py 64.28% 3 Missing and 2 partials ⚠️
ConfigSpace/hyperparameters/normal_float.py 88.09% 4 Missing and 1 partial ⚠️
ConfigSpace/hyperparameters/normal_integer.py 85.71% 4 Missing and 1 partial ⚠️
ConfigSpace/hyperparameters/categorical.py 90.90% 3 Missing ⚠️
ConfigSpace/hyperparameters/ordinal.py 91.42% 3 Missing ⚠️
ConfigSpace/hyperparameters/uniform_integer.py 96.29% 2 Missing ⚠️
ConfigSpace/hyperparameters/numerical.py 97.05% 1 Missing ⚠️
ConfigSpace/hyperparameters/uniform_float.py 96.96% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #346      +/-   ##
==========================================
+ Coverage   73.54%   78.91%   +5.37%     
==========================================
  Files          29       45      +16     
  Lines        2831     4767    +1936     
  Branches      629     1005     +376     
==========================================
+ Hits         2082     3762    +1680     
- Misses        632      806     +174     
- Partials      117      199      +82     

☔ View full report in Codecov by Sentry.

@eddiebergman eddiebergman changed the title from "Refactor: Update types for Cython [In Progress]" to "Refactor: Remove Cython - Reduce tech debt" on Mar 28, 2024
Contributor

@mfeurer mfeurer left a comment

And here's a first batch of comments. I skimmed roughly 40% so far, and will probably take a bit more time...

.github/workflows/pytest.yml (resolved)
.github/workflows/pytest.yml (resolved)
.pre-commit-config.yaml (resolved)
@@ -1,12 +1,12 @@
 [project]
 name = "ConfigSpace"
-version = "0.7.2"
+version = "0.8.0"
Contributor

Since this is a major change, how about we go up to 1.0?

Contributor Author

I'm good with that

pyproject.toml (resolved)
(np.array([]), (0, 10), 10, np.array([])),
],
)
def test_quantize(
Contributor

There's no test for the function that quantizes on a log scale. But as asked above, I'm not sure if that's actually still needed.

Contributor Author

Yeah, that's true, there are no tests for it. The parameters that use it are tested though, and I figured that's good enough.

"test_searchspaces",
"mini_autosklearn_original.pcs",
)
cs.seed(2)
Contributor

What's the reason to change the seeds here?

Contributor Author

Whoops, that was probably left in from debugging. There's no reason it should have been changed in the end.

all_neighbors = _test_get_one_exchange_neighbourhood(hp)
all_neighbors = [neighbor["a"] for neighbor in all_neighbors]
assert pytest.approx(np.mean(all_neighbors), abs=1e-2) == 5.44
assert pytest.approx(np.var(all_neighbors), abs=1e-2) == 9.14
Contributor

Do you have any idea why the variance is twice as high as in the previous implementation?

Contributor Author

So after some investigation, I found out why: the problem was with the test helper it calls.

The first part of _test_get_one_exchange_neighbourhood calculates a num_neighbors, which for this uniform int becomes 10.

In the second part, this value was previously never passed in, so the num_neighbors parameter went unused. In this PR I noticed that it wasn't used and must have passed it in, thinking I had accidentally removed it.

I'm not sure what the intended behaviour is:

  1. Do we remove the for loop calculating num_neighbors? In which case, I get something close to the previous mean and variance
  2. Do we pass it in? In which case the variance is ~9 as we compute a neighbor configuration for every possible integer.

As a side note, the parameter name is misleading as the num_neighbors parameter to get_one_exchange_neighbourhood is actually the number of neighbor configs per hyperparameter, not the total number of neighbors to retrieve.

The test relies on this function:

def _test_get_one_exchange_neighbourhood(hp):
    cs = ConfigurationSpace()
    num_neighbors = 0
    if not isinstance(hp, list):
        hp = [hp]

    for hp_ in hp:
        cs.add(hp_)
        if np.isinf(hp_.get_num_neighbors()):
            num_neighbors += 4
        else:
            num_neighbors += hp_.get_num_neighbors()

    cs.seed(1)
    config = cs.get_default_configuration()
    all_neighbors = []
    for i in range(100):
        neighborhood = get_one_exchange_neighbourhood(
            config,
            i,
            # Before: num_neighbors was left unset, which defaulted to 4
            # num_neighbors=4,
            # After updating: the test passes in the `num_neighbors` computed above
            num_neighbors=num_neighbors,
        )

Contributor Author

@mfeurer this needs a resolution.

all_neighbors = [neighbor["a"] for neighbor in all_neighbors]
assert hp.default_value == 3

assert pytest.approx(np.mean(all_neighbors), abs=1e-2) == 5.77
Contributor

Do you have any idea why this value is so much higher than in the previous implementation?

Contributor Author

See above

assert hp.default_value == 3

assert pytest.approx(np.mean(all_neighbors), abs=1e-2) == 5.77
assert pytest.approx(np.var(all_neighbors), abs=1e-2) == 8.39
Contributor

Similar as above, why would the variance increase so much here?

Contributor Author

And above

Contributor

@mfeurer mfeurer left a comment

Part 2: I finished looking at all the tests. Looks really great, I've only got a few questions this time.

wrong_shape_2 = np.array([3, 5, 7]).reshape(1, -1)
wrong_shape_3 = np.array([3, 5, 7]).reshape(-1, 1)

assert c1.pdf_values(point_1)[0] == pytest.approx(0.7559999999999997)
Contributor

Why did these five values increase by a factor of five?

Contributor Author

Ah yes, this I didn't know how to handle properly.

The previous implementation returned the pdf in the value space (even when considering vectorized values).

def _pdf(self, vector: np.ndarray) -> np.ndarray:
    """
    Computes the probability density function of the parameter in
    the transformed (and possibly normalized, depends on the parameter
    type) space. As such, one never has to worry about log-normal
    distributions, only normal distributions (as the inverse_transform
    in the pdf method handles these).

    Parameters
    ----------
    vector: np.ndarray
        the (N, ) vector of inputs for which the probability density
        function is to be computed.

    Returns
    -------
    np.ndarray(N, )
        Probability density values of the input vector
    """
    ub = self._inverse_transform(self.upper)
    lb = self._inverse_transform(self.lower)
    alpha = self.alpha
    beta = self.beta
    return spbeta(alpha, beta, loc=lb, scale=ub - lb).pdf(vector) \
        * (ub - lb) / (self._upper - self._lower)

You can see from the bottom of that function that it normalizes with respect to self.upper and self.lower, numbers in value space, in this case reducing it by a factor of 10 (and not 5 as you mentioned).

The newer implementation essentially has hyperparameters backed by some distribution in vectorized space, which has no idea what its vectorized values mean in value space. The new values reflect this in that they are not normalized by the lower and upper bounds in value space. I figured this was okay, as a pdf only has a relative meaning when compared to other pdfs, and none of them have this value-space normalization any more.

Here's the relevant snippet from the code. I did originally pass in the _pdf_norm to rescale things, but I figured it was better to have the pdf refer to the pdf in vectorized space and be agnostic to value space.

@dataclass
class ScipyContinuousDistribution(Distribution):
    rv: rv_continuous_frozen

    lower_vectorized: f64
    upper_vectorized: f64
    _max_density: float
    _pdf_norm: float = 1

    def sample_vector(
        self,
        n: int,
        *,
        seed: RandomState | None = None,
    ) -> Array[f64]:
        return self.rv.rvs(size=n, random_state=seed).astype(f64)  # type: ignore

    def max_density(self) -> float:
        return float(self._max_density)

    def pdf_vector(self, vector: Array[f64]) -> Array[f64]:
        pdf = self.rv.pdf(vector) / self._pdf_norm

        # By definition, we don't allow NaNs in the pdf
        return np.nan_to_num(pdf, nan=0)  # type: ignore

As a side benefit, this means it's first of all easier to implement new hyperparameters, as you do not need to consider the interplay of the pdf w.r.t. vectorized and value space, and secondly, if we just rely on the pdf returned by the underlying distribution, things become a lot more uniform in terms of implementation.
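
To make the scaling concrete, here is a small sketch of where the old factor comes from, assuming a value range of width 10 and a vectorized range of [0, 1]; the concrete numbers are illustrative, not taken from the test above.

from scipy.stats import beta as spbeta

a, b = 3.0, 2.0
lb, ub = 0.0, 1.0       # vectorized-space bounds
value_width = 10.0      # self.upper - self.lower in value space (illustrative)

rv = spbeta(a, b, loc=lb, scale=ub - lb)

new_pdf = rv.pdf(0.5)                            # this PR: pdf purely in vectorized space
old_pdf = rv.pdf(0.5) * (ub - lb) / value_width  # before: extra value-space normalization

# The reported values grow by exactly the width of the value range.
assert abs(new_pdf / old_pdf - value_width) < 1e-9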

test/test_hyperparameters.py (outdated, resolved)
src/ConfigSpace/api/types/float.py (outdated, resolved)
src/ConfigSpace/api/types/integer.py (outdated, resolved)
assert cs["algo2_subspace:algo2_param1"].default_value == "Y"


def test_acts_as_mapping_2():
Contributor

Hey, what's the reason to rename this test and add a number?

Contributor Author

There was an issue where two tests in the same file had the same name but belonged to different test classes. Converting to pytest and removing the classes caused it to fail, and I think this was me being lazy.

Renamed to test_configuration_space_acts_as_mapping and test_configuration_acts_as_mapping

test/test_configuration_space.py (outdated, resolved)
assert cond.vector_value == cond_.vector_value

# Test invalid conditions:
self.assertRaises(TypeError, EqualsCondition, hp2, "parent", 0)
Contributor

What's the reason to remove this check?

Contributor Author

Seems I removed these checks in the base Condition class when reworking them. I honestly find crowding every __init__ with several isinstance(param, ThingIThinkItToBe) checks that raise TypeError("Usually a long message to tell the user they called it wrong") to be a bit noisy, providing little value beyond what your editor/type checker can tell you before running anything. The exception is when the type doesn't convey the information, e.g. a float between 0 and 1.

It also raises the blurry question of which parameters of which objects we should check. If you're set on these being required, I can add the checks back in.
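
For context, a minimal sketch of the kind of constructor guard being discussed; the class, argument names and message here are illustrative, not the removed code itself.

from ConfigSpace.hyperparameters import Hyperparameter

class SomeCondition:
    def __init__(self, child, parent, value):
        # The style of check in question: validate argument types up front
        # and raise a TypeError with a long explanatory message.
        if not isinstance(parent, Hyperparameter):
            raise TypeError(
                f"Argument 'parent' is not an instance of Hyperparameter, got {type(parent)!r}",
            )
        self.child = child
        self.parent = parent
        self.value = value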

assert not condition.evaluate({"blkrest": "0"})
assert condition.evaluate({"blkrest": "1"})

def test_get_parents(self):
Contributor

Why did you drop this test?

Contributor Author

Stray casualty; I added it back in and it passes.

@sonovice

For everyone else stuck due to type errors etc.:
You can pull in the current state of this PR to your poetry env with

poetry add git+https://github.com/automl/ConfigSpace.git@refs/pull/346/merge

Might help in your case.
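
If you are not using poetry, the equivalent pip install from the same PR ref should also work (untested here, so treat it as a suggestion):

pip install "git+https://github.com/automl/ConfigSpace.git@refs/pull/346/merge"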
