Type estimators #1542

Open
wants to merge 105 commits into
base: development

eddiebergman
Contributor

@eddiebergman eddiebergman commented Jul 16, 2022

This is a pretty big PR aimed at doing a simple thing: removing estimators.py and automl.py from the mypy ignore list. It is still in progress, and notes on the changes are TODO. I'll resolve conflicts once v0.15 is out. I can also split this into multiple smaller PRs to make it easier to review if needed.

Tests still need to be updated to accommodate the changes.

There were 168 typing errors :) Some of them were actual possible bugs, caused by the order in which things are called and parameters are set.

Major points:

class AutoMLRegressor(AutoML, RegressorMixin):
    _task_mappings = {...}
    is_classification = False
    
class AutoMLClassifier(AutoML, ClassifierMixin):
    def predict(...): ...
    def predict_proba(...): ...

Made the AutoSklearnEstimator smarter with respect to types in a similar fashion. Notably, it is smarter about what it returns, through the use of a Generic in the main class with the concrete types provided in the subclasses. This mainly means that code editors will know whether predict_proba is available, and that fit returns the right estimator rather than just the abstract AutoSklearnEstimator.

Self = TypeVar("Self", bound="AutoSklearnEstimator")

TParetoModel = TypeVar("TParetoModel", VotingClassifier, VotingRegressor)
TAutoML = TypeVar("TAutoML", bound=AutoML)

class AutoSklearnEstimator(ABC, BaseEstimator, Generic[TAutoML, TParetoModel]):

    # Knows it returns the same type as self, AutoSklearnClassifier or AutoSklearnRegressor
    def fit(self: Self, ...) -> Self: ...

    # Knows if it's an AutoMLClassifier or an AutoMLRegressor
    def automl(self) -> TAutoML: ...

    # Knows that the pareto models are VotingClassifier/VotingRegressor
    def get_pareto_set(self) -> Sequence[TParetoModel]: ...

These are then specified in the subclass as

class AutoSklearnClassifier(AutoSklearnEstimator[AutoMLClassifier, VotingClassifier]): ...
class AutoSklearnRegressor(AutoSklearnEstimator[AutoMLRegressor, VotingRegressor]): ...
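
To make the Generic pattern above concrete, here is a minimal, self-contained sketch. It is not the actual auto-sklearn code: the `Estimator`, `Classifier` and `Regressor` names, the constructor arguments, and the empty stand-in classes for `AutoML` and the sklearn voting models are all assumptions made so the snippet runs on its own.

```python
from __future__ import annotations

from abc import ABC
from typing import Generic, Sequence, TypeVar


# Hypothetical stand-ins so the sketch has no third-party dependencies.
class AutoML: ...
class AutoMLClassifier(AutoML): ...
class AutoMLRegressor(AutoML): ...
class VotingClassifier: ...
class VotingRegressor: ...


Self = TypeVar("Self", bound="Estimator")
TAutoML = TypeVar("TAutoML", bound=AutoML)
TParetoModel = TypeVar("TParetoModel", VotingClassifier, VotingRegressor)


class Estimator(ABC, Generic[TAutoML, TParetoModel]):
    def __init__(self, automl: TAutoML, pareto: Sequence[TParetoModel]) -> None:
        self._automl = automl
        self._pareto = pareto

    # Returns the concrete subclass type, not just `Estimator`.
    def fit(self: Self) -> Self:
        return self

    # The type checker resolves this to the AutoML type named by the subclass.
    @property
    def automl(self) -> TAutoML:
        return self._automl

    def get_pareto_set(self) -> Sequence[TParetoModel]:
        return self._pareto


class Classifier(Estimator[AutoMLClassifier, VotingClassifier]):
    # Only the classifier exposes predict_proba.
    def predict_proba(self) -> str:
        return "probabilities"


class Regressor(Estimator[AutoMLRegressor, VotingRegressor]): ...


clf = Classifier(AutoMLClassifier(), [VotingClassifier()]).fit()
reg = Regressor(AutoMLRegressor(), [VotingRegressor()]).fit()
```

With this layout a type checker should infer `clf` as `Classifier` (so `clf.predict_proba()` is accepted) and `reg.automl` as `AutoMLRegressor`, while `reg.predict_proba()` would be flagged as an unknown attribute.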

  • Many things that are set in fit are now wrapped in a property, e.g. self._logger or self._task, which raises a NotFittedError as sklearn would. This is because using them directly in other methods correctly triggers a warning along the lines of "self._task could be None" when those methods rely on fit having been called first.
class AutoML:

    @property
    def task(self) -> int:
        if self._task is None:
            raise NotFittedError("`task` has not been set, please call `fit` first")

        return self._task
        
    @property
    def input_validator(self) -> InputValidator:
        if self._input_validator is None:
            raise NotFittedError(
                "`input_validator` has not been set, please call `fit` first"
            )

        return self._input_validator
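
As a rough illustration of why this helps, the sketch below shows the same pattern on a toy class. `TinyAutoML`, its `describe` method and the local `NotFittedError` are hypothetical; the local exception is only a stand-in for `sklearn.exceptions.NotFittedError` so the snippet has no dependencies.

```python
from __future__ import annotations

from typing import Optional


class NotFittedError(ValueError):
    """Stand-in for sklearn.exceptions.NotFittedError."""


class TinyAutoML:
    def __init__(self) -> None:
        # Unset until `fit` is called, hence Optional.
        self._task: Optional[int] = None

    @property
    def task(self) -> int:
        # Narrow Optional[int] to int in one place, failing loudly if unfitted.
        if self._task is None:
            raise NotFittedError("`task` has not been set, please call `fit` first")
        return self._task

    def fit(self, task: int) -> TinyAutoML:
        self._task = task
        return self

    def describe(self) -> str:
        # `self.task` is typed as int here, so no None check is needed.
        return f"task id {self.task}"


automl = TinyAutoML()
try:
    automl.describe()
    raised = False
except NotFittedError:
    raised = True

description = automl.fit(task=1).describe()
```

Methods that read `self.task` no longer need their own `None` handling, and calling them before `fit` raises a clear error instead of failing later on a `None` value.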

  • Made transform and get_cost_of_crash smarter with @overload; each now knows its return type based on the type of its input.
@overload  # Note the type for y is None, indicating None is returned
def transform(self, X: XType, y: None = None) -> tuple[XType, None]: ...

@overload  # And here, it's different
def transform(self, X: XType, y: YType) -> tuple[XType, YType]: ...

def transform(self, X: XType, y: YType | None = None) -> tuple[XType, YType | None]:
    ...
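
A runnable version of the same idea, using plain lists instead of the real `XType`/`YType` aliases (the list types and the doubling of `X` are assumptions purely for illustration):

```python
from __future__ import annotations

from typing import List, Optional, Tuple, overload


@overload  # y omitted or None: the second element is known to be None
def transform(X: List[float], y: None = None) -> Tuple[List[float], None]: ...


@overload  # y given: the second element is known to be a list of ints
def transform(X: List[float], y: List[int]) -> Tuple[List[float], List[int]]: ...


def transform(
    X: List[float], y: Optional[List[int]] = None
) -> Tuple[List[float], Optional[List[int]]]:
    # Single runtime implementation; the overloads above only guide the type checker.
    X_out = [value * 2 for value in X]
    y_out = list(y) if y is not None else None
    return X_out, y_out


only_x = transform([1.0, 2.0])
both = transform([1.0, 2.0], [0, 1])
```

A type checker should see `only_x[1]` as `None` and `both[1]` as `List[int]`, so callers that pass `y` do not need to guard against `None` afterwards.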

  • Ran pyupgrade on a few files I touched

  • Simplified the dataset compression logic into a class; the typing caught some weirdness when dataset compression was on but memory_limit wasn't set.
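
The snippet below is a purely hypothetical sketch of how a small class can rule out the "compression on but no memory limit" state. `CompressionSpec`, `make_spec`, `budget_mb` and the `memory_allocation` default are invented for illustration and are not the class added in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompressionSpec:
    # Required int, so a spec can never exist without a memory limit.
    memory_limit_mb: int
    # Hypothetical fraction of the limit the dataset may occupy.
    memory_allocation: float = 0.1

    def budget_mb(self) -> float:
        return self.memory_limit_mb * self.memory_allocation


def make_spec(enabled: bool, memory_limit_mb: Optional[int]) -> Optional[CompressionSpec]:
    # Compression disabled: there is nothing to configure.
    if not enabled:
        return None
    # Compression enabled without a limit is the inconsistent state; reject it early.
    if memory_limit_mb is None:
        raise ValueError("dataset compression requires `memory_limit` to be set")
    return CompressionSpec(memory_limit_mb=memory_limit_mb)


disabled = make_spec(enabled=False, memory_limit_mb=None)
spec = make_spec(enabled=True, memory_limit_mb=1000)
```

Because `memory_limit_mb` is a plain `int` on the class, any code holding a `CompressionSpec` can use the limit without an `Optional` check, and the invalid combination fails at construction time.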

mfeurer and others added 30 commits November 17, 2021 14:39
* only active if kernel == 'poly'
* adapt the metadata to reflect this
* black checker

* Simplified

* add examples to black format check

Co-authored-by: Matthias Feurer <feurerm@informatik.uni-freiburg.de>
* re-structure manual and use 'collapse'

* ADD link to auto-sklearn-talks

* unifying titles

* Clarify default memory and cpu usage

* FIX sphinx_gallery to <=0.10.0

0.10.1 would raise an error for '-D plot_gallery=0'

* Re-structure faq

* FIX comments by mfeurer

* boldface items

* merge manual into FAQ

* FIX minor

* FIX typo

* Update doc/faq.rst

Co-authored-by: Eddie Bergman <eddiebergmanhs@gmail.com>

* Update doc/faq.rst

Co-authored-by: Eddie Bergman <eddiebergmanhs@gmail.com>

* Update doc/faq.rst

Co-authored-by: Eddie Bergman <eddiebergmanhs@gmail.com>

* Update doc/faq.rst

Co-authored-by: Eddie Bergman <eddiebergmanhs@gmail.com>

* Update doc/manual.rst

Co-authored-by: Eddie Bergman <eddiebergmanhs@gmail.com>

* Update doc/manual.rst

Co-authored-by: Eddie Bergman <eddiebergmanhs@gmail.com>

* Update doc/faq.rst

Co-authored-by: Eddie Bergman <eddiebergmanhs@gmail.com>

* FIX link

Co-authored-by: Eddie Bergman <eddiebergmanhs@gmail.com>
If you're only exposure to using... -> If your only exposure to using...
* np.bool deprecation

* Invalid escape sequence \_

* Series specify dtype

* drop na requires keyword args deprecation

* unspecified np.int size deprecated, use int instead

* deprecated unspeicifed np.int precision

* Element wise comparison failed, will raise error in the future

* Specify explicit dtype for empty series

* metric warnings for mismatch between y_pred and y_true label count

* Quantile transformer n_quantiles larger than n_samples warning ignored

* Silenced convergence warnings

* pass sklearn args as keywords

* flake8'd

* flake8'd

* Fixed CategoricalImputation not accounting for sparse matrices

* Updated to use distro for linux distribution

* Ignore convergence warnings for gaussian process regressor

* Averaging metrics now use zero_division parameter

* Readded scorers to module scope

* flake8'd

* Fix

* Fixed dtype for metalearner no run

* Catch gaussian process iterative fit warning

* Moved ignored warnings to tests

* Correctly type pd.Series

* Revert back to usual iterative fit

* Readded missing iteration increment

* Removed odd backslash

* Fixed imputer for sparse matrices

* Ignore warnings we are aware about in tests

* Flake'd:

* Revert "Fixed imputer for sparse matrices"

This reverts commit 05675ad.

* Revert "Revert "Fixed imputer for sparse matrices""

This reverts commit d031b0d.

* Back to default values

* Reverted to default behaviour with comment

* Added xfail test to document

* flaked

* Fixed test, moved to np.testing for assertion

* Update autosklearn/pipeline/components/data_preprocessing/categorical_encoding/encoding.py

Co-authored-by: Matthias Feurer <feurerm@informatik.uni-freiburg.de>

Co-authored-by: Matthias Feurer <feurerm@informatik.uni-freiburg.de>
* Added manual dispatch to tests

* Removed parameters to manual dispatch
…tors (#1332)

* Update docstrings and types

* doc typo fix

* flake'd
* added python 3.10 to versions

* Added quotes around versions

* Trigger tests
* Add submodule

* Port to abstract_ensemble, backend from automl_common

* Updated workflow files

* Update imports

* Trigger actions

* Another import fix

* update import

* m

* Backend fixes

* Backend parameter update

* fixture fix for backend

* Fix tests

* readd old abstract ensemble for now

* flake8'd

* Added install from source to readme

* Moved installation w.r.t submodules to the docs

* Temporarily remove submodule

* Readded submodule

* Updated to use automl_common under autosklearn

* Updated MANIFEST

* Removed uneeded statements from MANIFEST

* Fixed import

* Fixed comment line in MANIFEST.in

* Added automl_common/setup.py to MANIFEST

* Added prefix to script

* Re-added removed title #

* Added note for submodule for CONTRIBUTING

* Made the submodule step a bit more clear for contributing.md

* CONTRIBUTING fixes
* Added versioning for sphinx, docutils - introduced by sphinxtoolbox

* Fixed bug with config value for `plot_gallery` in doc makefile

* Update linkcheck command as well
* Added ignored_warnings file

* Use ignored_warnings file

* Test regressors with 1d, 1d as 2d and 2d targets

* Flake'd

* Fix broken relative imports to ignore_warnings

* Removed print and updated parameter type for tests

* Type import fix
* Added random state to classifiers

* Added some doc strings

* Removed random_state again

* flake'd

* Fix some test issues

* Re-added seed to test

* Updated test doc for unknown test

* flake'd
* Added ignored_warnings file

* Use ignored_warnings file

* Test regressors with 1d, 1d as 2d and 2d targets

* Flake'd

* Fix broken relative imports to ignore_warnings

* Removed print and updated parameter type for tests

* Added warning catches to fit methods in tests

* Added more warning catches

* Flake'd

* Created top-level module to allow relativei imports

* Deleted blank line in __init__

* Remove uneeded ignore warnings from tests

* Fix bad indent

* Fix github merge conflict editor whitespaces and indents
* update workflow files

* typo fix

* Update pytest

* remove bad semi-colon

* Fix test runner command

* Remove explicit steps required from older version

* Explicitly add Conda python to path for subprocess command in test

* Fix the mypy compliance check

* Added PEP 561 compliance

* Add py.typed to MANIFEST for dist

* Remove py.typed from setup.py
* rename OSX -> macOS as it is the new name

rename OSX -> macOS as it is the new name for the operating system. e.g. see https://www.apple.com/macos

* Update doc/installation.rst

Co-authored-by: Matthias Feurer <lists@matthiasfeurer.de>

* Update doc/installation.rst

Co-authored-by: Matthias Feurer <lists@matthiasfeurer.de>

Co-authored-by: Matthias Feurer <feurerm@informatik.uni-freiburg.de>
Co-authored-by: Matthias Feurer <lists@matthiasfeurer.de>
…semble (#1321)

* Changed show_models() function to return a dictionary of models in the ensemble instead of a string
* Remove flaky dep

* Remove unused pytest import
* Fix: MLPRegressor tests

* Fix: Ordering of statements in test

* Fix: MLP n_calls
* Fix: Raises errors with the config

* Add: Skip error for kernal_pca

Seems kernel_pca emits the error:
* `"zero-size array to reduction operation maximum which has no identity"`

This is gotten on the line `max_eig = lambdas.max()` which makes me
assume it emits a matrix with no real eigen values, not something we
can really control for
…ures (#1250)

* Moved to new splitter, moved to util file

* flake8'd

* Fixed errors, added test specifically for CustomStratifiedShuffleSplit

* flake8'd

* Updated docstring

* Updated types in docstring

* reduce_dataset_size_if_too_large supports more types

* flake8'd

* flake8'd

* Updated docstring

* Seperated out the data subsampling into individual functions

* Improved typing from Automl.fit to reduce_dataset_size_if_too_large

* flak8'd

* subsample tested

* Finished testing and flake8'd

* Cleaned up transform function that was touched

* ^

* Removed double typing

* Cleaned up typing of convert_if_sparse

* Cleaned up splitters and added size test

* Cleanup doc in data

* rogue line added was removed

* Test fix

* flake8'd

* Typo fix

* Fixed ordering of things

* Fixed typing and tests of target_validator fit, transform, inv_transform

* Updated doc

* Updated Type return

* Removed elif gaurd

* removed extraneuous overload

* Updated return type of feature validator

* Type fixes for target validator fit

* flake8'd

* Fixed err message str and automl sparse y tests

* Flak8'd

* Fix sort indices

* list type to List

* Remove uneeded comment

* Updated comment to make it more clear

* Comment update

* Fixed warning message for reduce_dataset_if_too_large

* Fix test

* Added check for error message in tests

* Test Updates

* Fix error msg

* reinclude csr y to test

* Reintroduced explicit subsample values test

* flaked

* Missed an uncomment

* Update the comment for test of splitters

* Updated warning message in CustomSplitter

* Update comment in test

* Update tests

* Removed overloads

* Narrowed type of subsample

* Removed overload import

* Fix `todense` giving np.matrix, using `toarray`

* Made subsampling a little less aggresive

* Changed multiplier back to 10

* Allow argument to specfiy how auto-sklearn handles compressing dataset size  (#1341)

* Added dataset_compression parameter and validation

* Fix docstring

* Updated docstring for `resampling_strategy`

* Updated param def and memory_allocation can now be absolute

* insert newline

* Fix params into one line

* fix indentation in docs

* fix import breaks

* Allow absolute memory_allocation

* Tests

* Update test on for precision omitted from methods

* Update test for akslearn2 with same args

* Update to use TypedDict for better Mypy parsing

* Added arg to asklearn2

* Updated tests to remove some warnings

* flaked

* Fix broken link?

* Remove TypedDict as it's not supported in Python3.7

* Missing import

* Review changes

* Fix magic mock for python < 3.9

* Fixed bad merge
* commit meta learning data bases

* commit changed files

* commit new files

* fixed experimental settings

* implemented last comments on old PR

* adapted metalearning to last commit

* add a text preprocessing example

* intigrated feedback

* new changes on *.csv files

* reset changes

* add changes for merging

* add changes for merging

* add changes for merging

* try to merge

* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)

* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)

* fixed string representation for metalearning (some sort of hot fix, maybe this needs to be fixed in a bigger scale)

* init

* init

* commit changes for text preprocessing

* text prepreprocessing commit

* fix metalearning

* fix metalearning

* adapted test to new text feature

* fix style guide issues

* integrate PR comments

* integrate PR comments

* implemented the comments to the last PR

* fitted operation is not in place therefore we have to assgin the fitted self.preprocessor again to it self

* add first text processing tests

* add first text processing tests

* including comments from 01.25.

* including comments from 01.28.

* including comments from 01.28.

* including comments from 01.28.

* including comments from 01.31.
eddiebergman and others added 8 commits June 17, 2022 14:26
* Init commit

* Fix logging server cleanup (#1503)

* Fix logging server cleanup

* Add comment relating to the `try: finally:`

* Remove nested try: except: from `fit`

* Bump peter-evans/find-comment from 1 to 2 (#1520)

Bumps [peter-evans/find-comment](https://github.com/peter-evans/find-comment) from 1 to 2.
- [Release notes](https://github.com/peter-evans/find-comment/releases)
- [Commits](peter-evans/find-comment@v1...v2)

---
updated-dependencies:
- dependency-name: peter-evans/find-comment
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump actions/stale from 4 to 5 (#1521)

Bumps [actions/stale](https://github.com/actions/stale) from 4 to 5.
- [Release notes](https://github.com/actions/stale/releases)
- [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md)
- [Commits](actions/stale@v4...v5)

---
updated-dependencies:
- dependency-name: actions/stale
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Init commit

* Update evaluation module

* Clean up other occurences of the word validation

* Re-add test for test predictions

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add debug statements and 30s timeouts

* Fix formatting

* Update internal timeout param

* +timeout, use allocated tmpdir

* +timeout, use allocated tmpdir

* Remove another occurence of explicit `tmp`

* Increase timelimits once again

* Remove incomplete comment
* Init commit

* Fix DummyClassifiers in _load_pareto_set

* Add test for dummy only in classifiers

* Update no ensemble docstring

* Add automl case where automl only has dummy

* Remove tmp file

* Fix `include` statement to be regressor
* Create PR

* Update MLP regressor values
* Make docker file install from `setup.py`

* Add pytest cache to gitignore

* Up timeouts on test_metadata_generation
* Bump docker/build-push-action from 1 to 3

Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 1 to 3.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v1...v3)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update docker-publish.yml

Replace password by token

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matthias Feurer <feurerm@informatik.uni-freiburg.de>
* Create PR

* Abstract out dask client types

* Fix _ issue

* Extend scope of dask_client in automl.py

* Add docstring to dask module

* Indent result addition

* Add basic tests for Dask wrappers
@eddiebergman eddiebergman added the maintenance Internal maintenance label Jul 16, 2022
@eddiebergman eddiebergman added this to the v0.16 milestone Jul 16, 2022
@eddiebergman eddiebergman self-assigned this Jul 16, 2022
Labels
maintenance Internal maintenance

9 participants