FIX Raise error when missing-values encountered in scikit-tree trees #264

Merged
merged 5 commits into from
May 6, 2024
Changes from 3 commits
6 changes: 6 additions & 0 deletions doc/whats_new/v0.8.rst
@@ -13,6 +13,12 @@ Version 0.8
Changelog
---------

- |Fix| Previously, missing values in the ``X`` input array for sktree estimators
  did not raise an error: the estimators ran silently, assuming the missing values
  were encoded as infinity. This is now fixed, and the estimators raise a
  ``ValueError`` if missing values are encountered in the ``X`` input array.
  By `Adam Li`_ (:pr:`#264`)

Code and Documentation Contributors
-----------------------------------

4 changes: 4 additions & 0 deletions sktree/tree/_neighbors.py
@@ -64,3 +64,7 @@ def compute_similarity_matrix(self, X):
            The similarity matrix among the samples.
        """
        return compute_forest_similarity_matrix(self, X)

    def _more_tags(self):
        # XXX: no scikit-tree estimators support NaNs as of now

Scikit-Tree appears to be based on Scikit-Learn 1.3.X, which supports missing values in standalone decision tree models (DecisionTreeClassifier, DecisionTreeRegressor) by default. The support for missing values in random forest models (RandomForestClassifier and RandomForestRegressor) was added in 1.4.X.

In that regard, perhaps the _more_tags method should be base class-dependent (one tag for DecisionTree and another one for RandomForest)?

Collaborator Author

For now, scikit-tree will track each successive scikit-learn release very tightly, so we'll consider the next release, v0.8, to be tied to 1.4.x. Is that okay?

Right now, while there is no missing-value support, it would be okay to attach the same "disable all"-type _more_tags method to the tree ABC, so that it affects all tree and tree ensemble estimators.

The exact SkTree <-> SkLearn version matching is not critical until SkTree really starts enabling missing values.

        return {"allow_nan": False}
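
The two options raised in the thread can be sketched in plain Python: one "disable all" `_more_tags` on a shared base class, or a per-class override. All class names below are hypothetical stand-ins, not scikit-tree or scikit-learn classes, and the sketch does not go through scikit-learn's own tag-resolution machinery.

```python
class BaseTreeSketch:
    """Hypothetical shared base class carrying the 'disable all' tag."""

    def _more_tags(self):
        # Every subclass inherits this unless it overrides the method.
        return {"allow_nan": False}


class TreeSketch(BaseTreeSketch):
    """Hypothetical single-tree estimator; inherits allow_nan=False."""


class ForestSketch(BaseTreeSketch):
    """Hypothetical ensemble estimator; inherits allow_nan=False."""


class NanAwareTreeSketch(BaseTreeSketch):
    """Hypothetical future estimator that opts back in to NaN support."""

    def _more_tags(self):
        tags = super()._more_tags()
        tags["allow_nan"] = True
        return tags


allow_nan_by_class = {
    cls.__name__: cls()._more_tags()["allow_nan"]
    for cls in (TreeSketch, ForestSketch, NanAwareTreeSketch)
}
```

Putting the tag on the base class keeps the default in one place. A class-dependent tag only becomes necessary once some estimators handle missing values and others do not.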
23 changes: 22 additions & 1 deletion sktree/tree/tests/test_all_trees.py
@@ -3,7 +3,7 @@
import pytest
from numpy.testing import assert_almost_equal, assert_array_equal
from sklearn.base import is_classifier
-from sklearn.datasets import make_blobs
+from sklearn.datasets import load_iris, make_blobs
from sklearn.tree._tree import TREE_LEAF

from sktree.tree import (
@@ -162,3 +162,24 @@ def test_similarity_matrix(tree):

    assert np.allclose(sim_mat, sim_mat.T)
    assert np.all((sim_mat.diagonal() == 1))


@pytest.mark.parametrize("tree", ALL_TREES)
def test_missing_values(tree):
    """Smoke test to ensure that correct error is raised when missing values are present.

    xref: https://github.com/neurodata/scikit-tree/issues/263
    """
    rng = np.random.default_rng(123)

    iris_X, iris_y = load_iris(return_X_y=True, as_frame=True)

    # Replace entries with NaN wherever a standard-normal draw falls below 0.25
    # (roughly 60% of the feature matrix)
    iris_X = iris_X.mask(rng.standard_normal(iris_X.shape) < 0.25)

    classifier = tree()
    with pytest.raises(ValueError, match="Input X contains NaN"):
        if tree.__name__.startswith("Unsupervised"):
            classifier.fit(iris_X)
        else:
            classifier.fit(iris_X, iris_y)
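
A note on the mask in this test: `0.25` is a cut on the value of a standard-normal draw, not a fraction of entries. The expected share of masked entries is therefore the standard normal CDF at 0.25, about 0.6. The standard-library sketch below checks this both analytically and by simulation. It uses Python's `random` module rather than the NumPy generator in the test, so the individual draws differ from the test's.

```python
import random
from statistics import NormalDist

# Analytic fraction of standard-normal draws that fall below 0.25.
expected_fraction = NormalDist().cdf(0.25)

# Monte Carlo check with a seeded stdlib generator (not NumPy's default_rng).
rng = random.Random(123)
n_draws = 100_000
masked = sum(rng.gauss(0.0, 1.0) < 0.25 for _ in range(n_draws))
simulated_fraction = masked / n_draws

print(f"analytic:  {expected_fraction:.4f}")
print(f"simulated: {simulated_fraction:.4f}")
```

For a smoke test the exact fraction hardly matters, since a single NaN is enough to trigger the `ValueError`.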