Current status of missing values (in oblique trees) #263

Closed
vruusmann opened this issue Apr 25, 2024 · 4 comments · Fixed by #264

Comments

@vruusmann

This is a request for clarification.

The documentation for v0.7.X says that supervised trees (such as oblique decision trees) do not support missing values: https://github.com/neurodata/scikit-tree/blob/v0.7.0/doc/modules/supervised_tree.rst (jump to "Limitations compared to decision trees")

However, the following Python code runs without errors:

from sklearn.datasets import load_iris
from sktree.tree import ObliqueDecisionTreeClassifier

import numpy

iris_X, iris_y = load_iris(return_X_y = True, as_frame = True)

# Mask roughly 25% of the feature matrix entries with NaN (missing values)
iris_X = iris_X.mask(numpy.random.random(iris_X.shape) < .25)

classifier = ObliqueDecisionTreeClassifier()
classifier.fit(iris_X, iris_y)

pred_proba = classifier.predict_proba(iris_X)
print(pred_proba)

If missing values are really not supported, then I would expect the ObliqueDecisionTreeClassifier.fit(X, y) method to fail quickly and cleanly with an appropriate error message.
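For illustration, the kind of fail-fast behaviour I have in mind can be sketched with scikit-learn's public validation helper (this is just an assumption of what such a check could look like, not actual sktree code):

from sklearn.utils.validation import check_array

# force_all_finite=True (the default) rejects NaN and infinity up front,
# raising a ValueError that mentions NaN instead of silently fitting
check_array(iris_X, force_all_finite=True)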

The ObliqueTree.missing_go_to_left attribute is definitely set. However, its elements hold values that are not valid according to Scikit-Learn's missing-go-to-left conventions (i.e. all elements should be either 0 or 1, but the "active" values appear to be arbitrary 1-byte integers with values up to 128).
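A quick way to see this (assuming the attribute surfaces as tree_.missing_go_to_left on the fitted estimator from the snippet above; the exact access path is my guess):

import numpy

# under Scikit-Learn's convention this should only ever print 0 and/or 1
mgtl = numpy.asarray(classifier.tree_.missing_go_to_left)
print(numpy.unique(mgtl))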

I was experimenting with a setup where an "oblique projection feature" evaluated to a missing value whenever any of its input features was missing, and then scored the oblique tree using Scikit-Learn's missing-go-to-left algorithm. However, the predictions didn't agree, which suggests that Scikit-Tree is doing something differently.

TLDR: As of today, is it permitted to pass missing values into oblique tree-based estimators or not?

@adam2392
Collaborator

Missing values aren't supported yet, and after taking a look, I think this is a silent bug, so thanks for the report!

I will need to go through and enable an error message to be raised. Right now it fails silently because there is no check at the Python level. At the Cython level, I would guess that the NaNs are somehow being represented as infinity, so this is definitely erroneous.

Will submit a PR to fix this.

@adam2392
Collaborator

PR #264 should fix this issue. Let me know what you think @vruusmann

@vruusmann
Author

vruusmann commented Apr 26, 2024

Yes, the {"allow_nan": False} estimator tag is the correct way to signal the current status of missing value support.
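For anyone reading along, a minimal sketch of how that tag fits together with input validation in a scikit-learn-style estimator (hypothetical class and names, not the actual sktree code):

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y

class NoMissingValuesClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator that declares and enforces "no NaN input"."""

    def _more_tags(self):
        # advertise to check_estimator and downstream tooling that NaN is unsupported
        return {"allow_nan": False}

    def fit(self, X, y):
        # force_all_finite=True (the default) raises a ValueError on NaN,
        # which gives the fail-fast behaviour discussed earlier in this thread
        X, y = check_X_y(X, y, force_all_finite=True)
        # ... actual fitting would go here ...
        return self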

All fine by me! I don't call the shots here anyway.

@vruusmann
Author

Then again, enabling missing value support based on the (Oblique)Tree.missing_go_to_left attribute shouldn't be too complicated.

The thing that needs discussion is what to do when computing an "oblique feature" whose projection matrix row references one or more missing input values (e.g. consider a projection matrix row with three input features, two of them available and one missing); the two obvious options are listed below, with a small sketch after the list:

  • The "oblique feature" as a whole evaluates to a missing value when any of its components is missing.
  • The "oblique feature" skips over its missing components. If there is at least one component available, then it will be possible to proceed using Scikit-Learn's standard tree traversal algorithm. If all the components are missing, only then fall back to Scikit-Learn's "missing-go-to-left" algorithm.
