Current status of missing values (in oblique trees) #263

Closed
vruusmann opened this issue Apr 25, 2024 · 4 comments · Fixed by #264

Comments

@vruusmann

This is a request for clarification.

The documentation for v0.7.X says that supervised trees (such as oblique decision trees) do not support missing values: https://github.com/neurodata/scikit-tree/blob/v0.7.0/doc/modules/supervised_tree.rst (jump to "Limitations compared to decision trees")

However, the following Python code runs without errors:

from sklearn.datasets import load_iris
from sktree.tree import ObliqueDecisionTreeClassifier

import numpy

iris_X, iris_y = load_iris(return_X_y = True, as_frame = True)

# Mask roughly 25% of the feature matrix entries with NaN (missing values)
iris_X = iris_X.mask(numpy.random.random(iris_X.shape) < .25)

classifier = ObliqueDecisionTreeClassifier()
classifier.fit(iris_X, iris_y)

pred_proba = classifier.predict_proba(iris_X)
print(pred_proba)

If missing values are really not supported, then I would expect the ObliqueDecisionTreeClassifier.fit(X, y) method to fail quickly and cleanly with an appropriate error message.
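For illustration, the kind of fail-fast behaviour I have in mind can be sketched with scikit-learn's public validation helper (this is just an assumption of what such a check could look like, not actual sktree code):

from sklearn.utils.validation import check_array

# force_all_finite=True (the default) rejects NaN and infinity up front,
# raising a ValueError that mentions NaN instead of silently fitting
check_array(iris_X, force_all_finite=True)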

The ObliqueTree.missing_go_to_left attribute is definitely set. However, its elements hold values that are not valid according to Scikit-Learn's missing-go-to-left conventions (i.e. all elements should be either 0 or 1, but the "active" values appear to be arbitrary 1-byte integers with values up to 128).
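A quick way to see this (assuming the attribute surfaces as tree_.missing_go_to_left on the fitted estimator from the snippet above; the exact access path is my guess):

import numpy

# under Scikit-Learn's convention this should only ever print 0 and/or 1
mgtl = numpy.asarray(classifier.tree_.missing_go_to_left)
print(numpy.unique(mgtl))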

I was experimenting with a setup where an "oblique projection feature" evaluated to a missing value whenever any of its input features was missing, and then scored the oblique tree using Scikit-Learn's missing-go-to-left algorithm. However, the predictions didn't agree, which suggests that Scikit-Tree is doing something differently.

TLDR: As of today, is it permitted to pass missing values into oblique tree-based estimators or not?

@adam2392
Collaborator

Missing values aren't supported yet, and after taking a look, I think this is a silent bug, so thanks for the report!

I will need to go through and enable an error message to be raised. Right now it fails silently because there is no check at the Python level. At the Cython level, I would guess that the NaNs are somehow being represented as infinity, so this is definitely erroneous.

Will submit a PR to fix this.

@adam2392
Collaborator

PR #264 should fix this issue. Let me know what you think @vruusmann

@vruusmann
Author

vruusmann commented Apr 26, 2024

Yes, the {"allow_nan": False} estimator tag is the correct way to signal the current status of missing value support.
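For anyone reading along, a minimal sketch of how that tag fits together with input validation in a scikit-learn-style estimator (hypothetical class and names, not the actual sktree code):

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y

class NoMissingValuesClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator that declares and enforces "no NaN input"."""

    def _more_tags(self):
        # advertise to check_estimator and downstream tooling that NaN is unsupported
        return {"allow_nan": False}

    def fit(self, X, y):
        # force_all_finite=True (the default) raises a ValueError on NaN,
        # which gives the fail-fast behaviour discussed earlier in this thread
        X, y = check_X_y(X, y, force_all_finite=True)
        # ... actual fitting would go here ...
        return self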

All fine by me! I don't call the shots here anyway.

@vruusmann
Author

Then again, enabling missing value support based on the (Oblique)Tree.missing_go_to_left attribute shouldn't be too complicated.

The thing that needs discussion is what to do when computing an "oblique feature" whose projection matrix row references one or more missing input values (e.g. consider a projection matrix row with three input features, two of them available and one missing); the two obvious options are listed below, with a small sketch after the list:

  • The "oblique feature" as a whole evaluates to a missing value when any of its components is missing.
  • The "oblique feature" skips over its missing components. If there is at least one component available, then it will be possible to proceed using Scikit-Learn's standard tree traversal algorithm. If all the components are missing, only then fall back to Scikit-Learn's "missing-go-to-left" algorithm.
