Add probabilistic classification to hiclass #119
base: main
Conversation
@@ -96,6 +105,8 @@ def __init__(
        If True, skip scikit-learn's checks and sample_weight passing for BERT.
    classifier_abbreviation : str, default=""
        The abbreviation of the local hierarchical classifier to be displayed during logging.
    calibration_method : {"ivap", "cvap", "platt", "isotonic"}, str, default=None
Maybe there is a better way to represent the multiple possible values here. Perhaps Union["ivap", "cvap", "platt", "isotonic"]? I am not so sure without building the documentation to see the result.
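For what it's worth, the numpydoc convention for an enumerated string parameter is a brace set combined with `or None`, rather than `Union[...]`. A minimal sketch (class name and description are only illustrative):

```python
class Example:
    """Illustrative estimator docstring in numpydoc style.

    Parameters
    ----------
    calibration_method : {"ivap", "cvap", "platt", "isotonic"} or None, default=None
        Method used to calibrate the local classifiers.
    """

    def __init__(self, calibration_method=None):
        self.calibration_method = calibration_method
```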
else:
    calibrators = Parallel(n_jobs=self.n_jobs)(
        delayed(logging_wrapper)(
Have you tested the parallel logging in the cluster? It used to be the case that messages were repeated multiple times.
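One common cause of repeated log messages under joblib is handler setup re-running in every worker, attaching a new handler each time. A hedged sketch of guarding against duplicate handlers (the function name is hypothetical, not hiclass code):

```python
import logging


def get_worker_logger(name="hiclass"):
    """Return a logger without stacking duplicate handlers.

    When joblib spawns or forks workers, re-running logging setup in
    each worker can attach one handler per run, so every message is
    printed once per handler. Guarding on existing handlers avoids
    the repetition. (Illustrative sketch, not the hiclass code.)
    """
    logger = logging.getLogger(name)
    if not logger.handlers:  # attach a handler only once per process
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.propagate = False  # avoid a second copy via the root logger
    return logger
```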
)
proba = calibrator.predict_proba(X)

y[:, 0] = calibrator.classes_[np.argmax(proba, axis=1)]
If you need the predictions, wouldn't it be better to use the already implemented predict method? I imagine it could simplify the code here and avoid redundancy.
@@ -5,6 +5,8 @@
from .LocalClassifierPerLevel import LocalClassifierPerLevel
from .LocalClassifierPerNode import LocalClassifierPerNode
from .LocalClassifierPerParentNode import LocalClassifierPerParentNode
from .LocalClassifierPerLevel import LocalClassifierPerLevel
Suggested change (remove the duplicated import):
-from .LocalClassifierPerLevel import LocalClassifierPerLevel
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    The calibration input samples. Internally, its dtype will be converted
    to ``dtype=np.float32``. If a sparse matrix is provided, it will be
    converted into a sparse ``csc_matrix``.
Suggested change:
-    converted into a sparse ``csc_matrix``.
+    converted into a sparse ``csr_matrix``.
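Assuming the implementation validates input with scikit-learn's `check_array`, the sparse format named in the docstring should match the one passed to `accept_sparse`. A small sketch showing a CSC input being converted to CSR:

```python
import numpy as np
from scipy.sparse import csc_matrix
from sklearn.utils import check_array

# check_array converts sparse input to the format named in accept_sparse,
# so the docstring should describe that format, not another one.
# (Assumes the implementation uses check_array; the actual call may differ.)
X = csc_matrix(np.eye(3, dtype=np.float32))
X_checked = check_array(X, accept_sparse="csr", dtype=np.float32)
```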
y_true = make_leveled(y_true)
y_true = classifier._disambiguate(y_true)
I am not sure I follow why `make_leveled` and `_disambiguate` need to be called here.
predecessor = list(self.classifier.hierarchy_.predecessors(node))[0]
except NetworkXError:
    # skip empty levels
    continue
Perhaps it would be better to have an if/else here, since catching the exception might hide real errors.
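A sketch of the suggested explicit check, using a toy `networkx` graph in place of hiclass's `hierarchy_` (node names are made up): testing `in_degree` before asking for predecessors skips genuinely empty levels without swallowing unrelated exceptions.

```python
import networkx as nx

# Toy stand-in for the classifier's hierarchy_ graph.
hierarchy = nx.DiGraph()
hierarchy.add_edge("root", "a")
hierarchy.add_node("orphan")  # stands in for a node on an empty level

for node in ["a", "orphan"]:
    # Explicit check instead of try/except NetworkXError: only the
    # "no predecessor" case is skipped; any other failure still raises.
    if hierarchy.in_degree(node) == 0:
        continue
    predecessor = next(iter(hierarchy.predecessors(node)))
```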
@@ -189,10 +350,82 @@ def test_predict_sparse(fitted_logistic_regression):
    assert_array_equal(ground_truth, prediction)


def test_predict_proba(fitted_logistic_regression):
Maybe you can use parametrize to reduce redundancy when the same tests appear in different files.
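A sketch of the suggestion: one parametrized test body shared across classifiers. The fake classifier below stands in for the real fixtures, which are assumed to live in the existing test suite.

```python
import numpy as np
import pytest


class FakeClassifier:
    """Minimal stand-in for a fitted hiclass classifier."""

    def predict_proba(self, X):
        # Return uniform probabilities over two classes for each sample.
        return np.full((len(X), 2), 0.5)


# One test body, parametrized over the classifiers under test; in the real
# suite the list entries would be the fitted classifier fixtures.
@pytest.mark.parametrize("classifier", [FakeClassifier(), FakeClassifier()])
def test_predict_proba_shape(classifier):
    proba = classifier.predict_proba(np.zeros((4, 3)))
    assert proba.shape == (4, 2)
    assert np.allclose(proba.sum(axis=1), 1.0)
```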
Why is this wrapper necessary?
positive_label = 1
unique_labels = np.unique(y)
assert len(unique_labels) <= 2

y = np.where(y == positive_label, 1, 0)
y = y.reshape(-1)  # make sure it's a 1D array
Maybe these lines can be replaced with the `binary_only` estimator tag: https://scikit-learn.org/stable/developers/develop.html#estimator-tags
Suggested change (remove these lines):
-positive_label = 1
-unique_labels = np.unique(y)
-assert len(unique_labels) <= 2
-y = np.where(y == positive_label, 1, 0)
-y = y.reshape(-1)  # make sure it's a 1D array
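A minimal sketch of the `binary_only` tag approach, assuming the `_more_tags` mechanism from the linked developer guide (newer scikit-learn versions use `__sklearn_tags__` instead; the class name is hypothetical):

```python
from sklearn.base import BaseEstimator, ClassifierMixin


class BinaryCalibrator(BaseEstimator, ClassifierMixin):
    """Illustrative estimator declaring it only supports binary targets.

    With this tag, scikit-learn's estimator checks enforce the two-class
    restriction, which could replace the manual np.unique assertion.
    """

    def _more_tags(self):
        return {"binary_only": True}
```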
Hi @LukasDrews97, just a quick request from someone from France who reached out to me via e-mail. Would it be possible to add a threshold to remove labels that have low probability?
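An illustrative sketch of what such a thresholding step could look like for one level of predictions (the function name and the empty-label convention are assumptions, not hiclass API):

```python
import numpy as np


def threshold_labels(labels, proba, threshold=0.5, empty_label=""):
    """Blank out predicted labels whose top probability is below threshold.

    labels : (n_samples,) predicted labels for one hierarchy level
    proba  : (n_samples, n_classes) calibrated class probabilities
    Samples whose maximum probability falls below the threshold get
    empty_label instead of their predicted label.
    """
    labels = np.asarray(labels, dtype=object).copy()
    top = proba.max(axis=1)  # confidence of the winning class per sample
    labels[top < threshold] = empty_label
    return labels


labels = threshold_labels(
    np.array(["a", "b"]),
    np.array([[0.9, 0.1], [0.55, 0.45]]),
    threshold=0.6,
)
```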
Add probabilistic classification via calibration to hiclass using the following methods: