
Add probabilistic classification to hiclass #119

Open · wants to merge 59 commits into base: main

Conversation

LukasDrews97 (Collaborator):

Add probabilistic classification via calibration to hiclass using the following methods:

  • Platt Scaling
  • Isotonic Regression
  • Beta calibration
  • (Inductive/Cross) Venn-ABERS calibration
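
A rough usage sketch of the calibration API, pieced together from the diff excerpts below; the ``calibration_method`` parameter and the ``calibrate``/``predict_proba`` methods are assumptions based on this PR and may differ from the final merged interface:

    from sklearn.linear_model import LogisticRegression
    from hiclass import LocalClassifierPerNode

    # Toy data: each label row is a path from root to leaf.
    X_train = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    y_train = [
        ["Animal", "Mammal"],
        ["Animal", "Mammal"],
        ["Animal", "Reptile"],
        ["Animal", "Reptile"],
    ]

    clf = LocalClassifierPerNode(
        local_classifier=LogisticRegression(),
        calibration_method="ivap",  # one of "ivap", "cvap", "platt", "isotonic"
    )
    clf.fit(X_train, y_train)
    clf.calibrate(X_train, y_train)     # assumed calibration entry point added by this PR
    proba = clf.predict_proba(X_train)  # calibrated probabilities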

@@ -96,6 +105,8 @@ def __init__(
If True, skip scikit-learn's checks and sample_weight passing for BERT.
classifier_abbreviation : str, default=""
The abbreviation of the local hierarchical classifier to be displayed during logging.
calibration_method : {"ivap", "cvap", "platt", "isotonic"}, str, default=None
Collaborator:

Maybe there is a better way to represent the multiple possible values here. Perhaps Union["ivap", "cvap", "platt", "isotonic"]? I am not sure without building the documentation to see how it renders.
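
For the docstring itself, the brace notation ``{"ivap", "cvap", "platt", "isotonic"}`` is the numpydoc convention for a fixed set of allowed values, so it should render correctly. If the goal is instead a type annotation in the signature, ``typing.Literal`` is one option; a minimal sketch with an illustrative stub class (not the actual hiclass class):

    from typing import Literal, Optional

    class LocalClassifier:  # illustrative stub only
        def __init__(
            self,
            calibration_method: Optional[
                Literal["ivap", "cvap", "platt", "isotonic"]
            ] = None,
        ) -> None:
            self.calibration_method = calibration_method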


else:
calibrators = Parallel(n_jobs=self.n_jobs)(
delayed(logging_wrapper)(
Collaborator:

Have you tested the parallel logging in the cluster? It used to be the case that messages were repeated multiple times.

)
proba = calibrator.predict_proba(X)

y[:, 0] = calibrator.classes_[np.argmax(proba, axis=1)]
Collaborator:

If you need to use the predictions, wouldn't it be better to call the already implemented predict method? I imagine it could simplify the code here and avoid redundancy.
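
A self-contained sanity check of that equivalence, assuming the calibrator follows the scikit-learn convention that ``predict()`` returns ``classes_[argmax(predict_proba)]``:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_true = np.array([0, 0, 1, 1])
    calibrator = LogisticRegression().fit(X, y_true)

    # Manual argmax over predict_proba, as in the diff excerpt above ...
    proba = calibrator.predict_proba(X)
    manual = calibrator.classes_[np.argmax(proba, axis=1)]

    # ... matches the estimator's own predict() for scikit-learn-style classifiers.
    assert np.array_equal(manual, calibrator.predict(X))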

@@ -5,6 +5,8 @@
from .LocalClassifierPerLevel import LocalClassifierPerLevel
from .LocalClassifierPerNode import LocalClassifierPerNode
from .LocalClassifierPerParentNode import LocalClassifierPerParentNode
from .LocalClassifierPerLevel import LocalClassifierPerLevel
Collaborator:

Suggested change
from .LocalClassifierPerLevel import LocalClassifierPerLevel

X : {array-like, sparse matrix} of shape (n_samples, n_features)
The calibration input samples. Internally, its dtype will be converted
to ``dtype=np.float32``. If a sparse matrix is provided, it will be
converted into a sparse ``csc_matrix``.
Collaborator:

Suggested change
converted into a sparse ``csc_matrix``.
converted into a sparse ``csr_matrix``.

Comment on lines +265 to +266
y_true = make_leveled(y_true)
y_true = classifier._disambiguate(y_true)
Collaborator:

I am not sure I follow why make_leveled and _disambiguate need to be called here.

predecessor = list(self.classifier.hierarchy_.predecessors(node))[0]
except NetworkXError:
# skip empty levels
continue
Collaborator:

Perhaps it would be better to use an if/else here, since the try/except might hide real errors.
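
A minimal sketch of that suggestion using a plain networkx graph (the graph and node names are illustrative, not the actual hiclass attributes):

    import networkx as nx

    hierarchy = nx.DiGraph([("root", "a"), ("a", "b")])

    for node in ["root", "a", "b", "missing"]:
        # Explicit membership check instead of catching NetworkXError,
        # so unrelated exceptions are not silently swallowed.
        if node not in hierarchy:
            continue  # skip empty levels
        predecessors = list(hierarchy.predecessors(node))
        predecessor = predecessors[0] if predecessors else None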

@@ -189,10 +350,82 @@ def test_predict_sparse(fitted_logistic_regression):
assert_array_equal(ground_truth, prediction)


def test_predict_proba(fitted_logistic_regression):
Collaborator:

Maybe you can use pytest.mark.parametrize to reduce redundancy when the same tests appear in different files.
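
A sketch of what the parametrized version could look like; the ``calibration_method`` argument and the ``calibrate``/``predict_proba`` calls are assumptions based on the API sketched in this PR, and the toy data is illustrative only:

    import pytest
    from sklearn.linear_model import LogisticRegression
    from hiclass import (
        LocalClassifierPerLevel,
        LocalClassifierPerNode,
        LocalClassifierPerParentNode,
    )

    # One parametrized test instead of near-identical copies in each test file.
    @pytest.mark.parametrize(
        "classifier_class",
        [LocalClassifierPerNode, LocalClassifierPerParentNode, LocalClassifierPerLevel],
    )
    def test_predict_proba(classifier_class):
        X = [[1.0], [2.0], [3.0], [4.0]]
        y = [["a", "b"], ["a", "c"], ["d", "e"], ["d", "f"]]
        clf = classifier_class(
            local_classifier=LogisticRegression(), calibration_method="ivap"
        )
        clf.fit(X, y)
        clf.calibrate(X, y)           # assumed calibration entry point from this PR
        proba = clf.predict_proba(X)  # calibrated probabilities per hierarchy level
        assert proba is not None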

Collaborator:

Why is this wrapper necessary?

Comment on lines +19 to +25
positive_label = 1
unique_labels = np.unique(y)
assert len(unique_labels) <= 2

y = np.where(y == positive_label, 1, 0)
y = y.reshape(-1) # make sure it's a 1D array

Collaborator:

Maybe these lines can be replaced with the binary_only estimator tag (https://scikit-learn.org/stable/developers/develop.html#estimator-tags); see the sketch after the suggested change.

Suggested change
positive_label = 1
unique_labels = np.unique(y)
assert len(unique_labels) <= 2
y = np.where(y == positive_label, 1, 0)
y = y.reshape(-1) # make sure it's a 1D array
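
A minimal sketch of the estimator-tag route (the class name is hypothetical); note that the tag only tells scikit-learn's common checks that the estimator supports binary targets, it does not itself validate y:

    from sklearn.base import BaseEstimator, ClassifierMixin

    class _BinaryCalibrator(BaseEstimator, ClassifierMixin):  # hypothetical name
        def _more_tags(self):
            # Declares to scikit-learn's estimator checks that only binary
            # targets are supported.
            return {"binary_only": True}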

mirand863 (Collaborator) commented May 3, 2024

Hi @LukasDrews97,

Just a quick request from someone in France who reached out to me via e-mail. Would it be possible to add a threshold to remove labels that have low probability?
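
A hedged sketch of what such a threshold could look like as a post-processing step; the function name and the use of an empty string for removed labels are illustrative only, not part of this PR:

    import numpy as np

    def apply_threshold(labels, probabilities, threshold=0.7):
        # Mask out predicted labels whose calibrated probability is below the threshold.
        labels = np.asarray(labels, dtype=object)
        keep = np.asarray(probabilities) >= threshold
        labels[~keep] = ""  # empty string as a placeholder for dropped labels
        return labels

    labels = np.array([["Animal", "Mammal"], ["Animal", "Reptile"]], dtype=object)
    probabilities = np.array([[0.95, 0.40], [0.90, 0.85]])
    print(apply_threshold(labels, probabilities))
    # [['Animal' ''] ['Animal' 'Reptile']]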
