Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Array API support for LabelEncoder #27381

Merged
merged 31 commits into from May 16, 2024

Conversation

OmarManzoor
Copy link
Contributor

Reference Issues/PRs

Towards #26024
Related to #27369

What does this implement/fix? Explain your changes.

  • Adds Array API support for LabelEncoder including all the inner functions that it uses.

Any other comments?

CC: @betatim @ogrisel

@github-actions
Copy link

github-actions bot commented Sep 15, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7500c2f. Link to the linter CI: here

@OmarManzoor OmarManzoor marked this pull request as ready for review September 15, 2023 10:08
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I don't have time to finalize my review now but here is a first few feedback:

sklearn/utils/_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_array_api.py Outdated Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more feedback.

sklearn/utils/_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/_encode.py Outdated Show resolved Hide resolved
sklearn/utils/_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
@ogrisel
Copy link
Member

ogrisel commented Mar 28, 2024

@OmarManzoor sorry I had not seen you had taken my first pass of review into account.

This PR now needs a bunch of conflict resolution because of concurrent changes merged in main. Are you interested in following-up or are you busy on other things and would rather someone else to takeover?

@OmarManzoor
Copy link
Contributor Author

@OmarManzoor sorry I had not seen you had taken my first pass of review into account.

This PR now needs a bunch of conflict resolution because of concurrent changes merged in main. Are you interested in following-up or are you busy on other things and would rather someone else to takeover?

I would not mind working on it but I might need some time since I would have to revisit what I had done here and what further needs to be done after conflict resolution. 😊

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more feedback below:

sklearn/utils/_encode.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_array_api.py Outdated Show resolved Hide resolved
@OmarManzoor
Copy link
Contributor Author

@betatim I think the tests are failing because of array-api-strict. Should we handle array-api-strict too?

@OmarManzoor
Copy link
Contributor Author

OmarManzoor commented May 3, 2024

@ogrisel It seems that the tests are failing because of array-api-strict. It doesn't have anything called argsort which we are using in the isin function. From your comment above it seems we don't need to include array-api-strict in this but then how do we ensure that the tests don't fail?

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a few commits to fix some of the failures (including failures related to devices that are only possible to trigger with torch and non-CPU devices). But we still need to work with libraries such as array-api-strict that do not yet implement xp.searchsorted, see below:

sklearn/utils/_encode.py Outdated Show resolved Hide resolved
@ogrisel
Copy link
Member

ogrisel commented May 3, 2024

Should we handle array-api-strict too?

Yes we should. It's the easiest and fastest way to detect non-compliance problems in our code.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also another problem with CuPy that I cannot debug because I temporarily lost my access to my CUDA host, but the short error message was:

FAILED sklearn/preprocessing/tests/test_label.py::test_label_encoder_array_api_compliance[y0-cupy-None-None] - TypeError: unhashable type: 'ndarray'

unfortunately, I no longer have access to the full traceback. I will need to wait for the machine to be free again to re-run that tests.

In the mean time:

doc/whats_new/v1.4.rst Outdated Show resolved Hide resolved
sklearn/utils/_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/_array_api.py Outdated Show resolved Hide resolved
sklearn/utils/_encode.py Outdated Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this PR on all supported namespaces and device combinations and all tests pass.

@ogrisel
Copy link
Member

ogrisel commented May 7, 2024

@betatim I think this is ready for another round of review on your end.

@OmarManzoor
Copy link
Contributor Author

Hi @betatim does this looks okay now?

Copy link
Member

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

Another round of comments. Sorry for not including them last time already

Comment on lines 727 to 729
xp_label_fit = xp_label.fit(xp_y)
xp_transformed = xp_label_fit.transform(xp_y)
xp_inv_transformed = xp_label_fit.inverse_transform(xp_transformed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't work out why the xp_label_fit is needed, so suggesting to remove it (same for np_label_fit later on)

Suggested change
xp_label_fit = xp_label.fit(xp_y)
xp_transformed = xp_label_fit.transform(xp_y)
xp_inv_transformed = xp_label_fit.inverse_transform(xp_transformed)
xp_label = xp_label.fit(xp_y)
xp_transformed = xp_label.transform(xp_y)
xp_inv_transformed = xp_label.inverse_transform(xp_transformed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xp_label is defined above as
xp_label = LabelEncoder()

we are using it below to test fit_transform
xp_label.fit_transform(xp_y)

If we set it to xp_label again it would override the original estimator and set it to the fitted one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I think that this is already modified when we run fit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betatim Could you kindly check if this looks okay now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it is used again but like you found out est.fit(X, y) modifies (and returns) est. The reason it works to reuse it later is that est.fit() and est.fit_transform() reset the state of the estimator before fitting/after calling fit() again the state is as if the first fit() call never happened (there are some exceptions with warm starting, but not here).

sklearn/preprocessing/tests/test_label.py Show resolved Hide resolved
sklearn/preprocessing/tests/test_label.py Show resolved Hide resolved
sklearn/utils/tests/test_array_api.py Outdated Show resolved Hide resolved
@betatim betatim merged commit acd2d90 into scikit-learn:main May 16, 2024
30 checks passed
@betatim
Copy link
Member

betatim commented May 16, 2024

Thanks a lot!

@OmarManzoor OmarManzoor deleted the label_encoder_array_api branch May 16, 2024 13:47
@jeremiedbb jeremiedbb mentioned this pull request May 20, 2024
14 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants