
ENH make Random*Sampler accept dask array and dataframe #777

Open · wants to merge 32 commits into master
Conversation

@glemaitre (Member) commented Nov 5, 2020

A POC to see whether we can make at least RandomUnderSampler and RandomOverSampler accept dask arrays and dataframes.

Note:

@pep8speaks commented Nov 5, 2020

Hello @glemaitre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 22:64: W504 line break after binary operator
Line 28:71: W504 line break after binary operator

Line 9:1: F401 'numpy as np' imported but unused
Line 59:33: W504 line break after binary operator
Line 95:33: W504 line break after binary operator

Line 5:1: E402 module level import not at top of file
Line 7:1: E402 module level import not at top of file
Line 8:1: E402 module level import not at top of file

Line 622:44: W504 line break after binary operator

Line 11:1: F401 'sklearn.utils._testing.assert_array_equal' imported but unused

Line 46:3: E121 continuation line under-indented for hanging indent

Line 65:21: W503 line break before binary operator
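
(For reference, W504 and W503 flag the two mutually exclusive wrapping styles around a binary operator; a minimal illustration with placeholder names:)

# W504: the line breaks after the binary operator.
total = (first_value +
         second_value)

# W503: the line breaks before the binary operator
# (the style PEP 8 itself now recommends).
total = (first_value
         + second_value)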

Comment last updated at 2020-11-08 19:22:54 UTC

@glemaitre changed the title from "ENH make RandomUnderSampler accept dask array and dataframe" to "ENH make Random*Sampler accept dask array and dataframe" on Nov 5, 2020
@lgtm-com (bot) commented Nov 5, 2020

This pull request introduces 2 alerts when merging 7aae9d9 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Unused local variable

@TomAugspurger left a comment

Looks good overall. I think the comment about dask.compute(...) rather than x.compute(), y.compute() is the most important.

Other than that I tried to share some of the difficulties I've run into with Dask-ML, but things look nice overall.

_REGISTERED_DASK_CONTAINER = []

try:
    from dask import array, dataframe


People can have just dask[array] installed (not dask[dataframe]), so it's possible for the array import to succeed while the dataframe import fails. If you want to support that case, those would need to be in separate try/except blocks.

Maybe you instead want from dask import is_dask_collection? That's a bit broader though (it also covers anything implementing dask's collection interface like dask.Bag, xarray.DataArray and xarray.Dataset).
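
A minimal sketch of the split-import approach (which concrete types get registered here is an assumption for illustration, not code from this PR):

_REGISTERED_DASK_CONTAINER = []

# Separate try/except blocks, so dask[array] without dask[dataframe] works.
try:
    from dask import array
    _REGISTERED_DASK_CONTAINER.append(array.Array)
except ImportError:
    pass

try:
    from dask import dataframe
    _REGISTERED_DASK_CONTAINER.extend([dataframe.Series, dataframe.DataFrame])
except ImportError:
    pass

# Broader alternative mentioned above; it also matches dask.Bag,
# xarray.DataArray, xarray.Dataset, etc.:
# from dask import is_dask_collection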

@glemaitre (Member, Author) replied:

That seems to be what I wanted :)

Comment on lines +12 to +15
if hasattr(y, "unique"):
    labels = np.asarray(y.unique())
else:
    labels = np.unique(y).compute()


I've struggled with this check in dask-ml. Depending on where it's called, it's potentially very expensive (you might be loading a ton of data just to check if it's multi-label, and then loading it again to do the training).

Whenever possible, it's helpful to provide an option to skip this check by having the user specify it when creating the estimator, or in a keyword to fit (dunno if that applies here).

@glemaitre (Member, Author) replied:

I thought about it. Do you think that having a context manager outside would make sense:

with set_config(avoid_check=True):
    # some imblearn/scikit-learn/dask code

Though, we might get into trouble with issues related to scikit-learn/scikit-learn#18736.
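
(A minimal sketch of how such a set_config context manager could be implemented; avoid_check is the hypothetical flag from the snippet above, and note that scikit-learn's real set_config is a plain function, with config_context as the context-manager variant:)

from contextlib import contextmanager

_config = {"avoid_check": False}  # hypothetical global configuration


@contextmanager
def set_config(**new_values):
    # Temporarily override the flags, restoring the old values on exit.
    old = _config.copy()
    _config.update(new_values)
    try:
        yield
    finally:
        _config.clear()
        _config.update(old)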

It might just be easier to have an optional class parameter that applies only for dask arrays.
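
(A hedged sketch of that idea; validate_if_dask is a hypothetical parameter name, and the class below is illustrative rather than the PR's actual implementation:)

import numpy as np
from dask import is_dask_collection


class RandomUnderSamplerSketch:
    def __init__(self, validate_if_dask=True):
        # When False, the expensive target check is skipped for dask inputs.
        self.validate_if_dask = validate_if_dask

    def fit_resample(self, X, y):
        if not is_dask_collection(y) or self.validate_if_dask:
            # Same check as in the snippet above; for a dask collection
            # this triggers a full pass over the data.
            labels = np.asarray(y.unique()) if hasattr(y, "unique") else np.unique(y)
            if is_dask_collection(labels):
                labels = labels.compute()
            if len(labels) < 2:
                raise ValueError("resampling needs at least two classes")
        # ... the actual resampling would go here ...
        return X, y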

force_all_finite=False,
if is_dask_container(y) and hasattr(y, "to_dask_array"):
    y = y.to_dask_array()
    y.compute_chunk_sizes()


In Dask-ML we (@stsievert I think? maybe me?) prefer to have the user do this: https://github.com/dask/dask-ml/blob/7e11ce1505a485104e02d49a3620c8264e63e12e/dask_ml/utils.py#L166-L173. If you're just fitting the one estimator then this is probably equivalent. If you're doing something like a cross_val_score, then I think this would end up loading data multiple times just to compute the chunk sizes.
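
(For reference, a sketch of the user-side conversion being described; dask's Series.to_dask_array(lengths=True) computes the chunk lengths once, up front:)

import dask.dataframe as dd
import pandas as pd

# Demo data; in practice y arrives from the user already partitioned.
y = dd.from_pandas(pd.Series([0, 1, 0, 1, 1]), npartitions=2)

# lengths=True materializes the chunk sizes once, so downstream code
# never needs to call compute_chunk_sizes() itself.
y_arr = y.to_dask_array(lengths=True)
print(y_arr.chunks)  # concrete sizes, e.g. ((3, 2),)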

@glemaitre (Member, Author) replied:

This is something I was unsure about here. If I recall correctly, the issue was that I could not call ravel on the Series, so converting to an array and then back to a Series (reusing the metadata) was the easiest way to always have an array to work with.

However, if we assume that the checks are too expensive to be done in a distributed setting, we don't need to call the check below and we can directly pass the Series and handle it during the resampling.

So we have fewer safeguards, but at least it is more performant, which is something you probably want in a distributed setting.

Comment on lines 129 to 130
if is_dask_container(unique):
    unique, counts = unique.compute(), counts.compute()


As written, this will fully execute the task graph of y twice: once to compute unique and once to compute counts.

Suggested change:

if is_dask_container(unique):
    unique, counts = unique.compute(), counts.compute()

becomes

if is_dask_container(unique):
    unique, counts = dask.compute(unique, counts)

You'll need to import dask.
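
(A self-contained illustration of the difference; with a single dask.compute call, the shared task graph of y is traversed once for both results:)

import dask
import dask.array as da

y = da.random.randint(0, 2, size=1_000_000, chunks=100_000)
unique, counts = da.unique(y, return_counts=True)

# One pass over the shared graph, instead of one pass per .compute() call.
unique, counts = dask.compute(unique, counts)
print(unique, counts)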

@lgtm-com (bot) commented Nov 7, 2020

This pull request introduces 5 alerts when merging d4aabf8 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure
  • 2 for Unused local variable
  • 1 for Unused import

@lgtm-com (bot) commented Nov 7, 2020

This pull request introduces 5 alerts when merging 58acdf2 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure
  • 2 for Unused local variable
  • 1 for Unused import

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging e54c772 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging 167fc2a into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@codecov (bot) commented Nov 8, 2020

Codecov Report

Merging #777 (456c3eb) into master (2a0376e) will increase coverage by 1.63%.
The diff coverage is 94.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #777      +/-   ##
==========================================
+ Coverage   96.55%   98.18%   +1.63%     
==========================================
  Files          82       94      +12     
  Lines        5140     5900     +760     
  Branches        0      515     +515     
==========================================
+ Hits         4963     5793     +830     
+ Misses        177      100      -77     
- Partials        0        7       +7     
Impacted Files Coverage Δ
imblearn/combine/_smote_enn.py 100.00% <ø> (ø)
imblearn/combine/_smote_tomek.py 100.00% <ø> (ø)
imblearn/datasets/_zenodo.py 96.77% <ø> (ø)
imblearn/ensemble/_weight_boosting.py 97.75% <ø> (ø)
imblearn/keras/_generator.py 97.14% <ø> (+44.28%) ⬆️
imblearn/over_sampling/_adasyn.py 98.41% <ø> (ø)
imblearn/over_sampling/_random_over_sampler.py 100.00% <ø> (ø)
imblearn/over_sampling/_smote.py 97.30% <ø> (ø)
imblearn/tensorflow/_generator.py 100.00% <ø> (+64.51%) ⬆️
...rototype_selection/_condensed_nearest_neighbour.py 100.00% <ø> (ø)
... and 49 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update edd7522...456c3eb. Read the comment docs.

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging 20b44c6 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging a6e975b into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging 456c3eb into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@ridhachahed commented:

@glemaitre Why didn't you merge this branch into master? Everything seems alright, doesn't it?

@glemaitre (Member, Author) replied:

It is one year old, so I don't recall the details. It was only a POC to see what the issues would be in dealing with dask-ml and dask. I think one of the issues I had was about validation: #777 (comment)

It would need further work to go ahead.

@ridhachahed replied:

I think it would be a pity if it doesn't go in because of this comment. We can't really do much about it except avoid calling this method. Happy to help if there is anything else that needs to be done :)
