
ENH make Random*Sampler accept dask array and dataframe #777

Open · wants to merge 32 commits into master
Conversation

@glemaitre (Member) commented Nov 5, 2020

A POC to see whether we can make at least RandomUnderSampler and RandomOverSampler accept dask arrays and dataframes.

Note:

@pep8speaks commented Nov 5, 2020

Hello @glemaitre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 22:64: W504 line break after binary operator
Line 28:71: W504 line break after binary operator

Line 9:1: F401 'numpy as np' imported but unused
Line 59:33: W504 line break after binary operator
Line 95:33: W504 line break after binary operator

Line 5:1: E402 module level import not at top of file
Line 7:1: E402 module level import not at top of file
Line 8:1: E402 module level import not at top of file

Line 622:44: W504 line break after binary operator

Line 11:1: F401 'sklearn.utils._testing.assert_array_equal' imported but unused

Line 46:3: E121 continuation line under-indented for hanging indent

Line 65:21: W503 line break before binary operator
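
(For reference, W504 and W503 flag the two mutually exclusive wrapping styles around a binary operator; a minimal illustration with placeholder names:)

# W504: the line breaks after the binary operator.
total = (first_value +
         second_value)

# W503: the line breaks before the binary operator
# (the style PEP 8 itself now recommends).
total = (first_value
         + second_value)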

Comment last updated at 2020-11-08 19:22:54 UTC

@glemaitre changed the title from "ENH make RandomUnderSampler accept dask array and dataframe" to "ENH make Random*Sampler accept dask array and dataframe" on Nov 5, 2020
@lgtm-com (bot) commented Nov 5, 2020

This pull request introduces 2 alerts when merging 7aae9d9 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Unused local variable

@TomAugspurger left a comment

Looks good overall. I think the comment about dask.compute(...) rather than x.compute(), y.compute() is the most important.

Other than that I tried to share some of the difficulties I've run into with Dask-ML, but things look nice overall.

_REGISTERED_DASK_CONTAINER = []

try:
    from dask import array, dataframe


People can have just dask[array] installed (not dask[dataframe]), so it's possible for the array import to succeed while the dataframe import fails. If you want to support that case, those would need to be in separate try/except blocks.

Maybe you instead want from dask import is_dask_collection? That's a bit broader though (it also covers anything implementing dask's collection interface like dask.Bag, xarray.DataArray and xarray.Dataset).
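
A minimal sketch of the split-import approach (which concrete types get registered here is an assumption for illustration, not code from this PR):

_REGISTERED_DASK_CONTAINER = []

# Separate try/except blocks, so dask[array] without dask[dataframe] works.
try:
    from dask import array
    _REGISTERED_DASK_CONTAINER.append(array.Array)
except ImportError:
    pass

try:
    from dask import dataframe
    _REGISTERED_DASK_CONTAINER.extend([dataframe.Series, dataframe.DataFrame])
except ImportError:
    pass

# Broader alternative mentioned above; it also matches dask.Bag,
# xarray.DataArray, xarray.Dataset, etc.:
# from dask import is_dask_collection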

@glemaitre (Member, Author) replied:

That seems to be what I wanted :)

Comment on lines +12 to +15
if hasattr(y, "unique"):
    labels = np.asarray(y.unique())
else:
    labels = np.unique(y).compute()


I've struggled with this check in dask-ml. Depending on where it's called, it's potentially very expensive (you might be loading a ton of data just to check if it's multi-label, and then loading it again to do the training).

Whenever possible, it's helpful to provide an option to skip this check by having the user specify it when creating the estimator, or in a keyword to fit (dunno if that applies here).

@glemaitre (Member, Author) replied:

I thought about it. Do you think that having a context manager outside would make sense:

with set_config(avoid_check=True):
    # some imblearn/scikit-learn/dask code

Though, we might get into trouble with issues related to scikit-learn/scikit-learn#18736.
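
(A minimal sketch of how such a set_config context manager could be implemented; avoid_check is the hypothetical flag from the snippet above, and note that scikit-learn's real set_config is a plain function, with config_context as the context-manager variant:)

from contextlib import contextmanager

_config = {"avoid_check": False}  # hypothetical global configuration


@contextmanager
def set_config(**new_values):
    # Temporarily override the flags, restoring the old values on exit.
    old = _config.copy()
    _config.update(new_values)
    try:
        yield
    finally:
        _config.clear()
        _config.update(old)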

It might just be easier to have an optional class parameter that applies only for dask arrays.
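
(A hedged sketch of that idea; validate_if_dask is a hypothetical parameter name, and the class below is illustrative rather than the PR's actual implementation:)

import numpy as np
from dask import is_dask_collection


class RandomUnderSamplerSketch:
    def __init__(self, validate_if_dask=True):
        # When False, the expensive target check is skipped for dask inputs.
        self.validate_if_dask = validate_if_dask

    def fit_resample(self, X, y):
        if not is_dask_collection(y) or self.validate_if_dask:
            # Same check as in the snippet above; for a dask collection
            # this triggers a full pass over the data.
            labels = np.asarray(y.unique()) if hasattr(y, "unique") else np.unique(y)
            if is_dask_collection(labels):
                labels = labels.compute()
            if len(labels) < 2:
                raise ValueError("resampling needs at least two classes")
        # ... the actual resampling would go here ...
        return X, y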

force_all_finite=False,
if is_dask_container(y) and hasattr(y, "to_dask_array"):
    y = y.to_dask_array()
    y.compute_chunk_sizes()


In Dask-ML we (@stsievert I think? maybe me?) prefer to have the user do this: https://github.com/dask/dask-ml/blob/7e11ce1505a485104e02d49a3620c8264e63e12e/dask_ml/utils.py#L166-L173. If you're just fitting the one estimator then this is probably equivalent. If you're doing something like a cross_val_score, then I think this would end up loading data multiple times just to compute the chunk sizes.
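
(For reference, a sketch of the user-side conversion being described; dask's Series.to_dask_array(lengths=True) computes the chunk lengths once, up front:)

import dask.dataframe as dd
import pandas as pd

# Demo data; in practice y arrives from the user already partitioned.
y = dd.from_pandas(pd.Series([0, 1, 0, 1, 1]), npartitions=2)

# lengths=True materializes the chunk sizes once, so downstream code
# never needs to call compute_chunk_sizes() itself.
y_arr = y.to_dask_array(lengths=True)
print(y_arr.chunks)  # concrete sizes, e.g. ((3, 2),)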

@glemaitre (Member, Author) replied:

This is something I was unsure about here. If I recall correctly, the issue was that I could not call ravel on the Series, so converting to an array and then back to a Series (reusing the metadata) was the easiest way to always have an array to work with.

However, if we assume that the checks are too expensive to be done in a distributed setting, we don't need to call the check below and we can directly pass the Series and handle it during the resampling.

So we have fewer safeguards, but at least it is more performant, which is something you probably want in a distributed setting.

Comment on lines 129 to 130
if is_dask_container(unique):
    unique, counts = unique.compute(), counts.compute()


As written, this will fully execute the task graph of y twice: once to compute unique and once to compute counts.

Suggested change:

if is_dask_container(unique):
    unique, counts = unique.compute(), counts.compute()

becomes

if is_dask_container(unique):
    unique, counts = dask.compute(unique, counts)

You'll need to import dask.
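
(A self-contained illustration of the difference; with a single dask.compute call, the shared task graph of y is traversed once for both results:)

import dask
import dask.array as da

y = da.random.randint(0, 2, size=1_000_000, chunks=100_000)
unique, counts = da.unique(y, return_counts=True)

# One pass over the shared graph, instead of one pass per .compute() call.
unique, counts = dask.compute(unique, counts)
print(unique, counts)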

@lgtm-com (bot) commented Nov 7, 2020

This pull request introduces 5 alerts when merging d4aabf8 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure
  • 2 for Unused local variable
  • 1 for Unused import

@lgtm-com (bot) commented Nov 7, 2020

This pull request introduces 5 alerts when merging 58acdf2 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure
  • 2 for Unused local variable
  • 1 for Unused import

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging e54c772 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging 167fc2a into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@codecov (bot) commented Nov 8, 2020

Codecov Report

Merging #777 (456c3eb) into master (2a0376e) will increase coverage by 1.63%.
The diff coverage is 94.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #777      +/-   ##
==========================================
+ Coverage   96.55%   98.18%   +1.63%     
==========================================
  Files          82       94      +12     
  Lines        5140     5900     +760     
  Branches        0      515     +515     
==========================================
+ Hits         4963     5793     +830     
+ Misses        177      100      -77     
- Partials        0        7       +7     
Impacted Files Coverage Δ
imblearn/combine/_smote_enn.py 100.00% <ø> (ø)
imblearn/combine/_smote_tomek.py 100.00% <ø> (ø)
imblearn/datasets/_zenodo.py 96.77% <ø> (ø)
imblearn/ensemble/_weight_boosting.py 97.75% <ø> (ø)
imblearn/keras/_generator.py 97.14% <ø> (+44.28%) ⬆️
imblearn/over_sampling/_adasyn.py 98.41% <ø> (ø)
imblearn/over_sampling/_random_over_sampler.py 100.00% <ø> (ø)
imblearn/over_sampling/_smote.py 97.30% <ø> (ø)
imblearn/tensorflow/_generator.py 100.00% <ø> (+64.51%) ⬆️
...rototype_selection/_condensed_nearest_neighbour.py 100.00% <ø> (ø)
... and 49 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update edd7522...456c3eb. Read the comment docs.

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging 20b44c6 into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging a6e975b into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@lgtm-com (bot) commented Nov 8, 2020

This pull request introduces 2 alerts when merging 456c3eb into edd7522 - view on LGTM.com

new alerts:

  • 2 for Use of the return value of a procedure

@ridhachahed commented:

@glemaitre Why didn't you merge this branch into master? Everything seems alright, doesn't it?

@glemaitre (Member, Author) replied:

It is one year old, so I don't recall the details. It was only a POC to see what the issues would be in dealing with dask-ml and dask. I think one of the issues I had was about validation: #777 (comment)

It would need further work to go ahead.

@ridhachahed replied:

I think it would be a pity if it doesn't go in because of this comment. We can't really do much about it except avoid calling this method. Happy to help if there is anything else that needs to be done :)
