
Implements Index.putmask #1560

Open · wants to merge 11 commits into base: master
Conversation

beobest2
Contributor

@beobest2 beobest2 commented Jun 2, 2020

Implementing Index.putmask

>>> kidx = ks.Index(['a', 'b', 'c', 'd', 'e'])
>>> mask = [True if x < 2 else False for x in range(5)]
>>> value = 100

>>> kidx
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

>>> kidx.putmask(mask, value).sort_values()
Index(['100', '100', 'c', 'd', 'e'], dtype='object')
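
The semantics of `putmask` with a scalar value can be sketched in plain Python (a hypothetical helper, not the Koalas implementation): wherever the mask is True, the value replaces the element. Note that in the example above the scalar 100 ends up as the string '100' because the index dtype is object.

```python
def putmask_sketch(values, mask, value):
    # Where the mask is True, substitute the (scalar) value;
    # otherwise keep the original element.
    return [value if m else v for v, m in zip(values, mask)]

idx = ['a', 'b', 'c', 'd', 'e']
mask = [x < 2 for x in range(5)]
print(putmask_sketch(idx, mask, 100))  # [100, 100, 'c', 'd', 'e']
```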

@itholic
Contributor

itholic commented Jun 2, 2020

Could you also remove putmask from the missing-function lists as below, and implement it for MultiIndex as well?

@@ -58,7 +58,6 @@ class MissingPandasLikeIndex(object):
     is_type_compatible = _unsupported_function("is_type_compatible")
     join = _unsupported_function("join")
     map = _unsupported_function("map")
-    putmask = _unsupported_function("putmask")
     ravel = _unsupported_function("ravel")
     reindex = _unsupported_function("reindex")
     searchsorted = _unsupported_function("searchsorted")
@@ -131,7 +130,6 @@ class MissingPandasLikeMultiIndex(object):
     is_type_compatible = _unsupported_function("is_type_compatible")
     join = _unsupported_function("join")
     map = _unsupported_function("map")
-    putmask = _unsupported_function("putmask")
     ravel = _unsupported_function("ravel")
     reindex = _unsupported_function("reindex")
     remove_unused_levels = _unsupported_function("remove_unused_levels")

@HyukjinKwon changed the title from Implements Index.IndexesTest to Implements Index.putmask on Jun 3, 2020
@HyukjinKwon
Member

@beobest2 can you fix the test?

@beobest2
Contributor Author

@HyukjinKwon Okay, I'll fix the test.

@codecov-commenter

codecov-commenter commented Jun 16, 2020

Codecov Report

Merging #1560 into master will decrease coverage by 0.30%.
The diff coverage is 97.53%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1560      +/-   ##
==========================================
- Coverage   94.55%   94.25%   -0.31%     
==========================================
  Files          38       38              
  Lines        8767     8715      -52     
==========================================
- Hits         8290     8214      -76     
- Misses        477      501      +24     
Impacted Files Coverage Δ
databricks/koalas/missing/indexes.py 100.00% <ø> (ø)
databricks/koalas/missing/series.py 100.00% <ø> (ø)
databricks/koalas/indexes.py 96.64% <92.59%> (-0.23%) ⬇️
databricks/koalas/__init__.py 93.54% <100.00%> (-0.57%) ⬇️
databricks/koalas/frame.py 95.94% <100.00%> (-0.89%) ⬇️
databricks/koalas/generic.py 96.65% <100.00%> (-0.02%) ⬇️
databricks/koalas/groupby.py 90.44% <100.00%> (-0.08%) ⬇️
databricks/koalas/series.py 97.61% <100.00%> (-0.01%) ⬇️
databricks/koalas/typedef/string_typehints.py 100.00% <100.00%> (ø)
databricks/koalas/typedef/typehints.py 86.00% <100.00%> (-1.37%) ⬇️
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc27c2a...02cf26c. Read the comment docs.

masking_col = verify_temp_column_name(sdf, "__masking_column__")

if isinstance(value, (list, tuple)):
replace_udf = udf(lambda x: value[x], _infer_type(value[0]))
Collaborator

Is it possible to use pandas_udf instead of udf? If possible, could you replace with it?

Contributor Author

It is possible! I modified it to use pandas_udf
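
For context on the reviewer's suggestion: a plain `udf` invokes the Python function once per row with scalar arguments, while a `pandas_udf` receives whole batches as `pandas.Series`, which is much cheaper across the JVM/Python boundary. A minimal sketch of what the batched lookup body looks like (illustrative only; in the actual change it would be wrapped with `pandas_udf` and run inside Spark):

```python
import pandas as pd

value = ['g', 'h', 'i', 'j', 'k']

def replace_batch(positions: pd.Series) -> pd.Series:
    # Body of a pandas_udf: maps a whole batch of row positions
    # to replacement values in one vectorized call, instead of
    # one Python call per row as with a plain udf.
    return positions.map(lambda i: value[i])

print(replace_batch(pd.Series([0, 2, 4])).tolist())  # ['g', 'i', 'k']
```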

sdf = sdf.withColumn(replace_col, replace_udf(dist_sequence_col_name))
elif isinstance(value, (Index, Series)):
value = value.to_numpy().tolist()
replace_udf = udf(lambda x: value[x], _infer_type(value[0]))
Collaborator

ditto.

elif not isinstance(mask, list) and not isinstance(mask, tuple):
raise TypeError("Mask data doesn't support type {0}".format(type(mask).__name__))

masking_udf = udf(lambda x: mask[x], BooleanType())
Collaborator

ditto.

sdf = sdf.withColumn(replace_col, F.lit(value))

if isinstance(mask, (Index, Series)):
mask = mask.to_numpy().tolist()
Collaborator

I don't think we should do this.

# | 4| e| 500| false|
# +-------------------------------+-----------------+------------------+------------------+

cond = F.when(sdf[masking_col], sdf[replace_col]).otherwise(sdf[scol_name])
Collaborator

Could you use scol_for(sdf, scol_name)?

self.assert_eq(
kidx.putmask(kidx < "c", ks.Series(["g", "h", "i", "j", "k"])).sort_values(),
pidx.putmask(pidx < "c", pd.Series(["g", "h", "i", "j", "k"])).sort_values(),
)
Collaborator

What if the length of value is not same as the index length? Could you add the tests?

Contributor Author

@beobest2 beobest2 Jun 16, 2020

@ueshin Thanks for the comment! I will address it as you commented. :)

Contributor Author

@beobest2 beobest2 Jun 17, 2020

@ueshin
In pandas, if the length of the mask is different, a ValueError is raised.

>>> pidx
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> pidx.putmask([True, False], pd.Series(["g", "h", "i", "j", "k"])).sort_values()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hwpark/Desktop/git_koalas/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4041, in putmask
    raise err
  File "/Users/hwpark/Desktop/git_koalas/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4037, in putmask
    np.putmask(values, mask, self._convert_for_op(value))
  File "<__array_function__ internals>", line 6, in putmask
ValueError: putmask: mask and data must be the same size

So I fixed Koalas to raise the same error as well.

>>> kidx.putmask([True, False], ks.Series(["g", "h", "i", "j", "k"])).sort_values()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hwpark/Desktop/git_koalas/koalas/databricks/koalas/indexes.py", line 1612, in putmask
    raise ValueError("mask and data must be the same size")
ValueError: mask and data must be the same size

If the value has a different length, pandas behaves like this:

>>> pidx
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> pidx.putmask(pidx > 'c', pd.Series(["g", "h"])).sort_values()
Index(['a', 'b', 'c', 'g', 'h'], dtype='object')
>>> pidx.putmask(pidx < 'c', pd.Series(["g", "h"])).sort_values()
Index(['c', 'd', 'e', 'g', 'h'], dtype='object')
>>> pidx.putmask(pidx < 'c', pd.Series(["g"])).sort_values()
Index(['c', 'd', 'e', 'g', 'g'], dtype='object')
>>> pidx.putmask([True, False, True, False, True], pd.Series(["g", "h"])).sort_values()
Index(['b', 'd', 'g', 'g', 'g'], dtype='object')

I thought this pandas behavior was ambiguous, so I left a comment at line 1593 for now.

# TODO: We can't support different size of value for now.
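
The "ambiguous" pandas behavior above follows `np.putmask`: when the value is shorter than the index, it is repeated cyclically over the positions of the original array before the mask is applied. A plain-Python sketch of that tiling rule (a hypothetical helper, for illustration only):

```python
def putmask_tiled(values, mask, value_seq):
    # np.putmask semantics: value_seq repeats cyclically over positions
    # of the ORIGINAL array; replacement happens where mask is True.
    return [value_seq[i % len(value_seq)] if m else v
            for i, (v, m) in enumerate(zip(values, mask))]

idx = ['a', 'b', 'c', 'd', 'e']
# Matches pidx.putmask([True, False, True, False, True], pd.Series(["g", "h"]))
print(sorted(putmask_tiled(idx, [True, False, True, False, True], ['g', 'h'])))
# ['b', 'd', 'g', 'g', 'g']
```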

@itholic
Contributor

itholic commented Aug 26, 2020

@beobest2 could you rebase this when available ?

@beobest2
Contributor Author

@itholic sure :)

Comment on lines +1623 to +1631
if isinstance(value, (list, tuple, Index, Series)):
if isinstance(value, (list, tuple)):
pandas_value = pd.Series(value)
elif isinstance(value, (Index, Series)):
pandas_value = value.to_pandas()

if self.size != pandas_value.size:
# TODO: We can't support different size of value for now.
raise ValueError("value and data must be the same size")
Contributor

If we can only support values of the same size, I think we shouldn't support this API for non-scalar objects for now.

Since we're using pd.Series(value) and value.to_pandas() above, it looks quite dangerous.

Contributor

@itholic itholic Aug 31, 2020

I think we'd better support this API only for ks.Index so that we can avoid collecting all the data onto a single machine.

Maybe we can apply almost the same concept as the implementation of Series.where. (https://koalas.readthedocs.io/en/latest/_modules/databricks/koalas/series.html#Series.where)

Would you tell me what you think about this approach when you're available, @ueshin @HyukjinKwon?
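
For reference on that suggestion: `where` keeps an element when the condition holds and substitutes otherwise, so `putmask` is its complement (replace where the mask is True). A minimal sketch of that relationship (hypothetical helpers, not the actual Series.where implementation):

```python
def where_sketch(values, cond, other):
    # Series.where semantics: keep the value where cond is True,
    # otherwise take the corresponding element of other.
    return [v if c else o for v, c, o in zip(values, cond, other)]

def putmask_via_where(values, mask, other):
    # putmask replaces where the mask is True, i.e. where(~mask).
    return where_sketch(values, [not m for m in mask], other)

print(putmask_via_where(['a', 'b', 'c'], [True, False, False], ['x', 'y', 'z']))
# ['x', 'b', 'c']
```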

@xinrong-meng
Contributor

Hi @beobest2, since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36403. Otherwise, I may do that for you next week.

@beobest2
Contributor Author

beobest2 commented Aug 4, 2021

> Hi @beobest2, since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36403. Otherwise, I may do that for you next week.

Hi @xinrong-databricks I would like to migrate this PR to the Spark repository. I will try to finish it by next week.

@xinrong-meng
Contributor

Please take your time :) Thank you!

@beobest2
Contributor Author

@xinrong-databricks I created a PR at apache/spark#33744 . Please take a look :)

@xinrong-meng
Contributor

Certainly, let's discuss in the new PR then! Thanks for porting it.
