Enabling binary operations with list-like Python objects. #2054

itholic · 2021-02-15T07:50:49Z

Currently, Koalas doesn't support list-like Python objects as operands in Series binary operations.

>>> kser
0    1
1    2
2    3
3    4
4    5
5    6
Name: x, dtype: int64

>>> kser + [10, 20, 30, 40, 50, 60]
Traceback (most recent call last):
...

This PR enables it.

>>> kser
0    1
1    2
2    3
3    4
4    5
5    6
Name: x, dtype: int64
>>> kser + [10, 20, 30, 40, 50, 60]
0    11
1    22
2    33
3    44
4    55
5    66
Name: x, dtype: int64
>>> kser - [10, 20, 30, 40, 50, 60]
0    -9
1   -18
2   -27
3   -36
4   -45
5   -54
Name: x, dtype: int64
>>> kser * [10, 20, 30, 40, 50, 60]
0     10
1     40
2     90
3    160
4    250
5    360
Name: x, dtype: int64
>>> kser / [10, 20, 30, 40, 50, 60]
0    0.1
1    0.1
2    0.1
3    0.1
4    0.1
5    0.1
Name: x, dtype: float64

ref #2022 (comment)

itholic · 2021-02-15T13:46:56Z

databricks/koalas/base.py

@@ -495,7 +495,7 @@ def __rsub__(self, other) -> Union["Series", "Index"]:
                return -column_op(F.datediff)(self, F.lit(other)).astype("long")
            else:
                raise TypeError("date subtraction can only be applied to date series.")
-        return column_op(Column.__rsub__)(self, other)
+        return column_op(lambda left, right: right - left)(self, other)


FYI: Column.__rsub__ doesn't support pyspark.sql.column.Column as its second parameter.

>>> kdf = ks.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
>>> sdf = kdf.to_spark()
>>> col1 = sdf.A
>>> col2 = sdf.B
>>> Column.__rsub__(col1, col2)
Traceback (most recent call last):
...
TypeError: Column is not iterable

It does support:

>>> Column.__rsub__(df.id, 1)
Column<'(1 - id)'>

It doesn't work in your case above because the other operand is itself a Spark column. In practice, that wouldn't happen, because __rsub__ is only called when the first operand doesn't know how to handle a Spark column, e.g. 1 - df.id.
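
To illustrate the dispatch rule described here, a minimal pure-Python sketch follows. `Expr` is a hypothetical stand-in for a Spark Column (it is not how PySpark implements it) and only records the expression text. It shows that Python falls back to `__rsub__` when the left operand cannot handle the subtraction, and that `lambda left, right: right - left` produces the reflected result when both operands are expressions.

```python
class Expr:
    """Hypothetical stand-in for a Spark Column; it only records expression text."""

    def __init__(self, text):
        self.text = text

    def __sub__(self, other):
        # Expr - anything: called when the left operand is an Expr.
        return Expr("({} - {})".format(self.text, _text(other)))

    def __rsub__(self, other):
        # anything - Expr: called only after the left operand's __sub__
        # returned NotImplemented (e.g. int.__sub__ given an Expr).
        return Expr("({} - {})".format(_text(other), self.text))


def _text(value):
    return value.text if isinstance(value, Expr) else repr(value)


col_id = Expr("id")
other = Expr("other")

# int.__sub__ cannot handle an Expr, so Expr.__rsub__ runs.
reflected = 1 - col_id
# Both operands are Exprs, so swapping them and using plain "-" is enough.
swapped = (lambda left, right: right - left)(col_id, other)

print(reflected.text)  # (1 - id)
print(swapped.text)    # (other - id)
```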

Does it cause any exception?

If we use column_op(Column.__rsub__)(self, other) as it is, it raises TypeError: Column is not iterable for the case below.

>>> kser = ks.Series([1, 2, 3, 4])
>>> [10, 20, 30, 40] - kser
Traceback (most recent call last):
...
TypeError: Column is not iterable

Note that this case must be handled in lines 490-492. We can move back to Column.__rsub__.

HyukjinKwon · 2021-02-17T04:35:34Z

databricks/koalas/tests/test_ops_on_diff_frames.py

+        # other = tuple with the different length
+        other = (np.nan, 1, 3, 4, np.nan)
+        with self.assertRaisesRegex(
+            ValueError, "operands could not be broadcast together with shapes"


The error message looks weird. Does it match pandas'?

The original error message from pandas looks like:

ValueError: operands could not be broadcast together with shapes (4,) (8,)

@ueshin, maybe we shouldn't include the (4,) (8,) part, since it requires computing the length of both objects, which can be expensive?

databricks/koalas/series.py

codecov-io · 2021-02-18T07:57:43Z

Codecov Report

Merging #2054 (0fd3666) into master (87f5b18) will decrease coverage by 1.44%.
The diff coverage is 91.17%.

@@            Coverage Diff             @@
##           master    #2054      +/-   ##
==========================================
- Coverage   94.71%   93.26%   -1.45%     
==========================================
  Files          54       54              
  Lines       11503    11735     +232     
==========================================
+ Hits        10895    10945      +50     
- Misses        608      790     +182

Impacted Files	Coverage Δ
databricks/koalas/utils.py	`93.66% <75.00%> (-1.71%)`	⬇️
databricks/koalas/base.py	`97.35% <96.00%> (+0.06%)`	⬆️
databricks/koalas/indexes/base.py	`97.43% <100.00%> (ø)`
databricks/koalas/usage_logging/__init__.py	`26.66% <0.00%> (-65.84%)`	⬇️
databricks/koalas/usage_logging/usage_logger.py	`47.82% <0.00%> (-52.18%)`	⬇️
databricks/koalas/__init__.py	`80.00% <0.00%> (-12.00%)`	⬇️
databricks/conftest.py	`91.30% <0.00%> (-8.70%)`	⬇️
databricks/koalas/accessors.py	`86.43% <0.00%> (-7.04%)`	⬇️
databricks/koalas/spark/accessors.py	`88.67% <0.00%> (-6.29%)`	⬇️
databricks/koalas/typedef/typehints.py	`91.06% <0.00%> (-2.75%)`	⬇️
... and 13 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 87f5b18...0fd3666.

xinrong-meng · 2021-02-18T18:04:59Z

databricks/koalas/utils.py

@@ -813,3 +813,31 @@ def compare_disallow_null(left, right, comp):

 def compare_allow_null(left, right, comp):
    return left.isNull() | right.isNull() | comp(left, right)
+
+
+def check_same_length(left: "IndexOpsMixin", right: Union[list, tuple]):


Nice utility! The function name might be misleading considering its return type. Would it be possible to annotate the return type or rename the function?

ueshin

Also, could you try to reduce the amount of test code by using a loop or parameterizing, if there is no difference except for the operators?
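
One way to do this, as a minimal sketch using only the standard library: loop over functions from the `operator` module and wrap each iteration in `unittest`'s `subTest`, so a failure is reported per operator. The `elementwise` helper and the sample data are hypothetical stand-ins for the Koalas and pandas objects the real tests would compare.

```python
import operator
import unittest


def elementwise(op, left, right):
    """Hypothetical stand-in for a Series binary operation with a list or tuple."""
    if len(left) != len(right):
        raise ValueError("operands could not be broadcast together with shapes")
    return [op(a, b) for a, b in zip(left, right)]


class ListLikeOpsTest(unittest.TestCase):
    def test_arithmetic_with_list(self):
        left = [1, 2, 3, 4, 5, 6]
        right = [10, 20, 30, 40, 50, 60]
        expected = {
            operator.add: [11, 22, 33, 44, 55, 66],
            operator.sub: [-9, -18, -27, -36, -45, -54],
            operator.mul: [10, 40, 90, 160, 250, 360],
        }
        # One loop replaces a copy of the test body per operator.
        for op, want in expected.items():
            with self.subTest(op=op.__name__):
                self.assertEqual(elementwise(op, left, right), want)

    def test_different_length(self):
        for op in (operator.add, operator.sub, operator.mul, operator.truediv):
            with self.subTest(op=op.__name__):
                with self.assertRaisesRegex(ValueError, "could not be broadcast"):
                    elementwise(op, [1, 2, 3], (1, 2))


# Run the two test methods without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ListLikeOpsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With `subTest`, one failing operator does not hide the results for the others, which keeps the loop as informative as separate test methods.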

ueshin · 2021-02-18T20:42:33Z

databricks/koalas/utils.py

+            if LooseVersion(pd.__version__) < LooseVersion("1.2.0"):
+                right = pd.Index(right, name=pindex_ops.name)


What happens with pandas<1.2?
Seems like it's working with pandas >= 1.0 in the test?

Actually it works:

>>> pd.__version__
'1.0.5'
>>> pd.Index([1,2,3]) + [4,5,6]
Int64Index([5, 7, 9], dtype='int64')
>>> [4,5,6] + pd.Index([1,2,3])
Int64Index([5, 7, 9], dtype='int64')

Oh, it seems that only rmod doesn't work in pandas < 1.2.

>>> [4, 5, 6] % pd.Index([1,2,3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Int64Index' object has no attribute 'rmod'

Let me address only this case.

Thanks!

ueshin · 2021-02-18T20:45:46Z

databricks/koalas/utils.py

+            raise ValueError(
+                "operands could not be broadcast together with shapes ({},) ({},)".format(
+                    len_pindex_ops, len_right
+                )


We can show the length of left if it's less than the length of right, but if it's greater, the actual length is unknown.
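
The idea can be sketched in plain Python, assuming the left side is an iterable whose full length is expensive to compute. The hypothetical helper `bounded_len` pulls at most `len(right) + 1` items, so it can tell that the left side is longer without learning by how much. This only illustrates the trade-off; it is not the code in this PR.

```python
from itertools import islice


def bounded_len(items, limit):
    """Count the items, but stop after `limit + 1` of them."""
    return sum(1 for _ in islice(items, limit + 1))


def check_lengths(left, right):
    """Raise if the lengths differ, without a full count of `left`."""
    len_right = len(right)
    seen = bounded_len(left, len_right)
    if seen < len_right:
        # Left was exhausted, so `seen` is its exact length.
        raise ValueError(
            "operands could not be broadcast together with shapes "
            "({},) ({},)".format(seen, len_right)
        )
    if seen > len_right:
        # Left is longer, but its exact length is unknown.
        raise ValueError("operands could not be broadcast together with shapes")


check_lengths(iter(range(4)), [1, 2, 3, 4])  # same length: no error

try:
    check_lengths(iter(range(2)), [1, 2, 3, 4])
except ValueError as exc:
    short_message = str(exc)

try:
    # Only five items are consumed, however long the left side is.
    check_lengths(iter(range(10 ** 9)), [1, 2, 3, 4])
except ValueError as exc:
    long_message = str(exc)

print(short_message)
print(long_message)
```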

ueshin · 2021-02-18T20:54:51Z

databricks/koalas/base.py

@@ -321,6 +322,9 @@ def spark_column(self) -> Column:
    __neg__ = column_op(Column.__neg__)

    def __add__(self, other) -> Union["Series", "Index"]:
+        if isinstance(other, (list, tuple)):
+            pindex_ops, other = check_same_length(self, other)
+            return ks.from_pandas(pindex_ops + other)  # type: ignore


Shall we avoid using # type: ignore where possible? We can use cast instead.
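
As a minimal sketch of the difference, using only the standard library: `typing.cast` tells the type checker which type to assume for one expression, while `# type: ignore` silences every error reported on that line. `load` and `total` are hypothetical names for illustration; `cast` has no effect at runtime.

```python
from typing import List, Union, cast


def load(as_list: bool) -> Union[List[int], int]:
    """Hypothetical function whose return type is wider than the caller needs."""
    return [1, 2, 3] if as_list else 0


def total() -> int:
    # The caller knows a list comes back here, so narrow the type explicitly
    # instead of appending "# type: ignore" to the line.
    values = cast(List[int], load(as_list=True))
    return sum(values)


print(total())  # 6
```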

ueshin · 2021-02-18T20:55:22Z

databricks/koalas/base.py

+        if isinstance(other, (list, tuple)):
+            other = ks.Index(other, name=self.name)  # type: ignore


not needed?

ueshin · 2021-02-18T21:01:54Z

databricks/koalas/base.py

@@ -495,7 +495,7 @@ def __rsub__(self, other) -> Union["Series", "Index"]:
                return -column_op(F.datediff)(self, F.lit(other)).astype("long")
            else:
                raise TypeError("date subtraction can only be applied to date series.")
-        return column_op(Column.__rsub__)(self, other)
+        return column_op(lambda left, right: right - left)(self, other)


Note that this case must be handled in lines 490-492. We can move back to Column.__rsub__.

ueshin · 2021-02-18T21:03:06Z

databricks/koalas/base.py

+        if isinstance(other, (list, tuple)):
+            other = ks.Index(other, name=self.name)  # type: ignore


not needed?

xinrong-meng · 2021-08-05T21:55:00Z

https://issues.apache.org/jira/browse/SPARK-36437

Add tests

2e20719

itholic force-pushed the series_op branch from 8c16164 to 2e20719 Compare February 15, 2021 12:49

itholic marked this pull request as draft February 15, 2021 13:22

Fix rsub

922ba6a

itholic commented Feb 15, 2021

itholic marked this pull request as ready for review February 15, 2021 14:45

itholic requested review from xinrong-meng, ueshin and HyukjinKwon February 17, 2021 02:44

HyukjinKwon reviewed Feb 17, 2021

itholic added 2 commits February 18, 2021 12:01

Add Index

02c3334

Fix test for mod and rmod

8df4ff9

itholic added 2 commits February 18, 2021 18:56

Use pandas

5d2d1c5

Fix test

0fd3666

xinrong-meng reviewed Feb 18, 2021

ueshin reviewed Feb 18, 2021
