Implement Series.factorize() #1972

xinrong-meng · 2020-12-16T21:50:54Z

ref #1929

        >>> kser = ks.Series(['b', None, 'a', 'c', 'b'])
        >>> codes, uniques = kser.factorize()
        >>> codes
        0    1
        1   -1
        2    0
        3    2
        4    1
        dtype: int64
        >>> uniques
        Index(['a', 'b', 'c'], dtype='object')

        >>> codes, uniques = kser.factorize(na_sentinel=None)
        >>> codes
        0    1
        1    3
        2    0
        3    2
        4    1
        dtype: int64
        >>> uniques
        Index(['a', 'b', 'c', None], dtype='object')

        >>> codes, uniques = kser.factorize(na_sentinel=-2)
        >>> codes
        0    1
        1   -2
        2    0
        3    2
        4    1
        dtype: int64
        >>> uniques
        Index(['a', 'b', 'c'], dtype='object')
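
The three examples above follow one rule: each non-missing value is coded by its position among the uniques, and a missing value gets either `na_sentinel` or, when `na_sentinel=None`, the next free code. Below is a minimal pure-Python sketch of that rule. It is not the Koalas implementation: `factorize_sketch` and `is_na` are hypothetical names, and sorting the uniques is an assumption made only to reproduce the outputs shown.

```python
import math


def is_na(x):
    """Hypothetical helper: treat None and float NaN as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def factorize_sketch(values, na_sentinel=-1):
    """Return (codes, uniques) for a list, mimicking the examples above.

    Assumption: uniques are sorted, which matches the sample output.
    """
    non_na = sorted({v for v in values if not is_na(v)})
    code_of = {v: i for i, v in enumerate(non_na)}
    has_na = any(is_na(v) for v in values)

    if na_sentinel is None:
        # Missing values become a regular category placed after the others.
        na_code = len(non_na)
        uniques = non_na + ([None] if has_na else [])
    else:
        # Missing values are marked with the sentinel and dropped from uniques.
        na_code = na_sentinel
        uniques = non_na

    codes = [na_code if is_na(v) else code_of[v] for v in values]
    return codes, uniques


data = ["b", None, "a", "c", "b"]
print(factorize_sketch(data))
print(factorize_sketch(data, na_sentinel=None))
print(factorize_sketch(data, na_sentinel=-2))
```

Run on the same input as the docstring, the sketch yields the same codes and uniques for the default sentinel, for `na_sentinel=None`, and for `na_sentinel=-2`.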

codecov-io · 2020-12-16T22:22:19Z

Codecov Report

Merging #1972 (2e538a7) into master (0e44bc7) will increase coverage by 0.06%.
The diff coverage is 95.00%.

@@            Coverage Diff             @@
##           master    #1972      +/-   ##
==========================================
+ Coverage   94.52%   94.58%   +0.06%     
==========================================
  Files          50       50              
  Lines       10952    11041      +89     
==========================================
+ Hits        10352    10443      +91     
+ Misses        600      598       -2

Impacted Files	Coverage Δ
databricks/koalas/missing/series.py	`100.00% <ø> (ø)`
databricks/koalas/series.py	`96.75% <95.00%> (+<0.01%)`	⬆️
databricks/koalas/plot/matplotlib.py	`92.62% <0.00%> (-0.73%)`	⬇️
databricks/koalas/generic.py	`92.60% <0.00%> (-0.01%)`	⬇️
databricks/koalas/missing/frame.py	`100.00% <0.00%> (ø)`
databricks/koalas/tests/plot/test_frame_plot.py	`100.00% <0.00%> (ø)`
...s/koalas/tests/plot/test_series_plot_matplotlib.py	`100.00% <0.00%> (ø)`
databricks/koalas/groupby.py	`91.60% <0.00%> (+0.02%)`	⬆️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ueshin · 2021-01-07T20:42:09Z

databricks/koalas/series.py

+            raise ValueError(
+                "Please set 'compute.max_rows' by using 'databricks.koalas.config.set_option' "
+                "to restrict the total number of unique values of the current Series. "
+                "Note that, before changing the 'compute.max_rows', "
+                "this operation is considerably expensive."
+            )


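ueshin (Jan 7, 2021): In this case, should we just collect all the data? cc @HyukjinKwon

xinrong-meng (Jan 8, 2021): What do you mean by "collect all the data"?

ueshin (Jan 8, 2021): Just do toPandas() without limits.

xinrong-meng (Jan 9, 2021): Got it! Changed it to call toPandas() without limits for now.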
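
The two options discussed in this thread can be contrasted with a small pure-Python sketch. It does not use Spark or Koalas: `collect_uniques` and its `max_rows` argument are hypothetical stand-ins, with `max_rows` playing the role of the 'compute.max_rows' option named in the diff. With a limit, the function refuses to proceed when there are too many unique values; with no limit, it simply collects everything, which is the behavior the reviewer suggested.

```python
def collect_uniques(values, max_rows=None):
    """Hypothetical sketch: gather the unique values of a column.

    max_rows=None means "collect without limits"; an integer enforces
    a cap and raises instead of returning a truncated result.
    """
    uniques = sorted(set(values))
    if max_rows is not None and len(uniques) > max_rows:
        raise ValueError(
            "Too many unique values: %d > max_rows=%d" % (len(uniques), max_rows)
        )
    return uniques


# Without a limit, every unique value is returned.
print(collect_uniques(list("banana")))

# With a limit that is too small, the guarded variant raises.
try:
    collect_uniques(list("banana"), max_rows=2)
except ValueError as exc:
    print("rejected:", exc)
```

The unguarded path is simpler for callers, at the cost of pulling every unique value into local memory.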

ueshin

Otherwise, LGTM.

ueshin · 2021-01-08T20:53:56Z

databricks/koalas/series.py

+        if na_sentinel is not None:
+            # Drops the NaN from the uniques of the values
+            non_na_list = [x for x in uniques_list if not pd.isna(x)]
+            if len(non_na_list) == 0:
+                uniques = pd.Index(non_na_list)
+            else:
+                uniques = ks.Index(non_na_list)
+        else:
+            uniques = ks.Index(uniques_list)


ueshin (Jan 8, 2021): I think we can always return pd.Index as uniques? cc @HyukjinKwon

xinrong-meng (Jan 8, 2021): Changed it to pd.Index for now.

ueshin · 2021-01-08T22:59:45Z

LGTM, pending tests.

ueshin · 2021-01-11T22:19:52Z

@xinrong-databricks Could you try the following as well?

>>> kser = ks.Series([1, 2, np.nan, 4, 5])
>>> kser.loc[3] = np.nan
>>> kser.factorize(na_sentinel=None)
(0    0
1    1
2    4
3    4
4    2
dtype: int32, Float64Index([1.0, 2.0, 5.0, nan, nan], dtype='float64'))

>>> kser.to_pandas().factorize(na_sentinel=None)
(array([0, 1, 3, 3, 2]), Float64Index([1.0, 2.0, 5.0, nan], dtype='float64'))
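
The duplicated `nan` in the Koalas result is what one would expect if two different missing-value markers are deduplicated separately and only later both displayed as `nan`. Whether that is the actual cause here is an assumption, suggested by the later commit title "pandas takes NaN and null to np.nan". The pure-Python sketch below illustrates the effect with `None` and `math.nan` standing in for the two markers; `naive_uniques` and `normalized_factorize` are hypothetical names.

```python
import math


def is_missing(x):
    """Treat both None and float NaN as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def naive_uniques(values):
    """Equality-based dedup: None and NaN stay as two separate entries."""
    seen = []
    for v in values:
        if not any(v is s or v == s for s in seen):
            seen.append(v)
    return seen


def normalized_factorize(values):
    """Collapse all missing markers into one category placed last,
    mirroring the pandas result for na_sentinel=None."""
    non_missing = []
    for v in values:
        if not is_missing(v) and v not in non_missing:
            non_missing.append(v)
    has_missing = any(is_missing(v) for v in values)
    uniques = non_missing + ([math.nan] if has_missing else [])
    na_code = len(non_missing)
    codes = [na_code if is_missing(v) else non_missing.index(v) for v in values]
    return codes, uniques


values = [1.0, 2.0, math.nan, None, 5.0]
print(naive_uniques(values))         # two missing entries survive
print(normalized_factorize(values))  # one missing entry, shared code
```

After normalization both missing rows share the code 3, matching the `array([0, 1, 3, 3, 2])` that pandas returns above.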

xinrong-meng · 2021-01-11T22:30:41Z

@ueshin Good catch! Let me look into this.

ueshin · 2021-01-13T19:41:04Z

Thanks! merging.

xinrong-meng · 2021-01-13T20:00:26Z

Thank you for reviewing and merging the PR! @ueshin :)

ueshin reviewed Dec 18, 2020

databricks/koalas/series.py

ueshin reviewed Dec 18, 2020

databricks/koalas/tests/test_series.py

xinrong-meng requested a review from ueshin December 21, 2020 17:32

xinrong-meng marked this pull request as ready for review December 21, 2020 17:34

xinrong-meng requested review from HyukjinKwon and itholic December 22, 2020 17:39

ueshin reviewed Dec 24, 2020

databricks/koalas/series.py

Prototype

a86af14

xinrong-meng force-pushed the seriesFactorize branch from d26b899 to a86af14 Compare January 4, 2021 18:01

xinrong-meng requested a review from ueshin January 4, 2021 18:02

xinrong-meng and others added 2 commits January 4, 2021 10:03

Merge branch 'master' into seriesFactorize

4adae2f

int32

a4799a7

ueshin reviewed Jan 7, 2021

xinrong-meng added 4 commits January 7, 2021 15:42

Typo

0e2915e

Deal with None

414f86e

non-na unique_to_code

017b01d

Deal with None and np.nan

ba9c375

ueshin reviewed Jan 8, 2021

databricks/koalas/series.py

xinrong-meng added 2 commits January 8, 2021 13:50

Codes as pd Series; return type

a011ff4

Alias for new col

d33d1d3

ueshin reviewed Jan 8, 2021

Fix type

e1ddc78

ueshin reviewed Jan 8, 2021

databricks/koalas/series.py

xinrong-meng added 4 commits January 8, 2021 16:00

NaN for FloatType or DoubleType

6c89cb9

Type in doc

f1d4c3e

toPandas w/o limit

1e6a727

Map<null, null> adjust

be2e4df

ueshin reviewed Jan 11, 2021

databricks/koalas/series.py

Branch earlier

fc94110

ueshin approved these changes Jan 11, 2021

xinrong-meng added 3 commits January 12, 2021 09:19

pandas takes NaN and null to np.nan

8f733e9

Adjust for pd version

3e3fac4

Comment

7c092c8

ueshin reviewed Jan 12, 2021

databricks/koalas/tests/test_series.py

Compare w respective pd version

2e538a7

ueshin merged commit ce2d260 into databricks:master Jan 13, 2021

itholic mentioned this pull request Aug 2, 2023

[SPARK-43567][PS] Support use_na_sentinel for factorize apache/spark#42270

Closed
