Implement 'weights' and 'axis' in sample at DataFrame and Series #1893

chi2liu · 2020-11-06T13:54:12Z

Implement sample. Resolves #1887

Supports the remaining parameters of the DataFrame sample function, such as n, axis, and weights.

Two cases are still unsupported:
1. axis=1 is not supported.
2. The weights parameter is not supported when frac > 1.
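
The two unsupported cases can be pictured as an up-front argument check. This is only a sketch: `check_sample_args`, the choice of `NotImplementedError`, and the error messages are assumptions for illustration, not code taken from this PR.

```python
def check_sample_args(axis=0, frac=None, weights=None):
    """Hypothetical guard for the two unsupported cases described above."""
    # Case 1: sampling columns (axis=1) is not supported.
    if axis in (1, "columns"):
        raise NotImplementedError("sample does not support axis=1 yet")
    # Case 2: weights cannot be combined with frac > 1.
    if frac is not None and frac > 1 and weights is not None:
        raise NotImplementedError("sample does not support weights when frac > 1")


# A supported combination passes silently.
check_sample_args(axis=0, frac=0.5, weights=[0.2, 0.8])

# Each unsupported combination is rejected with an explanatory message.
for kwargs in ({"axis": 1}, {"frac": 2.0, "weights": [0.5, 0.5]}):
    try:
        check_sample_args(**kwargs)
    except NotImplementedError as exc:
        print(exc)
```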

codecov-io · 2020-11-06T17:36:52Z

Codecov Report

Merging #1893 (4394f5f) into master (3237002) will increase coverage by 0.02%.
The diff coverage is 97.50%.

@@            Coverage Diff             @@
##           master    #1893      +/-   ##
==========================================
+ Coverage   94.20%   94.22%   +0.02%     
==========================================
  Files          40       41       +1     
  Lines        9939    10031      +92     
==========================================
+ Hits         9363     9452      +89     
- Misses        576      579       +3

Impacted Files	Coverage Δ
databricks/koalas/utils.py	`95.71% <ø> (-0.36%)`	⬇️
databricks/koalas/frame.py	`96.74% <97.43%> (+0.01%)`	⬆️
databricks/koalas/series.py	`96.97% <100.00%> (+<0.01%)`	⬆️
databricks/koalas/generic.py	`93.67% <0.00%> (-1.68%)`	⬇️
databricks/koalas/accessors.py	`93.00% <0.00%> (-0.04%)`	⬇️
databricks/koalas/internal.py	`96.46% <0.00%> (-0.04%)`	⬇️
databricks/koalas/base.py	`97.36% <0.00%> (-0.02%)`	⬇️
databricks/koalas/indexes.py	`96.81% <0.00%> (-0.01%)`	⬇️
databricks/koalas/indexing.py	`92.76% <0.00%> (ø)`
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3237002...4394f5f. Read the comment docs.

HyukjinKwon · 2020-11-09T05:48:09Z

@itholic can you review this please?

itholic · 2020-11-11T13:37:34Z

Sure, let me take a look

itholic · 2020-11-11T13:43:52Z

databricks/koalas/frame.py

+        Notes
+        -----
+        If `frac` > 1, `replacement` should be set to `True`.


itholic · 2020-11-11T13:51:44Z

databricks/koalas/frame.py

+            # Because ks.Series currently does not support the Series.__iter__ method,
+            # It cannot be initialized to the pandas Series, so here is to_pandas.
+            if isinstance(weights, ks.Series):
+                weights = pd.Series(weights.to_pandas(), dtype="float64")


Is weights always expected to be small enough that calling to_pandas() is safe?
If not, I think we'd better not support weights as a Series for now, since it could cause serious performance degradation.

chi2liu · 2020-11-12

You are right!
If the Series is relatively large, it may indeed cause performance problems.
Series weights are not supported for now.

itholic · 2020-11-11T14:00:06Z

databricks/koalas/frame.py

+                raise ValueError("weight vector may not include `inf` values")
+
+            if (weights < 0).any():
+                raise ValueError("weight vector many not include negative values")


nit: many -> may
I understand that you just followed pandas' message 👍, but it is obviously a typo.
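
For context, the checks in the hunk above can be sketched as a standalone function. The `inf` and negative-value checks come from the quoted diff (with the typo corrected); the helper name `validate_weights`, the length check, the NaN handling, and the normalization step are assumptions modelled on how pandas treats weights, not code from this PR.

```python
import numpy as np


def validate_weights(weights, axis_length):
    """Hypothetical helper: validate and normalize a weight vector."""
    weights = np.asarray(weights, dtype="float64")
    # Assumed check: one weight per row being sampled.
    if len(weights) != axis_length:
        raise ValueError("weights and axis to be sampled must be of same length")
    # From the quoted diff: infinite weights are rejected.
    if np.isinf(weights).any():
        raise ValueError("weight vector may not include `inf` values")
    # Assumed: missing weights count as zero.
    weights = np.where(np.isnan(weights), 0.0, weights)
    # From the quoted diff, with "many" corrected to "may".
    if (weights < 0).any():
        raise ValueError("weight vector may not include negative values")
    total = weights.sum()
    if total == 0:
        raise ValueError("Invalid weights: weights sum to zero")
    # Normalize so the result can be passed as probabilities `p`.
    return weights / total


print(validate_weights([1, 3], axis_length=2))
```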

ueshin · 2020-11-12T06:38:26Z

databricks/koalas/frame.py

+                withReplacement=replace, fraction=float(frac), seed=random_state
+            )
+            return DataFrame(self._internal.with_new_sdf(sdf))
+        locs = rs.choice(axis_length, size=n, replace=replace, p=weights)


What's the estimated size of locs? Could it also become huge,
e.g., for the case "A random 50% sample of the ``DataFrame`` with replacement"?

chi2liu · 2020-11-12

The rs.choice method is similar to numpy.random.RandomState.choice.
It returns as many elements as its size parameter, which in this case is n.
So the length of locs here is n.
If the n parameter is set, locs has n elements; otherwise it has n = int(round(frac * axis_length)) elements.
For the case "A random 50% sample of the ``DataFrame`` with replacement", locs will be half the size of the DataFrame; to be precise, int(round(0.5 * DataFrame.nrows)).
If n or frac is large, locs will be large.
Performance then depends on the take method, which in practice means the performance of iloc.
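
The size argument above can be checked with NumPy directly. This is a minimal sketch: `sample_positions` is a hypothetical wrapper around the `rs.choice` call quoted in the diff, and the row count of 1000 is an arbitrary illustration.

```python
import numpy as np


def sample_positions(axis_length, n=None, frac=None, replace=False,
                     weights=None, seed=None):
    """Hypothetical helper: draw row positions as the quoted line does."""
    if n is None:
        # Same rounding rule as described in the comment above.
        n = int(round(frac * axis_length))
    rs = np.random.RandomState(seed)
    # Returns an array with exactly `n` positions in [0, axis_length).
    return rs.choice(axis_length, size=n, replace=replace, p=weights)


locs = sample_positions(1000, frac=0.5, replace=True, seed=0)
print(len(locs))    # 500, i.e. int(round(0.5 * 1000))
print(locs.nbytes)  # memory held by the positions array in the calling process
```

Because the array grows linearly with n, the same call on a very large DataFrame would hold all sampled positions in one process.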

ueshin · 2020-11-12

In that case, I wouldn't recommend using take or iloc for this.
In a typical Koalas workload, the DataFrame can be huge, so locs would be too large for a single driver node.
Also, accessing rows by row number is inherently expensive on Spark (to be exact, Spark doesn't provide a way to access rows by row number), so iloc is heavy in Koalas, especially when locs is huge.

chi2liu · 2020-11-16

Yeah!
I agree with you: because Spark has no concept of a row index, iloc is a heavy operation.
But the weights parameter of sample, like iloc, has to depend on the row index.
The weights parameter assigns a weight to each row by position, so it may be necessary to access rows by row number.
There may be no other way to support the weights parameter right now.
Just as iloc is heavy but must be implemented on top of a row index, sample with weights may have to be based on a row index as well.

xinrong-meng · 2021-08-05T21:50:30Z

Hi @chi2liu, since Koalas has been ported to Spark, would you like to migrate this PR to the Spark repo? If not, I will port it next week.
The ticket is https://issues.apache.org/jira/browse/SPARK-36436. Thanks!

chenkai02 added 5 commits November 6, 2020 13:51

support sample
c8c0d49

Support the remaining parameters of the sample function of DataFrame, such as n, axis, weights. Now there are two unsupported situations: 1. does not support axis=1 2. If the value of the frac parameter > 1, the weights parameter is not supported
1634428

Merge branch 'sample-dev' of https://github.com/chi2liu/koalas into sample-dev
b82a8f1

Merge branch 'sample-dev' of https://github.com/chi2liu/koalas into sample-dev
9eca5c0

Merge branch 'sample-dev' of https://github.com/chi2liu/koalas into sample-dev
9e9d854

HyukjinKwon changed the title ~~implemnt sample~~ Implement 'weights' and 'axis' in sample at Dataframe and Series Nov 9, 2020

HyukjinKwon changed the title ~~Implement 'weights' and 'axis' in sample at Dataframe and Series~~ Implement 'weights' and 'axis' in sample at DataFrame and Series Nov 9, 2020

itholic reviewed Nov 11, 2020

ueshin reviewed Nov 12, 2020

1. For performance considerations, the weights parameter does not support str and series temporarily 2. Optimize part of the code based on review comments.
4394f5f

amueller mentioned this pull request Nov 24, 2020

Implementing the full functionality of the 'sample' function #1887

Open
