Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement 'weights' and 'axis' in sample at DataFrame and Series #1893

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

chi2liu
Copy link

@chi2liu chi2liu commented Nov 6, 2020

Implement sample. Resolves #1887

Support the remaining parameters of the sample function of DataFrame, such as n, axis, weights.

Now there are two unsupported situations:
1.does not support axis=1
2.If the value of the frac parameter > 1, the weights parameter is not supported

chenkai02 added 5 commits November 6, 2020 13:51
… such as n, axis, weights.

Now there are two unsupported situations:
1.does not support axis=1
2.If the value of the frac parameter > 1, the weights parameter is not supported
@codecov-io
Copy link

codecov-io commented Nov 6, 2020

Codecov Report

Merging #1893 (4394f5f) into master (3237002) will increase coverage by 0.02%.
The diff coverage is 97.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1893      +/-   ##
==========================================
+ Coverage   94.20%   94.22%   +0.02%     
==========================================
  Files          40       41       +1     
  Lines        9939    10031      +92     
==========================================
+ Hits         9363     9452      +89     
- Misses        576      579       +3     
Impacted Files Coverage Δ
databricks/koalas/utils.py 95.71% <ø> (-0.36%) ⬇️
databricks/koalas/frame.py 96.74% <97.43%> (+0.01%) ⬆️
databricks/koalas/series.py 96.97% <100.00%> (+<0.01%) ⬆️
databricks/koalas/generic.py 93.67% <0.00%> (-1.68%) ⬇️
databricks/koalas/accessors.py 93.00% <0.00%> (-0.04%) ⬇️
databricks/koalas/internal.py 96.46% <0.00%> (-0.04%) ⬇️
databricks/koalas/base.py 97.36% <0.00%> (-0.02%) ⬇️
databricks/koalas/indexes.py 96.81% <0.00%> (-0.01%) ⬇️
databricks/koalas/indexing.py 92.76% <0.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3237002...4394f5f. Read the comment docs.

@HyukjinKwon HyukjinKwon changed the title implemnt sample Implement 'weights' and 'axis' in sample at Dataframe and Series Nov 9, 2020
@HyukjinKwon HyukjinKwon changed the title Implement 'weights' and 'axis' in sample at Dataframe and Series Implement 'weights' and 'axis' in sample at DataFrame and Series Nov 9, 2020
@HyukjinKwon
Copy link
Member

@itholic can you review this please?

@itholic
Copy link
Contributor

itholic commented Nov 11, 2020

Sure, let me take a look

databricks/koalas/frame.py Outdated Show resolved Hide resolved
Comment on lines +7245 to +7247
Notes
-----
If `frac` > 1, `replacement` should be set to `True`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

# Because ks.Series currently does not support the Series.__iter__ method,
# It cannot be initialized to the pandas Series, so here is to_pandas.
if isinstance(weights, ks.Series):
weights = pd.Series(weights.to_pandas(), dtype="float64")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is weights always expected to small enough so that good to use to_pandas() ??
If not, I think we better don't support weights as a Series for now since it could occur serious performance degradation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right!
If the series is relatively large, it may indeed cause performance problems.
The series weight is not supported here right now.

databricks/koalas/frame.py Show resolved Hide resolved
raise ValueError("weight vector may not include `inf` values")

if (weights < 0).any():
raise ValueError("weight vector many not include negative values")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: many -> may
Of course I understand that you've just followed pandas' message 👍 , but It looks obviously typo.

databricks/koalas/frame.py Outdated Show resolved Hide resolved
withReplacement=replace, fraction=float(frac), seed=random_state
)
return DataFrame(self._internal.with_new_sdf(sdf))
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the estimated size of locs? Will it also be much huge?
e.g., for the case A random 50% sample of the ``DataFrame`` with replacement?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rs.choice method is similar to the numpy.random.RandomState.choice.
The return value of rs.choice is the size of the size parameter, in this case is n.
So the size of locs here is n.
If parameter n is set, locs is equal to n, otherwise locs is equal to n = int(round(frac * axis_length)).
For the case A random 50% sample of the ``DataFrame`` with replacement, the size of locs will be one-half of the size of the DataFrame, to be precise, it should be int(round(0.5 * DataFrame.nrows)).
If n is large, or frac is large, locs will be large.
Performance depends on the performance of the take method, which is actually the performance of the iloc method

Copy link
Collaborator

@ueshin ueshin Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I wouldn't recommend to use take or iloc for this.
Thinking of Koalas workload, the length of DataFrame could be so huge, then locs will be too huge for a single Driver node.
Also, row access by its row number is essentially heavy on Spark (to be exact, Spark doesn't provide the way to access rows by its row number), and so iloc is heavy in Koalas, especially if the locs is huge.

Copy link
Author

@chi2liu chi2liu Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah!
I agree with you, because spark does not have the concept of rowindex, iloc is a heavy operation.
But here the weights parameter of the sample function is the same as iloc, which must depend on the row index.
There may be no other way to support the weights parameter right now.
The weights parameter specifies the corresponding weight for the corresponding row, so it may be necessary to
access rows by its row number.
Just like iloc is a heavy operation, but it must also be implemented based on row index right now.
Maybe the current sample function supports weights operation and must also be based on row index.

…port str and series temporarily

2. Optimize part of the code based on review comments.
@xinrong-meng
Copy link
Contributor

Hi @chi2liu , since Koalas is ported to Spark, would you like to migrate this PR to Spark repo? If not, I will port it next week.
https://issues.apache.org/jira/browse/SPARK-36436 is the ticket. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implementing the full functionality of the 'sample' function
6 participants