
[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided #7301

Merged
merged 11 commits into scikit-learn:master on Sep 28, 2016

Conversation

nelson-liu
Contributor

Reference Issue

Fixes #6945, previous PR at #6947

What does this implement/fix? Explain your changes.

min_weight_fraction_leaf should work even when sample_weight is not given (in which case samples are assumed to have equal weight). I also tweaked the description of min_weight_fraction_leaf to make its purpose a bit clearer.
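A minimal sketch of the intended behaviour (dataset, sizes, and threshold here are illustrative, not taken from the PR):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)

# With this fix, min_weight_fraction_leaf takes effect even though no
# sample_weight is passed: each sample implicitly has weight 1, so a leaf
# must hold at least 0.25 * 100 = 25 units of weight.
constrained = DecisionTreeClassifier(min_weight_fraction_leaf=0.25,
                                     random_state=0).fit(X, y)
unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)

# The constrained tree should be no deeper than the unconstrained one.
print(constrained.tree_.max_depth, unconstrained.tree_.max_depth)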

Any other comments?

if sample_weight is None:
    min_weight_leaf = int(ceil(self.min_weight_fraction_leaf *
                               n_samples))
    min_samples_leaf = max(min_samples_leaf, min_weight_leaf)
Contributor Author

I'm not sure whether it's better to leave this block as it is, or

min_samples_leaf = max(min_samples_leaf, int(ceil(self.min_weight_fraction_leaf * n_samples)))
min_weight_leaf = 0 # (seeing as min_weight_leaf is unnecessary when sample_weight is None)

or simply omit the min_samples_leaf = max(...) and just set min_weight_leaf since both min_weight_leaf and min_samples_leaf are provided to the Criterion object.

Member

I don't understand what you mean by:

or simply omit the min_samples_leaf = max(...) and just set min_weight_leaf since both min_weight_leaf and min_samples_leaf are provided to the Criterion object.

But I think I am fine with the first two suggestions.

Contributor Author

OK, I'll just leave it as it is.

Member

The max seems odd to me. Why not just remove it?

Contributor Author

Yeah, that's what I was trying to propose in the last sentence of my original comment (it was probably unclear, sorry). I haven't tested it, but it should work... and it seems a bit more logical too.

Member

I think I'm for that.

Contributor Author

Yeah, it seems that removing the call to max works as well. The max here is implicit because it's already enforced in Criterion anyway.

@ogrisel
Member

ogrisel commented Aug 31, 2016

LGTM.

@ogrisel ogrisel changed the title [MRG] Fix min_weight_fraction_leaf to work when sample_weights are not provided [MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided Aug 31, 2016
@ogrisel ogrisel added this to the 0.18 milestone Aug 31, 2016
@jnothman
Member

You don't currently test interaction between min_samples_leaf and this

@jnothman jnothman modified the milestone: 0.18 Aug 31, 2016
@nelson-liu
Contributor Author

You don't currently test interaction between min_samples_leaf and this

Good idea, their interaction is simply that the max is the bound, right?

@amueller
Member

maybe test that with no sample weights, the two parameters have the same effect?
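A sketch of the kind of test being suggested (names, data, and thresholds are illustrative; as the next comment explains, the two parameters turn out not to be equivalent in general):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def test_min_weight_fraction_leaf_matches_min_samples_leaf():
    # Hypothetical test: with no sample_weight (implicit uniform weights),
    # one might expect a fractional weight constraint and a fractional
    # sample-count constraint to grow identical trees.
    X, y = make_classification(n_samples=100, random_state=0)
    by_weight = DecisionTreeClassifier(min_weight_fraction_leaf=0.2,
                                       random_state=0).fit(X, y)
    by_count = DecisionTreeClassifier(min_samples_leaf=0.2,
                                      random_state=0).fit(X, y)
    assert by_weight.tree_.node_count == by_count.tree_.node_count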

@nelson-liu
Contributor Author

nelson-liu commented Aug 31, 2016

@amueller so I'm actually of the opinion now that they shouldn't have the same effect... I pushed a test that checks if they are the same; it crashes and burns, but in a justified manner I think.

First, I changed the min_weight_leaf calculation formula to actually provide the minimum weight as opposed to using the same formula as min_samples_leaf. The parameter deals with weights, so we shouldn't be rounding up and turning it into an int (thus representing the min samples in the leaf). To further reinforce this, the Cython builder classes actually take min_weight_leaf as a double. This also mirrors the implementation in the else clause.
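A small sketch contrasting the two formulas described above, assuming uniform unit weights (the variable names are illustrative):

from math import ceil

min_weight_fraction_leaf = 0.1
n_samples = 5

# Previous formula: round up and cast to int, which effectively turns the
# threshold into a sample count.
old_min_weight_leaf = int(ceil(min_weight_fraction_leaf * n_samples))  # 1

# Changed formula: keep the threshold as a weight (a double), mirroring the
# else clause, where it is min_weight_fraction_leaf * sum(sample_weight).
new_min_weight_leaf = min_weight_fraction_leaf * n_samples  # 0.5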

So onto why they shouldn't have the same effect:
When calculating whether a node is a leaf in _tree.pyx, the following conditional is used:

is_leaf = ((depth >= max_depth) or
           (n_node_samples < min_samples_split) or
           (n_node_samples < 2 * min_samples_leaf) or
           (weighted_n_node_samples < min_weight_leaf))

Say that we fit on a dataset with 5 samples, and provide no sample_weight at fit (thus uniform weighting). If we set min_weight_leaf = .1 and min_samples_leaf = .1, the comparisons will be (assuming this is the first split, so n_node_samples = weighted_n_node_samples = 5):

5 < 2 * int(ceil(.1 * 5)) = 2 * 1 = 2   (the min_samples_leaf check)
5 < .1 * 5 = .5                         (the min_weight_leaf check)

Which is very different. This approach does have the downside that (in this case) setting min_weight_leaf = .1 essentially does nothing, because weighted_n_node_samples will always be greater than the resulting threshold of .5 (it is always at least 1). However, I think that's staying true to the parameter definition.

I think we should take this approach, but I feel like a warning might be good if min_weight_leaf * sum(sample_weights) < min(sample_weights). What do you think?
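A sketch of the proposed warning (a hypothetical helper, not part of this PR; the name and exact condition are illustrative):

import warnings
import numpy as np

def warn_if_min_weight_fraction_is_ineffective(min_weight_fraction_leaf,
                                               sample_weight):
    # If the derived weight threshold is smaller than the smallest single
    # sample weight, the constraint can never exclude a split, so it
    # silently does nothing.
    min_weight_leaf = min_weight_fraction_leaf * np.sum(sample_weight)
    if min_weight_leaf < np.min(sample_weight):
        warnings.warn("min_weight_fraction_leaf is too small to affect "
                      "the grown tree.")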

edit: there are cases where the two parameters have the same effect, but they do not always.

@nelson-liu
Contributor Author

nelson-liu commented Sep 1, 2016

There are probably some hand-crafted values that could yield equal trees... I suppose those could serve as a useful test? Not sure if it is necessary, though. What do you guys think?

@nelson-liu
Contributor Author

nelson-liu commented Sep 4, 2016

Since the two parameters don't give the same effect on uniformly weighted data when frac * total_weight == int(ceil(frac * total_weight)) (and this seems incorrect), I've opened an issue for it at #7338. If we decide to adopt the solutions in that issue and give the two parameters an equivalent interpretation, I'll verify that the test (checking that the two trees are equal) works. If not, I don't think it's a suitable test...

It does seem reasonable for the two grown trees to be equal under this scenario, though.

@nelson-liu
Contributor Author

This is ready to be looked at again; removing the +1 because there have been significant changes to the tests.

@nelson-liu nelson-liu changed the title [MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided [MRG] Fix min_weight_fraction_leaf to work when sample_weights are not provided Sep 5, 2016
@raghavrv
Member

raghavrv commented Sep 6, 2016

Say that we fit on a dataset with 5 samples, and provide no sample_weight at fit (thus uniform weighting). If we set min_weight_leaf = .1 and min_samples_leaf = .1

I am sorry but I don't get what you are trying to say. min_samples_leaf is an int and min_weight_leaf is a float. Why are you giving a float value to min_samples_leaf?

@@ -796,7 +796,8 @@ class RandomForestClassifier(ForestClassifier):

     min_weight_fraction_leaf : float, optional (default=0.)
         The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        leaf node where weights are determined by ``sample_weight`` provided
@raghavrv (Member) Sep 6, 2016

I'd write it as -

The minimum weighted fraction of the sum total of weights (of all the
input samples) required to be at a leaf node.

@nelson-liu
Contributor Author

I am sorry but I don't get what you are trying to say. min_samples_leaf is an int and min_weight_leaf is a float. Why are you giving a float value to min_samples_leaf?

min_samples_leaf can actually be a float, from the documentation:

If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
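For illustration, the fraction-to-count conversion described there (values are illustrative):

from math import ceil

n_samples = 5
min_samples_leaf = 0.1                     # interpreted as a fraction
print(ceil(min_samples_leaf * n_samples))  # -> 1 sample required per leaf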

@amueller
Member

amueller commented Sep 6, 2016

On 08/31/2016 07:52 PM, Nelson Liu wrote:

Say that we fit on a dataset with 5 samples, and provide no sample_weight at fit (thus uniform weighting). If we set min_weight_leaf = .1 and min_samples_leaf = .1, the comparisons will be (assuming this is the first split, so n_node_samples = weighted_n_node_samples = 5):

I was talking about the case where min_weight_leaf * n_samples is an integer. It seems to me that in this case they should have the same behavior.

I'm surprised about the 2* with min_samples_leaf but not with min_weight_leaf. Why is that?

@nelson-liu
Contributor Author

nelson-liu commented Sep 6, 2016

I was talking about the case where min_weight_leaf * n_samples is an integer. It seems to me that in this case they should have the same behavior.

@jnothman and I discussed this a bit in #7338 (specifically #7338 (comment)); could you check it out?

I'm surprised about the 2* with min_samples_leaf but not with min_weight_leaf. Why is that?

I believe it's because weights can be split, but samples cannot.

@jnothman
Member

jnothman commented Sep 6, 2016

I'm surprised about the 2* with min_samples_leaf but not with min_weight_leaf. Why is that?

The min_weight_leaf condition shouldn't be there. It's plain wrong. The min_samples_leaf condition is only an optimisation, and the same kind of optimisation can't be easily achieved for weight. The real work happens in the splitter.

@jnothman
Member

jnothman commented Sep 7, 2016

I'm tempted to follow the second option at #6945 and either raise an error or a warning if sample_weight is None and min_weight_fraction_leaf is not None.

@jnothman
Member

jnothman commented Sep 7, 2016

But I'm okay with the "assume sample_weight=1" solution too..?

@nelson-liu
Contributor Author

I'm tempted to follow the second option at #6945 and either raise an error or a warning if sample_weight is None and min_weight_fraction_leaf is not None.

That solution would definitely be far easier to implement, but I think that assuming sample_weight=1 is more intuitive. However, it's equally counterintuitive that the results for min_weight_fraction_leaf do not match up with those for min_samples_leaf in this case. Ideally, the best thing to do would be to fix what's going on in the splitter / tree to make the two grown trees the same, but doing that seems a bit out of scope of this PR. I'm fine with leaving it as it is or raising a warning / error, how about others?

@jnothman
Member

jnothman commented Sep 8, 2016

No, the difference is not a bug and should not be fixed.


@jnothman
Member

jnothman commented Sep 8, 2016

I'm happy to accept what you implemented too.


@nelson-liu
Contributor Author

No, the difference is not a bug and should not be fixed.

ok

I'm happy to accept what you implemented too.

In that vein, I've addressed @raghavrv's comments. Perhaps this would be good for 0.18?

-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node.
Member

It is worth noting, "Samples have equal weight when sample_weight is not provided, but min_samples_leaf is more efficient."

@jnothman jnothman changed the title [MRG] Fix min_weight_fraction_leaf to work when sample_weights are not provided [MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided Sep 26, 2016
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided, but
Member

I think we should now drop "but min_samples_leaf is more efficient" if we've realised we can use the 2 * min_weight_leaf change.

@jnothman
Member

This actually has @ogrisel's +1 above as well. So it's got +2 assuming nothing substantial has changed since then. Also LGTM

@jnothman jnothman changed the title [MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided [MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided Sep 26, 2016
@jnothman
Member

Pending that minor change, I should say.

@nelson-liu
Contributor Author

Sorry for getting back to this so late, I've been a bit busy recently. I pushed the changes to the docstrings that were requested; is there anything else needed?

@jnothman
Member

A what's new entry is needed. Please put it under 0.19, as I think this has missed the boat for 0.18.


@nelson-liu
Contributor Author

@jnothman thanks for the reminder. I put this under "Enhancements"; do you think "Bug Fixes" would be better?

@jnothman
Member

If you state "previously it was silently ignored", then it belongs in bug fixes :)

@nelson-liu
Contributor Author

@jnothman good point, that info is important. Added and moved to bug fixes.

@jnothman jnothman merged commit da118d0 into scikit-learn:master Sep 28, 2016
@jnothman
Member

thanks @nelson-liu

@amueller, I've currently assumed this is not for 0.18. Feel free to backport and move the what's new if you disagree.

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
…not provided (scikit-learn#7301)

* fix min_weight_fraction_leaf when sample_weights is None

* fix flake8 error

* remove added newline and unnecessary assignment

* remove max bc it's implemented in cython and add interaction test

* edit weight calculation formula and add test to check equality

* remove test that sees if two parameter build the same tree

* reword min_weight_fraction_leaf docstring

* clarify uniform weight in forest docstrings

* update docstrings for all classes

* add what's new entry

* move whatsnew entry to bug fixes and explain previous behavior
amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 14, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017