Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modified BaseDecisionTree so that min_weight_fraction_leaf works when… #6947

Conversation

ben519
Copy link

@ben519 ben519 commented Jun 29, 2016

Reference Issue

Addresses #6945

Changes

min_weight_fraction_leaf should work even when sample_weight is not given (in which case samples are assumed to have equal weight). I also tweaked the description of min_weight_fraction_leaf to make its purpose a bit more clear.

Other comments

I'm fairly new to open-source collaboration as well as scikit-learn. I think my changes have room for improvement. For example, if sample_weight is given as None by the user, I reset it to an array of all 1s. I'm guessing this is a bad solution. However, I could use help figuring out where and how to implement the correct solution. I'm also slightly unsure the best way to test my changes. Would really appreciate someone offering 5 or 10 minutes of their time via Skype or Google Hangout so I can become a contributor and hopefully add more contributions in the future.

… sample_weight is None and improved parameter description

… sample_weight is None and improved parameter description
if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
if self.min_weight_fraction_leaf != 0.:
if sample_weight is None:
sample_weight = np.repeat(1., n_samples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, we don't need to explicitly do this, since all we want is the sum of weights, i.e. "total weight". So you could just do min_weight_leaf = self.min_weight_fraction_leaf * n_samples.

Secondly, I see why you might want this, but it now replicates the function of min_samples_leaf. So if this is the way we go, i.e. rather than a warning, you can actually just use

min_samples_leaf = max(min_samples_leaf, int(ceil(self.min_weight_fraction_leaf * n_samples)))

In terms of testing, you should look at existing tests for min_samples_leaf and check that min_weight_fraction_leaf offers the same behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When sample_weight is None, then samples are equally weighted in the Cython code. Does this addition really change anything? I believe behaviours should be identical. Do you have counter-examples where the trees that are built are actually different?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sample_weight is None then min_weight_fraction_leaf has no effect. I think that's quite clear from the else case here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my bad, I was not aware of that else clause. Indeed. Ouch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we fix this in the cython code?

@jnothman
Copy link
Member

In terms of advice by skype et al, you might find the scikit-learn channel on Gitter helpful.

@@ -297,9 +297,13 @@ def fit(self, X, y, sample_weight=None, check_input=True,
sample_weight = expanded_class_weight

# Set min_weight_leaf from min_weight_fraction_leaf
if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
if self.min_weight_fraction_leaf != 0.:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't actually need this if

@jnothman
Copy link
Member

@ben519 do you intend to finish off your work on this?

@ben519
Copy link
Author

ben519 commented Aug 24, 2016

@jnothman I definitely can't this week. I can give it a shot this weekend, but I'd be happier if someone more knowledgeable than me took this over.

@nelson-liu
Copy link
Contributor

if no one else is available, i can do it. i'll be traveling the next week but starting sept 3 i'll have ~ a week free i can use to fixing this up. might be a good test to see if all that gsoc work w/ tree paid off

@raghavrv
Copy link
Member

@nelson-liu please go ahead and submit a PR!

@ogrisel
Copy link
Member

ogrisel commented Aug 31, 2016

Closing this in favor of #7301.

@ogrisel ogrisel closed this Aug 31, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants