
Docs say parameter sample_weight of LinearRegression.fit must be array but number is also valid #28732

Open
miguelcsilva opened this issue Mar 31, 2024 · 9 comments · May be fixed by #28741

@miguelcsilva
Contributor

Describe the issue linked to the documentation

The documentation page for the fit method of the LinearRegression class states that the sample_weight parameter must be of type array_like or None (docs). However, this is not entirely accurate, since a float or int can also be passed for this parameter. A float or int gets transformed into an array where that same value is repeated n_samples times. Code snippet here:

if sample_weight is None:
    sample_weight = np.ones(n_samples, dtype=dtype)
elif isinstance(sample_weight, numbers.Number):
    sample_weight = np.full(n_samples, sample_weight, dtype=dtype)

This means that a float or int sample_weight is essentially equivalent to None, since all samples end up with the same relative weight (not sure if I'm overlooking something, but I could not think of any case where a float or int for sample_weight would be meaningful).
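A minimal sketch in plain NumPy (not scikit-learn internals) of why a scalar sample_weight behaves like None: only relative weights matter in weighted least squares, so uniform weights of any positive value leave the solution unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

def weighted_lstsq(X, y, sample_weight):
    # Weighted least squares via sqrt-weight rescaling of the rows.
    sw = np.sqrt(sample_weight)
    return np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

coef_none = weighted_lstsq(X, y, np.ones(20))         # like sample_weight=None
coef_scalar = weighted_lstsq(X, y, np.full(20, 3.5))  # like sample_weight=3.5
assert np.allclose(coef_none, coef_scalar)
```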

Suggest a potential alternative/fix

I see two possible fixes:

  • Change the documentation to state that numbers are valid values for sample_weight; however, they have no effect, since the relative weight of the samples is unchanged.
  • Change the code so that an error or warning is raised if the sample_weight parameter is a float or an int.
@miguelcsilva miguelcsilva added Documentation Needs Triage Issue requires triage labels Mar 31, 2024
@glemaitre glemaitre removed the Needs Triage Issue requires triage label Apr 1, 2024
@glemaitre
Member

Indeed, PR welcome to fix the documentation.

@miguelcsilva
Contributor Author

Indeed, PR welcome to fix the documentation.

Ok, will start working on it later today

@miguelcsilva miguelcsilva linked a pull request Apr 1, 2024 that will close this issue
@betatim
Member

betatim commented Apr 2, 2024

Given that providing the same value for all samples is a bit weird, does anyone remember why this is supported in the first place?

@ogrisel
Member

ogrisel commented Apr 2, 2024

Given that providing the same value for all samples is a bit weird, does anyone remember why this is supported in the first place?

I guess we need to edit the code to raise an error and re-run the tests to check whether passing a scalar int or float is used anywhere in our code base, in particular as part of the public-facing API.

If it's not used anywhere, maybe we can deprecate this.

@betatim
Member

betatim commented Apr 2, 2024

My guess is that _check_sample_weight supports int/float weights because it is used across many different estimators. For some of them it might make sense to use uniform integer weights. So we'd have to specifically check in LinearRegression if someone passed int/float weights.

Or maybe we leave this undocumented and don't introduce a check. Nothing bad/incorrect happens if you pass an int/float as weight; it is just unusual. Maybe we need to balance this against helping users who passed an int/float by mistake. For those, it would be useful to raise an error.

@kadarakos

kadarakos commented Apr 2, 2024

Following @betatim's point that this technically works, but was probably a mistake, an intermediate solution could be to just warn for now:

Warning: a single number {sample_weight} was provided as "sample_weight", each sample will receive the same weight of {sample_weight}.
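A hedged sketch of what such a warning could look like, as a standalone helper (the function name and behavior here are hypothetical, not scikit-learn's actual _check_sample_weight):

```python
import numbers
import warnings

import numpy as np

def check_sample_weight_warning(sample_weight, n_samples, dtype=np.float64):
    """Hypothetical variant of sample-weight validation that warns on scalars."""
    if sample_weight is None:
        return np.ones(n_samples, dtype=dtype)
    if isinstance(sample_weight, numbers.Number):
        warnings.warn(
            f'a single number {sample_weight} was provided as "sample_weight", '
            f"each sample will receive the same weight of {sample_weight}."
        )
        return np.full(n_samples, sample_weight, dtype=dtype)
    return np.asarray(sample_weight, dtype=dtype)
```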

@jeremiedbb
Member

Raising an error in _check_sample_weight (which is not used everywhere, so the results might not show the whole picture) when passing a float makes the following tests fail:

sklearn/linear_model/_glm/tests/test_glm.py::test_sample_weights_validation FAILED                                                                               [  8%]
sklearn/linear_model/tests/test_base.py::test_raises_value_error_if_sample_weights_greater_than_1d[2-3] FAILED                                                   [ 16%]
sklearn/linear_model/tests/test_base.py::test_raises_value_error_if_sample_weights_greater_than_1d[3-2] FAILED                                                   [ 25%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-False-None] FAILED                                                  [ 33%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-False-csr_matrix] FAILED                                            [ 41%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-False-csr_array] FAILED                                             [ 50%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-True-None] FAILED                                                   [ 58%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-True-csr_matrix] FAILED                                             [ 66%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-True-csr_array] FAILED                                              [ 75%]
sklearn/linear_model/tests/test_ridge.py::test_raises_value_error_if_sample_weights_greater_than_1d FAILED                                                       [ 83%]
sklearn/utils/tests/test_validation.py::test_check_sample_weight FAILED

Some tests just check that passing a float works. Some check consistency, i.e. passing a float has the same effect as passing None.

@jeremiedbb
Member

To me, keeping support for float can be handy if we want to extend our tests for consistent sample_weight behavior, as discussed in e.g. #15657. But it's not a big problem, since we can just pass an array with all equal elements.

There is one use case where I can see supporting a float being convenient: learning on minibatches where, for some reason, you want to apply the same weight to all elements within a batch but a different weight between batches. Again, doable by passing an array with all equal elements, but less convenient.
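A sketch of that minibatch use case, assuming a scalar-accepting fit: with scalar support the caller could pass the per-batch weight directly, while without it they must expand the scalar into a constant array themselves (which is all the scalar path does internally):

```python
import numpy as np

# One scalar weight per batch: equal within a batch, different across batches.
batches = [np.ones((4, 2)), np.ones((6, 2))]
batch_weights = [0.5, 2.0]

expanded = []
for X_batch, w in zip(batches, batch_weights):
    # The explicit array a user must build if scalars were not accepted.
    sample_weight = np.full(X_batch.shape[0], w)
    expanded.append(sample_weight)
```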

@miguelcsilva
Contributor Author

Some tests just check that passing a float works. Some check consistency, i.e. passing a float has the same effect as passing None.

Is that something that should be done, i.e., use the _check_sample_weight function each time the fit method of an estimator accepts the sample_weight parameter? I suppose that could help ensure consistency of the API.

I think if this were being done from scratch, my inclination would be to only accept an array or None, which seems a bit more coherent with the underlying concept. But since float is currently accepted, I also don't see much harm in sticking with that, since nothing is inherently wrong about the current logic.

Perhaps the only downside would be undetected bugs. Do any of the maintainers have some intuition for how likely a bug of this kind would be to occur in someone's code? It doesn't seem too likely to me, but I'm quite new to open source.

@lorentzenchr lorentzenchr added the Needs Decision Requires decision label Apr 5, 2024
8 participants