Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inf and NaN in weights #904

Open
aplavin opened this issue Nov 29, 2023 · 7 comments
Open

Inf and NaN in weights #904

aplavin opened this issue Nov 29, 2023 · 7 comments

Comments

@aplavin
Copy link
Contributor

aplavin commented Nov 29, 2023

Currently, weights throws an error whenever it encounters non-finite values: ArgumentError: weights cannot contain Inf or NaN values. That's highly surprising IMO, basically all other operations propagate NaNs and allow Infs – both basic arithmetics and aggregations.

# totally sensible and unambiguous results:
julia> median([1,2,3,Inf])
2.5
julia> mean([1,2,3,Inf])
Inf
julia> mean([1,2,3,Inf], weights([1,1,1,1]))
Inf

julia> mean([1,2,3,NaN])
NaN
julia> median([1,2,3,NaN])
NaN

# and suddenly:
julia> mean([1,2,3,4], weights([1,1,Inf,1]))
ERROR: ArgumentError: weights cannot contain Inf or NaN values
julia> mean([1,2,3,4], weights([1,1,NaN,1]))
ERROR: ArgumentError: weights cannot contain Inf or NaN values

I looked up as found that this error was introduced after the issue #671. I think throwing this exception denies perfectly sensible behavior, so that weighted aggregations worked exactly as their unweighted counterparts – by propagating NaNs. And the throw in weights() should be removed, maybe adding it in specific aggregations where Infs/NaNs really do not make any sense (eg, mean and median should work).

Numpy does propagate, of course:

In [2]: np.average([1,2,3], weights=[1,np.nan,3])
Out[2]: nan
@aplavin
Copy link
Contributor Author

aplavin commented Dec 27, 2023

Bump...

@aplavin
Copy link
Contributor Author

aplavin commented Jan 16, 2024

Gentle bump...
I understand that developer time is always scarce, so: would a PR fixing this be accepted?

@devmotion
Copy link
Member

I wasn't involved in the previous discussions in #671 and #814, but my impression was that the errors are intentional and the previous behaviour (no errors) was considered a bug. I think I tend to agree with this view since weights are supposed to sum to 1 (or at least it should be able to normalize them to such weights), which is not possible if a weight is NaN or Inf.

@aplavin
Copy link
Contributor Author

aplavin commented Jan 16, 2024

The whole point of NaN is that it propagates, and plain aggregations play nicely with that – mean of an array with NaN is NaN. This is natural and convenient when doing many aggregations, with a few of them having NaNs: they just appear in the correct locations in the results, instead of requiring try/catch or manual isnan checks everywhere.
It's surprising and inconvenient that as soon as one needs weighted aggregations, suddenly there's no way around – manual trycatch/isnan checks are required.

@devmotion
Copy link
Member

I don't see a clear inconsistency, there's no restriction on the values and Inf/NaN propagate in the same way for weighted operations:

julia> using StatsBase

julia> mean([1,2,3,Inf])
Inf

julia> mean([1,2,3,Inf], weights([1,1,1,1]))
Inf

julia> mean([1,2,3,NaN])
NaN

julia> mean([1,2,3,NaN], weights([1,1,1,1]))
NaN

The restriction only applies to the weights themselves because they have to sum up to 1.

@aplavin
Copy link
Contributor Author

aplavin commented Jan 16, 2024

The restriction only applies to the weights themselves because they have to sum up to 1.

And if they don't because there's a NaN – then the result is... Not a Number :)
Exactly the same as with mean([1, NaN]) that already returns NaN because there's no actual number that represents the result. I don't really see any difference in these situations, and the same motivation works in both cases.

Consider two scenarios:

Have an array of arrays, and want to compute mean of each inner array:

mean.(A)

and NaN are correctly propagated.

Have an array of arrays of values, and array of arrays of weights, and want to compute weighted means?

mean.(A, weight.(W))

doesn't work as soon as there are any NaNs inside W, and requires manual handling.

@aplavin
Copy link
Contributor Author

aplavin commented Jan 21, 2024

For those stumbling across the same issue: AbstractWeights and many functions that take them actually support NaNs. Defining another type MyWeights <: AbstractWeights without NaN checks in the constructor will work just fine!

Or, even easier – just overload the Weights type constructor by putting this code anywhere you like:

using StatsBase
@generated Weights{S,T,V}(values, sum) where {S<:Real, T<:Real, V<:AbstractVector{T}} = Expr(:new, Weights{S,T,V}, :values, :sum)

That's what I personally do.

It doesn't change any existing behavior, just adds support for non-finite values in weight vectors.
Then:

julia> mean([1,2], weights([1,NaN]))
NaN

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants