Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add weights argument to sum #33310

Closed
wants to merge 1 commit into from
Closed

RFC: Add weights argument to sum #33310

wants to merge 1 commit into from

Conversation

nalimilan
Copy link
Member

This will be consistent with the weights argument to be added to similar functions in Statistics. Since that code is needed notably to computed the weighted mean in Statistics, I figured it is better to make it public rather than only defining internal helpers for mean.

This is based on code from StatsBase, of which one part needs to live in LinearAlgebra since it depends on BLAS and dot.

Extracted from #31395.

This will be consistent with the `weights` argument to be added to similar functions
in Statistics. Since that code is needed notably to computed the weighted `mean`
in Statistics, I figured it is better to make it public rather than only defining
internal helpers for `mean`.

This is based on code from StatsBase, of which one part needs to live in LinearAlgebra
since it depends on BLAS and `dot`.
If `dims` is provided, return an array of sums over these dimensions.
If `weights` is provided, return the weighted sum(s). `weights` must be
either an array of the same size as `A` if `dims` is omitted,
or a vector with the same length as `size(A, dims)` if `dims` is provided.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should weights have to be an array, as opposed to just an iterable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how things work currently in StatsBase. I guess we could make this more general at least for the simple case where dims=:. But that would require adding a separate method as @simd only works when using indices.

@stevengj
Copy link
Member

stevengj commented Sep 18, 2019

I wonder whether this should be an iterator instead? i.e. you could imagine:

sum(Iterators.multiply(x, weights))

s = add_sum(s0, s0)
@inbounds @simd for i in eachindex(A, w)
s += A[i] * w[i]
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be much less accurate than the non-weighted algorithm based on pairwise summation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, probably (I haven't considered this, that's just a port of the StatsBase code with some cleanup). We could improve this, but note that for BlasReal the BLAS-optimized dot is called instead, which has different properties anyway (not sure whether it's going to be more accurate or less).

If we want to preserve accuracy, that would be an argument in favor of this PR, since a naive broadcast call won't use pairwise summation.

@stevengj
Copy link
Member

stevengj commented Sep 18, 2019

In particular, I could imagine an algorithm in which Iterators.multiply(x, weights) produces either:

  1. An Iterators.Multiply(x, weights) object, similar to z[1]*z[2] for z in zip(x,weights) except having an eltype (allowing e.g. empty sums to work correctly).

  2. If x and weights are both abstract arrays, then a MultipliedArray(x, weights) <: AbstractArray object for which getindex(_, i) produces x[i]*weights[i].

Both of these could then be passed to the existing sum function (as well as other functions, e.g. external packages like xsum), and in the second case it would do an accurate pairwise summation.

(It could also be made even more generic by passing a binary op(x,y) to the iterator.)

Seems like it would take less code than what you are doing now, would be more accurate (for arrays), and would be more composable.

@stevengj
Copy link
Member

Or we could just wait for sum(@lazy x .* weights) ala #19198.

@stevengj
Copy link
Member

stevengj commented Sep 18, 2019

But with regards to adding this functionality to sum specifically, why do we need that when we have dot(weights, x) or @views dot(weights, x[...]), since it seems like that is basically the functionality here?

@stevengj stevengj added the domain:maths Mathematical functions label Sep 18, 2019
@simeonschaub
Copy link
Member

@stevengj Wouldn't sum(Base.broadcasted(*, x, weights)) do the exact same thing?

@nalimilan
Copy link
Member Author

nalimilan commented Sep 19, 2019

As I wrote in the description, the main point of this PR is that we're going to add the weights keyword argument to many statistics functions (mean, var, cor, quantile...), so adding it to sum would be consistent. I'm not too attached to it, we can keep it internal for use by mean in Statistics.

Another reason is that the current behavior (implemented in StatsBase) differs from broadcast in that you provide a weights vector corresponding to each slice defined according to the dims argument: row weights when summing over columns, or vice-versa. So the Iterators.multiply(x, weights) or @lazy x .* weights approach won't be enough to replicate this behavior. I can't say how useful it is for sum (and mean), but that sounds logical, and again consistent with what is needed for e.g. cor (where an observation is a slice, not a single value).

A final reason can be that the API from this PR allows using efficient and/or accurate algorithms under the hood, e.g. pairwise summation or BLAS-powered dot (as you noted above).

If we decide that the weighted sum/mean over dimensions use case isn't worth all that code, then yes, this PR isn't very useful.

@nalimilan nalimilan changed the title Add weights argument to sum RFC: Add weights argument to sum Sep 19, 2019
@tkf
Copy link
Member

tkf commented Sep 19, 2019

pairwise summation

This would be handled if we can use mapreduce on Broadcasted, right? This is implemented in #31020. It would be nice if we can reuse the code in #31020 here even if it were just for internals (#31020 is not about the surface syntax).

BLAS-powered dot

In principle, we can handle sum(@lazy x .* y) specially to call BLAS.dot directly if x and y are both DenseArray etc., because the information required to do this is accessible in the type of the Broadcasted object. Not sure if it is worth the trouble, tough.

@nalimilan
Copy link
Member Author

This [pairwise summation] would be handled if we can use mapreduce on Broadcasted, right?

Ah, yeah, I guess we could add special methods for that. Not sure whether that's OK given that it would give different results from the general algorithm, but since increased accuracy is a good thing maybe that's not a problem. The same applies to using dot.

@oxinabox
Copy link
Contributor

oxinabox commented Oct 24, 2019

We should have this.
It makes sense especially as sum in the generalization of count

If one wants a weighted tollerance for a cerain amount of missing data then one can right now get weighted relative confidance like:

mean(ismissing, xs, weights) < 0.1|| error("too much missing data > 10% under weighting")

But if you want to do this with an absolute count,
say 10 weighted missing values,
you can't.

sum would fix that.


This generalizes to over dimentions.
Real code I am using to drop rows will too much missing

    missings = ismissing.(X)  # no `mean(f, x, wv; dims)` function
    mask = vec(mean(missings, wv..., dims=1) .< 0.1)
    X[mask, :]

Currently harder to write the absolute tolerance version

@StefanKarpinski
Copy link
Sponsor Member

My worry is that this opens the door to adding weights support everywhere. Why can't this be done externally in say a Weights.jl package? The methods seem like they should be non-overlapping.

@nalimilan
Copy link
Member Author

Well keyword arguments cannot be handled in packages unfortunately...

The current design in StatsBase relies on dispatch, with methods like sum(::AbstractArray, ::AbstractWeights), requiring people to do things like sum(x, weights(w)). That gives some weird things sometimes, e.g. var(x, weights(w)) is the weighted variance of x but cov(x, weights(w)) is the covariance of x and w, since w isa AbstractWeights <: AbstractArray (JuliaStats/StatsBase.jl#409).

Also, people have often complained that they expected e.g. sample(x, w) to work, but instead they had to write sample(x, weights(w)) (JuliaStats/StatsBase.jl#335). A keyword argument would be more standard and probably less confusing. Overall the problem is that for some signatures there's no ambiguity so we could accept any AbstractVector as weights, but for others we have to require AbstractWeights for dispatch. In the end the result is a big inconsistent and confusing. Probably not the end of the world though. @ararslan will likely want to comment as he was a big supporter of the keyword argument approach.

@oxinabox
Copy link
Contributor

I also support the kwarg approach.
It just feels right, weights are like dims.

@tkf
Copy link
Member

tkf commented Oct 28, 2019

but cov(x, weights(w)) is the covariance of x and w

Maybe it was already considered, but why not use cov(weighted(x, w)) where weighted(x, w) is effectively a lazy x .* w?

@StefanKarpinski
Copy link
Sponsor Member

weighted(x, w) seems much more composable and can be implemented completely externally.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Oct 28, 2019

theoretically, sum(for x .* w) is even more composable (and concise)

@StefanKarpinski
Copy link
Sponsor Member

I disagree: it's often convenient to associate values and weights together a single time and thereafter use them as though they were just values (implicitly weighted). Using a lazy construct like for x .* w can be consumed only once and doesn't support indexing, etc.

@nalimilan
Copy link
Member Author

As noted before, broadcasting wouldn't work as the weights vector is broadcast to the dimension over which reduction in performed, rather than always treated as a column vector.

But weighted(w, x) is interesting. However, note that for operations on multiple arguments it would differ from the unweighted syntax: cov(x, y) would become cov(weighted(w, x, y)). I guess that would be OK if we consider that weighted does something like hcat. But if we really want it to return an AbstractArray object it will have to take an additional argument specifying whether weights apply to rows or columns when the argument is a matrix, which will imply a repetition like sum(weighted(w, x, dims=1), dims=1), unless we want to automatically assume that you want to reduce along that dimension (which is the most useful operation). Need to think about it.

@pdeffebach
Copy link
Contributor

Is this getting merged? I support using keyword arguments instead of types, in which case this would need to live here as opposed to Statistics.

@nalimilan
Copy link
Member Author

For the record, something I hadn't realized when I wrote my comments above: weighted(w, x) absolutely cannot be a lazy equivalent of w .* x, as weighting is equivalent to multiplication by weights only in very peculiar cases like sum and mean. For quantile you certainly don't want to multiply each entry by its weight, but rather repeat it (at least for frequency weights). For var, the bias correction depends on the kind of weight. And so on.

So the wrapper approach would only be used for dispatch, and the resulting object would implement almost no methods. Each function would have to opt-in to support it as appropriate in its particular case.

@ViralBShah
Copy link
Member

ViralBShah commented Sep 6, 2022

Given that we want to move Statistics out (#46501), perhaps we should close this and adopt what we need in Statistics.jl?

@ViralBShah ViralBShah marked this pull request as draft September 6, 2022 03:13
@nalimilan nalimilan closed this Sep 6, 2022
@nalimilan nalimilan deleted the nl/wsum branch September 6, 2022 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:maths Mathematical functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants