RFC: Add weights argument to sum #33310

nalimilan · 2019-09-18T10:58:29Z

This will be consistent with the weights argument to be added to similar functions in Statistics. Since that code is needed in particular to compute the weighted mean in Statistics, I figured it is better to make it public rather than only defining internal helpers for mean.

This is based on code from StatsBase, of which one part needs to live in LinearAlgebra since it depends on BLAS and dot.

Extracted from #31395.

stevengj · 2019-09-18T20:55:53Z

base/reducedim.jl

+If `dims` is provided, return an array of sums over these dimensions.
+If `weights` is provided, return the weighted sum(s). `weights` must be
+either an array of the same size as `A` if `dims` is omitted,
+or a vector with the same length as `size(A, dims)` if `dims` is provided.


Why should weights have to be an array, as opposed to just an iterable?

That's how things work currently in StatsBase. I guess we could make this more general at least for the simple case where dims=:. But that would require adding a separate method as @simd only works when using indices.

stevengj · 2019-09-18T20:57:24Z

I wonder whether this should be an iterator instead? i.e. you could imagine:

sum(Iterators.multiply(x, weights))

stevengj · 2019-09-18T20:59:15Z

base/reducedim.jl

+    s = add_sum(s0, s0)
+    @inbounds @simd for i in eachindex(A, w)
+        s += A[i] * w[i]
+    end


This will be much less accurate than the non-weighted algorithm based on pairwise summation.

Yes, probably (I haven't considered this, that's just a port of the StatsBase code with some cleanup). We could improve this, but note that for BlasReal the BLAS-optimized dot is called instead, which has different properties anyway (not sure whether it's going to be more accurate or less).

If we want to preserve accuracy, that would be an argument in favor of this PR, since a naive broadcast call won't use pairwise summation.
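To make the accuracy concern concrete, here is a minimal sketch contrasting the left-to-right loop quoted above with a pairwise (divide-and-conquer) weighted sum. The function names `wsum_naive` and `wsum_pairwise` and the `blocksize` cutoff are hypothetical and chosen for illustration; this is not the code from the PR or from StatsBase.

```julia
# Naive left-to-right weighted sum, mirroring the loop quoted above.
function wsum_naive(A::AbstractVector, w::AbstractVector)
    axes(A) == axes(w) || throw(DimensionMismatch("A and w must have the same axes"))
    s = zero(eltype(A)) * zero(eltype(w))
    @inbounds @simd for i in eachindex(A, w)
        s += A[i] * w[i]
    end
    return s
end

# Pairwise variant: split the index range in half, sum each half recursively,
# and fall back to a plain loop for short ranges.
# `blocksize` is an illustrative cutoff, not a tuned value.
function wsum_pairwise(A::AbstractVector, w::AbstractVector,
                       lo::Int=firstindex(A), hi::Int=lastindex(A);
                       blocksize::Int=128)
    axes(A) == axes(w) || throw(DimensionMismatch("A and w must have the same axes"))
    if hi - lo < blocksize
        s = zero(eltype(A)) * zero(eltype(w))
        @inbounds @simd for i in lo:hi
            s += A[i] * w[i]
        end
        return s
    end
    mid = (lo + hi) >>> 1
    return wsum_pairwise(A, w, lo, mid; blocksize=blocksize) +
           wsum_pairwise(A, w, mid + 1, hi; blocksize=blocksize)
end
```

Both functions compute the same quantity; the pairwise version only changes the order in which partial sums are combined, which is what limits the growth of rounding error for long floating-point inputs.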

stevengj · 2019-09-18T21:04:48Z

In particular, I could imagine an algorithm in which Iterators.multiply(x, weights) produces either:

An Iterators.Multiply(x, weights) object, similar to z[1]*z[2] for z in zip(x,weights) except having an eltype (allowing e.g. empty sums to work correctly).
If x and weights are both abstract arrays, then a MultipliedArray(x, weights) <: AbstractArray object for which getindex(_, i) produces x[i]*weights[i].

Both of these could then be passed to the existing sum function (as well as other functions, e.g. external packages like xsum), and in the second case it would do an accurate pairwise summation.

(It could also be made even more generic by passing a binary op(x,y) to the iterator.)

Seems like it would take less code than what you are doing now, would be more accurate (for arrays), and would be more composable.
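A rough sketch of the second option, assuming a wrapper type named `MultipliedArray` as in the comment above. The type does not exist in Base; the field names and the use of `Base.promote_op` to pick an element type are assumptions made for this illustration.

```julia
# Hypothetical lazy wrapper: element i is x[i] * w[i], computed on demand.
struct MultipliedArray{T,N,X<:AbstractArray{<:Any,N},W<:AbstractArray{<:Any,N}} <: AbstractArray{T,N}
    x::X
    w::W
    function MultipliedArray(x::AbstractArray{<:Any,N}, w::AbstractArray{<:Any,N}) where {N}
        axes(x) == axes(w) || throw(DimensionMismatch("x and w must have the same axes"))
        # Element type of a product, so that empty sums have a well-defined zero.
        T = Base.promote_op(*, eltype(x), eltype(w))
        return new{T,N,typeof(x),typeof(w)}(x, w)
    end
end

Base.size(m::MultipliedArray) = size(m.x)
Base.axes(m::MultipliedArray) = axes(m.x)

# Linear indexing is only fast if both parents support it.
Base.IndexStyle(::Type{<:MultipliedArray{T,N,X,W}}) where {T,N,X,W} =
    IndexStyle(IndexStyle(X), IndexStyle(W))

Base.@propagate_inbounds Base.getindex(m::MultipliedArray, I::Int...) =
    m.x[I...] * m.w[I...]
```

Because the wrapper is an ordinary `AbstractArray`, it can be handed to the existing `sum` (including `sum(m; dims=...)`) without any new `sum` methods, for example `sum(MultipliedArray(x, weights))`.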

stevengj · 2019-09-18T21:09:46Z

Or we could just wait for sum(@lazy x .* weights) à la #19198.

stevengj · 2019-09-18T21:12:28Z

But with regards to adding this functionality to sum specifically, why do we need that when we have dot(weights, x) or @views dot(weights, x[...]), since it seems like that is basically the functionality here?
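For reference, a small sketch of the `dot`-based alternative mentioned here, using the LinearAlgebra standard library. The matrix `A` and weight vector `w` are made-up values for illustration.

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
w = [10.0, 1.0]

# Weighted sum of the first column; A[:, 1] makes a copy.
col1 = dot(w, A[:, 1])

# Same for the second column, but @views avoids copying the slice.
col2 = @views dot(w, A[:, 2])
```

Note that `dot` conjugates its first argument, so for complex weights this is not the same as a plain sum of products.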

simeonschaub · 2019-09-18T21:36:12Z

@stevengj Wouldn't sum(Base.broadcasted(*, x, weights)) do the exact same thing?
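A minimal sketch of this suggestion, assuming a recent Julia version in which reductions accept a `Broadcasted` object. The explicit `Broadcast.instantiate` call is a precaution added for this example and the input values are made up.

```julia
x = [1.0, 2.0, 3.0]
weights = [0.5, 0.25, 2.0]

# Lazy representation of x .* weights; no intermediate array is allocated.
bc = Broadcast.instantiate(Broadcast.broadcasted(*, x, weights))

lazy_total = sum(bc)
eager_total = sum(x .* weights)   # allocates a temporary array first
```

Whether the lazy form uses pairwise summation depends on the Julia version, which is the point taken up in the later comments about #31020.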

nalimilan · 2019-09-19T08:16:09Z

As I wrote in the description, the main point of this PR is that we're going to add the weights keyword argument to many statistics functions (mean, var, cor, quantile...), so adding it to sum would be consistent. I'm not too attached to it, we can keep it internal for use by mean in Statistics.

Another reason is that the current behavior (implemented in StatsBase) differs from broadcast in that you provide a weights vector corresponding to each slice defined according to the dims argument: row weights when summing over columns, or vice-versa. So the Iterators.multiply(x, weights) or @lazy x .* weights approach won't be enough to replicate this behavior. I can't say how useful it is for sum (and mean), but that sounds logical, and again consistent with what is needed for e.g. cor (where an observation is a slice, not a single value).
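The difference from plain broadcasting can be sketched as follows. `wsum_dims` is a hypothetical helper written for this illustration, restricted to matrices; it is not the StatsBase implementation.

```julia
# Hypothetical helper: weighted sum of matrix `A` over dimension `dim`,
# with one weight per index along that dimension.
function wsum_dims(A::AbstractMatrix, w::AbstractVector, dim::Integer)
    dim in (1, 2) || throw(ArgumentError("dim must be 1 or 2"))
    length(w) == size(A, dim) ||
        throw(DimensionMismatch("w must have length size(A, dim)"))
    # Orient the weights along the reduced dimension before multiplying.
    wshaped = dim == 1 ? reshape(w, :, 1) : reshape(w, 1, :)
    return sum(A .* wshaped; dims=dim)
end

A = [1 2; 3 4]
w = [10, 1]

over_rows = wsum_dims(A, w, 1)       # weights apply to rows
over_cols = wsum_dims(A, w, 2)       # weights apply to columns
plain     = sum(A .* w; dims=2)      # broadcast always treats w as a column
```

With `dims=2`, the helper weights the columns, whereas the plain broadcast still weights the rows, so `over_cols` and `plain` differ.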

A final reason can be that the API from this PR allows using efficient and/or accurate algorithms under the hood, e.g. pairwise summation or BLAS-powered dot (as you noted above).

If we decide that the weighted sum/mean over dimensions use case isn't worth all that code, then yes, this PR isn't very useful.

tkf · 2019-09-19T09:41:55Z

> pairwise summation

This would be handled if we can use mapreduce on Broadcasted, right? This is implemented in #31020. It would be nice if we can reuse the code in #31020 here even if it were just for internals (#31020 is not about the surface syntax).

> BLAS-powered dot

In principle, we can handle sum(@lazy x .* y) specially to call BLAS.dot directly if x and y are both DenseArray etc., because the information required to do this is accessible in the type of the Broadcasted object. Not sure if it is worth the trouble, though.

nalimilan · 2019-09-19T09:58:26Z

> This [pairwise summation] would be handled if we can use mapreduce on Broadcasted, right?

Ah, yeah, I guess we could add special methods for that. Not sure whether that's OK given that it would give different results from the general algorithm, but since increased accuracy is a good thing maybe that's not a problem. The same applies to using dot.

oxinabox · 2019-10-24T12:21:42Z

We should have this.
It makes sense especially as sum is the generalization of count.

If one wants a weighted tolerance for a certain amount of missing data, then right now one can get a weighted relative confidence like:

mean(ismissing, xs, weights) < 0.1 || error("too much missing data > 10% under weighting")

But if you want to do this with an absolute count,
say 10 weighted missing values,
you can't.

sum would fix that.
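As a sketch of what the absolute version computes, the weighted count of missing values is just the sum of the weights at the missing positions. `weighted_count` is a hypothetical helper name and the data are made up; this shows the quantity, not the proposed `sum` API.

```julia
# Hypothetical helper: sum of the weights of the entries for which `f` is true.
weighted_count(f, xs, w) = sum(f.(xs) .* w)

xs = [1.0, missing, 3.0, missing, 5.0]
w  = [1.0, 4.0, 1.0, 8.0, 1.0]

n_missing = weighted_count(ismissing, xs, w)
too_many = n_missing > 10   # absolute threshold of 10 weighted missing values
```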

This generalizes over dimensions.
Real code I am using to drop rows with too much missing data:

    missings = ismissing.(X)  # no `mean(f, x, wv; dims)` function
    mask = vec(mean(missings, wv..., dims=1) .< 0.1)
    X[mask, :]

Currently harder to write the absolute tolerance version

StefanKarpinski · 2019-10-28T14:01:10Z

My worry is that this opens the door to adding weights support everywhere. Why can't this be done externally in say a Weights.jl package? The methods seem like they should be non-overlapping.

nalimilan · 2019-10-28T14:25:39Z

Well keyword arguments cannot be handled in packages unfortunately...

The current design in StatsBase relies on dispatch, with methods like sum(::AbstractArray, ::AbstractWeights), requiring people to do things like sum(x, weights(w)). That gives some weird things sometimes, e.g. var(x, weights(w)) is the weighted variance of x but cov(x, weights(w)) is the covariance of x and w, since w isa AbstractWeights <: AbstractArray (JuliaStats/StatsBase.jl#409).

Also, people have often complained that they expected e.g. sample(x, w) to work, but instead they had to write sample(x, weights(w)) (JuliaStats/StatsBase.jl#335). A keyword argument would be more standard and probably less confusing. Overall the problem is that for some signatures there's no ambiguity so we could accept any AbstractVector as weights, but for others we have to require AbstractWeights for dispatch. In the end the result is a bit inconsistent and confusing. Probably not the end of the world though. @ararslan will likely want to comment as he was a big supporter of the keyword argument approach.

oxinabox · 2019-10-28T14:39:56Z

I also support the kwarg approach.
It just feels right, weights are like dims.

tkf · 2019-10-28T14:42:34Z

> but cov(x, weights(w)) is the covariance of x and w

Maybe it was already considered, but why not use cov(weighted(x, w)) where weighted(x, w) is effectively a lazy x .* w?

StefanKarpinski · 2019-10-28T16:12:52Z

weighted(x, w) seems much more composable and can be implemented completely externally.

vtjnash · 2019-10-28T16:19:22Z

theoretically, sum(for x .* w) is even more composable (and concise)

StefanKarpinski · 2019-10-28T16:25:08Z

I disagree: it's often convenient to associate values and weights together a single time and thereafter use them as though they were just values (implicitly weighted). A lazy construct like for x .* w can be consumed only once and doesn't support indexing, etc.

nalimilan · 2019-10-29T21:30:51Z

As noted before, broadcasting wouldn't work as the weights vector is broadcast to the dimension over which the reduction is performed, rather than always treated as a column vector.

But weighted(w, x) is interesting. However, note that for operations on multiple arguments it would differ from the unweighted syntax: cov(x, y) would become cov(weighted(w, x, y)). I guess that would be OK if we consider that weighted does something like hcat. But if we really want it to return an AbstractArray object it will have to take an additional argument specifying whether weights apply to rows or columns when the argument is a matrix, which will imply a repetition like sum(weighted(w, x, dims=1), dims=1), unless we want to automatically assume that you want to reduce along that dimension (which is the most useful operation). Need to think about it.

pdeffebach · 2020-03-18T14:00:16Z

Is this getting merged? I support using keyword arguments instead of types, in which case this would need to live here as opposed to Statistics.

nalimilan · 2021-09-27T15:35:54Z

For the record, something I hadn't realized when I wrote my comments above: weighted(w, x) absolutely cannot be a lazy equivalent of w .* x, as weighting is equivalent to multiplication by weights only in very peculiar cases like sum and mean. For quantile you certainly don't want to multiply each entry by its weight, but rather repeat it (at least for frequency weights). For var, the bias correction depends on the kind of weight. And so on.
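A small sketch of why multiplication is the wrong model for quantiles under frequency weights, using the Statistics standard library. The data and the variable names are made up for illustration.

```julia
using Statistics

x  = [1, 2, 10]
fw = [3, 1, 1]   # frequency weights: the value 1 was observed three times

# Frequency weights mean "repeat each value fw[i] times".
expanded = reduce(vcat, fill.(x, fw))     # [1, 1, 1, 2, 10]

median_repeated   = median(expanded)      # median of the repeated data
median_multiplied = median(x .* fw)       # median of [3, 2, 10], a different quantity

# For sum, repetition and multiplication happen to coincide.
sums_agree = sum(expanded) == sum(x .* fw)
```

For `sum` the two views give the same answer, which is why a lazy `w .* x` looks sufficient there, but the medians differ.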

So the wrapper approach would only be used for dispatch, and the resulting object would implement almost no methods. Each function would have to opt-in to support it as appropriate in its particular case.

ViralBShah · 2022-09-06T03:13:01Z

Given that we want to move Statistics out (#46501), perhaps we should close this and adopt what we need in Statistics.jl?

stevengj reviewed Sep 18, 2019

stevengj added the domain:maths Mathematical functions label Sep 18, 2019

nalimilan changed the title ~~Add weights argument to sum~~ RFC: Add weights argument to sum Sep 19, 2019

nalimilan mentioned this pull request Sep 28, 2019

Import StatsBase into Statistics JuliaStats/Statistics.jl#2

Draft

21 tasks

tkf mentioned this pull request Nov 14, 2019

Add dropdims(f, args..; dims, kwargs..) for reductions to drop dims #33130

Closed

nalimilan mentioned this pull request May 27, 2020

WIP Progress on incorporating StatsBase into statistics. JuliaStats/Statistics.jl#31

Closed

This was referenced Sep 27, 2021

What to port from StatsBase JuliaStats/Statistics.jl#87

Open

Missing values and weighting JuliaStats/Statistics.jl#88

Open

nalimilan mentioned this pull request Mar 26, 2022

Weighted Arrays? JuliaStats/StatsBase.jl#776

Open

ViralBShah marked this pull request as draft September 6, 2022 03:13

nalimilan closed this Sep 6, 2022

nalimilan deleted the nl/wsum branch September 6, 2022 21:15
