Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make pairwise multi-threaded and make corkendall and corspearman wrap pairwise. #923

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

PGS62
Copy link
Contributor

@PGS62 PGS62 commented Mar 31, 2024

This pull request follows up on #849. In that discussion I envisaged amending corkendall to be multi-threaded. Instead I have done the following:

  1. Made pairwise and pairwise! multi-threaded, and in the case of skipmissing = :pairwise, greatly reduced allocations which is needed for good multi-threaded performance. A nice illustration of the performance improvements this leads to is obtained with f = LinearAlgebra.dot. The example here shows a 31.5 times speedup on a 12 core PC.
  2. corkendall now simply wraps pairwise but with added methods of _pairwise! for typeof(corkendall) to retain the speedups to corkendall that were a result of Speeding up Kendall Correlation corkendall #634.
  3. corspearman also wraps pairwise with added methods of _pairwise! for typeof(corspearman). The purpose of these added methods is to reduce the number of calls to tiedrank from 2nm to 2(n+m) when n and m are respectively the number of columns in the arguments x and y to the call to corspearman(x,y....

Since corkendall and corspearman now wrap pairwise they have a skipmissing argument, which introduces an additional inconsistency between corkandall/corspearman and cov/cor.

I developed the new code in my package KendallTau. Please see the README file for further examples of the performance improvements achieved, which for all but the smallest input data exceed the number of cores on the PC (thanks to reduced allocations).

I have amended the tests as needed, and I believe that test coverage is as close to 100% as possible.

I apologise that there is quite a lot for maintainers to review, and I'll be happy to answer questions.

Copy link
Member

@andreasnoack andreasnoack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great. Thanks for contributing such a PR. I'm happy with it but lets keep if open for a couple of days to see if anybody else would like to comment.

test/pairwise.jl Outdated Show resolved Hide resolved
src/rankcorr.jl Outdated
Xjrank = tiedrank(Xj)
C[j,1] = cor(Xjrank, yrank)
function corspearman(x::AbstractMatrix, y::AbstractVector; skipmissing::Symbol=:none)
return corspearman(x, reshape(y, (length(y), 1)); skipmissing)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the 1.0 tests are failing because of this way of passing the keyword argument. I'd be okay with bumping the minimum version for StatsBase since 1.0 is ancient but maybe it's better to just make it skipmissing=skipmissing here for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When writing and testing the code (in KendallTau) I was unaware that support for Julia 1.0.5 was necessary, so the test failures when I submitted the PR came as a surprise. Changing how keyword arguments are handled would be easy, as would avoiding trailing backslashes within string literals. But the fact that 1.0.5 does not support eachcol is a bit more difficult. eachcol was added in 1.1. For the time being I will work on responding to @devmotion's comments and wait for your guidance as to whether bumping the minimum version might happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some more investigation:
Other popular packages (the compat section of Project.toml) have the following minimum versions:
1.9.4 for Statistics
1.6 for Plots, DataFrames, Revise, Optim, ForwardDiff, OhMyREPL, CSV
1.3 for Distributions
1 for StatsBase
So upping the minimum version for StatsBase seems sensible.

But as I remarked above, eachcol is crucial to how corkendall and corspearman are mere wrappers to pairwise. eachcol was introduced in Julia 1.1, so I imagined that upping the minimum Julia version for StatsBase to 1.1 would suffice. But no - the code appears to require Julia 1.9.4, and as of now does not work with Julia 1.8.5. For example the method check_vectors fails because:

# Julia 1.9.4
julia> keys((eachcol([1 2;3 4]))[2])
2-element LinearIndices{1, Tuple{Base.OneTo{Int64}}}:
 1
 2
# Julia 1.8.5
julia> keys((eachcol([1 2;3 4]))[2])
ERROR: MethodError: no method matching getindex(::Base.Generator{Base.OneTo{Int64}, Base.var"#242#243"{Matrix{Int64}}}, ::Int64)
Stacktrace:
 [1] top-level scope
   @ REPL[1]:1

So the alternatives are:

  • A. We increase the minimum version to 1.9.4, which happens to be that for Statistics, but is more modern than the current LTS.
  • B. I find a workaround so that this PR requires either no increase in the minimum version number or an increase to 1.6 which is more in line with other popular packages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's now the case that increasing the minimum version to 1.6 would work. See here.

.vscode/settings.json Outdated Show resolved Hide resolved
src/pairwise.jl Outdated
Comment on lines 14 to 15
#cor and friends are special-cased.
iscor = (f in (corkendall, corspearman, cor))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit unfortunate that this means other functions can't benefit from the improvements (?) since it is not extendable. Maybe an additional argument would be better?

Copy link
Contributor Author

@PGS62 PGS62 Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I check that you've understood the purpose of the iscor variable (which might be better named diag_is_1)?

It's a Boolean that answers the question "Does x === y imply f(x,y) == 1.0, irrespective of NaN, missing and Inf entries, and that therefore there is a speedup to be had by not bothering to call f when x[i] === y[i]".

The upshot is that if f was some new correlation estimator for which the answer to the question above was "yes" but where we failed to update line 15 then we'd miss out on the "diagonal elements are near-zero-cost to calculate" but still have the benefit of multithreading. Of course, an additional argument is possible, but it makes the interface to pairwise a bit more complicated.

There's also the fact that the author of funky_cor could simply include the line x===y && return 1.0 in its definition, which remark has got me wondering whether I should do that for corspearman and corkendall...

src/pairwise.jl Outdated Show resolved Hide resolved
src/pairwise.jl Outdated Show resolved Hide resolved
src/pairwise.jl Show resolved Hide resolved
src/pairwise.jl Show resolved Hide resolved
src/pairwise.jl Show resolved Hide resolved
src/pairwise.jl Show resolved Hide resolved
src/pairwise.jl Outdated Show resolved Hide resolved
src/pairwise.jl Outdated
[26, 25, 16, 15, 6, 5]
```
"""
function equal_sum_subsets(n::Int, num_subsets::Int)::Vector{Vector{Int}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case this function is kept (see comments above), is it necessary to return a vector of vectors or could one return an iterator of vectors instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reasonably sure that equal_sum_subsets works well in the sense of doing a good job of load balancing. The only reason it currently returns a vector of vectors is that's what I know how to do, so I would need to read up on iterators.

Copy link
Contributor Author

@PGS62 PGS62 Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit a1d9ae9 I have replaced the function equal_sum_subsets with an iterator EqualSumSubsets whose element type is another iterator TwoStepRange. It seems to work well and leads to a small reduction in allocations. But please review as this is the first time I've written code to create an iterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't work on 32 bit. Maybe I need to replace use of Int64 with Int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 951b391 fixes EqualSumSubsets on 32-bit.

@PGS62
Copy link
Contributor Author

PGS62 commented Apr 5, 2024

I've now made changes so that pairwise, corkendall and corspearman work with all tests passing on Julia 1.6 or later. I've also changed ci.yml to test against 1.6 instead of 1.0 and the compat section of Project.toml now puts the minimum Julia version as 1.6.

Would this meet with approval from the StatsBase maintainers?

@PGS62
Copy link
Contributor Author

PGS62 commented Apr 10, 2024

The failed checks after commit [46d5655] are all in the codecov step, with error:

Commit creating failed: {"detail":"Tokenless has reached GitHub rate limit. Please upload using a token: https://docs.codecov.com/docs/adding-the-codecov-token. Expected available in 1549 seconds."}

IIUC (this page) I need to wait and try again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants