A new correlation coefficient (Chatterjee) #314

Open
vincentarelbundock opened this issue Apr 11, 2024 · 8 comments
Labels
feature idea 🔥 New feature or request

Comments

@vincentarelbundock

Have not read yet, but this looks fun: https://arxiv.org/pdf/1909.10140.pdf

[Screenshot from the paper: chatterjee]

@IndrajeetPatil transferred this issue from easystats/easystats on Apr 11, 2024
@IndrajeetPatil added the feature idea 🔥 New feature or request label on Apr 11, 2024
@bwiernik
Contributor

Neat! Looks straightforward!

@mattansb
Member

From here

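ksaai <- function(X, Y, ties = TRUE){
  n <- length(X)
  # Seed before the random tie-breaking so repeated calls agree
  # (note: this resets the global RNG state on every call)
  set.seed(42)
  # Ranks of Y after sorting the pairs by X; ties are broken at random
  r <- rank(Y[order(X)], ties.method = "random")
  if(ties){
    # Tie-aware version: normalise by the "max" ranks of Y
    l <- rank(Y[order(X)], ties.method = "max")
    return( 1 - n*sum( abs(r[-1] - r[-n]) ) / (2*sum(l*(n - l))) )
  } else {
    # No-ties version: closed-form normalising constant
    return( 1 - 3 * sum( abs(r[-1] - r[-n]) ) / (n^2 - 1) )
  }
}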

I don't like that it's not symmetrical - shouldn't correlation coefficients be symmetrical?

x <- rnorm(100, sd = 4)
y <- sin(x) + rnorm(100, sd = 0.2)

plot(x, y)

ksaai(x, y)
#> [1] 0.6306631
ksaai(y, x)
#> [1] -0.1710171

Also the maximal value isn't 1 and seems to depend on the sample size?

z10 <- runif(10)
z100 <- runif(100)
z1000 <- runif(1000)

ksaai(z10, z10)
#> [1] 0.7272727
ksaai(z100, z100)
#> [1] 0.970297
ksaai(z1000, z1000)
#> [1] 0.997003

Created on 2024-04-14 with reprex v2.1.0
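
The sample-size dependence above can be worked out by hand for the no-ties formula. When Y is a strictly increasing function of X, consecutive ranks differ by exactly 1, so the sum of absolute rank differences is n - 1 and the coefficient reduces to 1 - 3(n - 1)/(n^2 - 1) = 1 - 3/(n + 1). A minimal sketch of that check follows; `xi_no_ties` and `xi_max_no_ties` are hypothetical helper names introduced here, not part of any package.

```r
# No-ties formula from the ksaai() code above, written with diff().
xi_no_ties <- function(x, y) {
  n <- length(x)
  r <- rank(y[order(x)])                  # ranks of y after sorting pairs by x
  1 - 3 * sum(abs(diff(r))) / (n^2 - 1)
}

# Hypothetical helper: value of the no-ties formula for a perfectly
# increasing relationship, i.e. 1 - 3 / (n + 1).
xi_max_no_ties <- function(n) 1 - 3 / (n + 1)

# Ten distinct values, so there are no ties.
z <- c(0.3, 0.1, 0.9, 0.5, 0.7, 0.2, 0.8, 0.4, 0.6, 0.05)

xi_no_ties(z, z)                          # equals 1 - 3/11
sapply(c(10, 100, 1000), xi_max_no_ties)  # the three values reported above
```

So the finite-sample ceiling approaches 1 at rate 3/(n + 1), which matches 0.7272727, 0.970297 and 0.997003 for n = 10, 100 and 1000.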

@vincentarelbundock
Author

Your note about sample size is presumably what he means by "converges to a limit" in point 4 of the screenshot in my original post. Since there's theory to provide confidence intervals, maybe that's not a big deal? Maybe even good?

And on symmetry:

(1) Unlike most coefficients, ξn is not symmetric in X and Y. But that is intentional. We would like to keep it that way because we may want to understand if Y is a function of X, and not just if one of the variables is a function of the other. If we want to understand whether X is a function of Y, we should use ξn(Y, X) instead of ξn(X, Y). A symmetric measure of dependence, if required, can be easily obtained by taking the maximum of ξn(X, Y) and ξn(Y, X).
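
The symmetric variant described in the quote could be sketched as below. This is an illustration only: `xi_no_ties` is the no-ties formula from the code earlier in the thread, `xi_symmetric` is a hypothetical helper name, and the data are chosen to contain no ties.

```r
# No-ties version of the coefficient (same formula as in ksaai() above).
xi_no_ties <- function(x, y) {
  n <- length(x)
  r <- rank(y[order(x)])                  # ranks of y after sorting pairs by x
  1 - 3 * sum(abs(diff(r))) / (n^2 - 1)
}

# Hypothetical helper: symmetric measure as the maximum of both directions.
xi_symmetric <- function(x, y) max(xi_no_ties(x, y), xi_no_ties(y, x))

# y is a noiseless function of x, but x is not a function of y.
x <- seq(-1, 2, length.out = 50)
y <- x^2

xi_no_ties(x, y)    # "is y a function of x?"  (large)
xi_no_ties(y, x)    # "is x a function of y?"  (noticeably smaller)
xi_symmetric(x, y)  # same value whichever argument comes first
```

Taking the maximum discards the direction information, which is the part the author says is intentional, so a symmetric helper would presumably be an option rather than a default.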

@mattansb
Member

Cool 👍

  1. I don't see any mention of a confidence interval - should we just use Fisher's Z?
  2. In theory, xi is non-negative, but the estimate is sometimes negative - should we return 0 in such cases?

@vincentarelbundock
Author

I don’t see any mention of a confidence interval

Sorry, I misread about the CI. The XICOR package does provide a SD, but it feels wrong to just compute a symmetric interval using that.

should we just use Fisher’s Z?

I’ve only really skimmed the paper, and don’t truly understand it. Until I grok this better (realistically: never), I would be reluctant to report a quantity not explicitly endorsed by the author.

In theory, xi is non-negative, but the estimate is sometimes negative - should we return 0 in such cases?

“In the limit” != “In theory”. I’d say report the actual output of the equation, rather than an ad hoc hack.

I ran into some errors with your ksaai() function with large N. However, the paper's author has published an XICOR package on CRAN. It seems fast and is published under the Apache License which, I believe, is compatible with GPL3.

library(XICOR)
N <- 100
x <- rnorm(N, sd = 4)
y <- sin(x) + rnorm(N, sd = 0.2)
xicor(y, x, pvalue = TRUE)

    $xi
    [1] 0.03840384

    $sd
    [1] 0.06325978

    $pval
    [1] 0.2718984

@mattansb
Member

In theory == I mean the estimand is non-negative.

I'll run some simulations to see if the Fisher Z CIs work well enough.

@bwiernik
Contributor

The author did a small simulation in section 4.2 and concluded that sqrt(n) * xi is asymptotically normal (when n = 1000). That's not unexpected, but also not very helpful for more realistic sample sizes.

The author's XICOR package defaults to using the specified mean and SD values with a normal distribution. They also offer a permutation test.

I'd be okay with reporting normal-theory intervals and p values to start given that's what the author does, but we should ideally do some simulations to confirm good performance of the intervals at smaller n (or use a z transform if that works nicely).
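
A rough sketch of what such a small-n check might look like under independence is below. It assumes the asymptotic null result as I understand it from the paper, namely that sqrt(n) * xi is approximately N(0, 2/5) when X and Y are independent and Y is continuous; the sd of about 0.063 that XICOR reports above for N = 100 is close to sqrt(2 / (5 * 100)). The helper `xi_no_ties` and the choices n = 20 and 2000 replications are illustrative, and this looks only at the null distribution, not at interval coverage under dependence.

```r
# No-ties version of the coefficient (same formula as in ksaai() above).
xi_no_ties <- function(x, y) {
  n <- length(x)
  r <- rank(y[order(x)])                  # ranks of y after sorting pairs by x
  1 - 3 * sum(abs(diff(r))) / (n^2 - 1)
}

set.seed(1)
n    <- 20      # illustrative "realistic" sample size
reps <- 2000    # illustrative number of replications

# Draw independent X and Y repeatedly and collect the coefficient.
xi_null <- replicate(reps, xi_no_ties(rnorm(n), rnorm(n)))

# Compare the empirical spread of sqrt(n) * xi with the assumed asymptotic SD.
c(empirical_sd = sd(sqrt(n) * xi_null), asymptotic_sd = sqrt(2 / 5))

# Share of null draws rejected by a one-sided normal-theory test at the 5% level.
mean(sqrt(n) * xi_null / sqrt(2 / 5) > qnorm(0.95))
```

If the rejection share sits far from 0.05 at small n, that would be an argument for the permutation test or a transform instead of the plain normal approximation.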

I didn't compare the code above from the blog post with the XICOR package to be sure they align, but we should follow XICOR: https://github.com/cran/XICOR/blob/master/R/xicor.R

@TarandeepKang

Hi All,

Just to mention that this preprint has now been published:

Chatterjee, S. (2021). A New Coefficient of Correlation. Journal of the American Statistical Association, 116(536), 2009–2022. https://doi.org/10.1080/01621459.2020.1758115
