Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unbiased squared distance covariance increases as sample size becomes large #353

Open
loremarchi opened this issue Feb 16, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@loremarchi
Copy link

loremarchi commented Feb 16, 2023

Dear team, first of all thank you for maintaining this useful package. I was trying to run the independence test by Shen et al. (2022), but I received some weird results, such as a rejection of independence even though I knew that the two one-dimensional vectors I was testing were independent. It was only when working with large samples (more than 70,000 observations) that I noticed something was off.

Although I have not yet identified the source of the problem in the code, I have a script available that reproduces the issue I am referring to. Please note that I have modified the method statistic in the dcorr.py to output both "stat" and "covar".

Reproducing code example:

t = hyppo.independence.Dcorr(); t.is_fast = True
n_samples = [100,1000,10000,50000,70000,100000]
for n in n_samples:
    U1 = np.random.rand(n,1)
    U2 = np.random.rand(n,1)
    S1 = np.sqrt(-2*np.log(U1))*np.cos(2*np.pi*U2)
    S2 = np.sqrt(-2*np.log(U1))*np.sin(2*np.pi*U2)
    print(f'current n is: {n}')
    print(f'Implementation according to R/original paper (unbiased squared distance covariance): {_r_distance_corr(S1, S2, mode = "squared_cov", unbiased = True)}')
    print(f'Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): {t.statistic(S1,S2)[1]}')
    print(f'Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): {t.statistic(S1,S2)[0]}\n')

Please, see that as the sample size increases (n = 70000 and n = 100000) the hyppo unbiased squared distance covariance becomes unintuitive.

Results

current n is: 100
Implementation according to R/original paper (unbiased squared distance covariance): 0.017224788665771484
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): 0.017225187792305974
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): 0.052031725347361675

current n is: 1000
Implementation according to R/original paper (unbiased squared distance covariance): 3.993511199951172e-05
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): 4.020496229051318e-05
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): 9.702530109483742e-05

current n is: 10000
Implementation according to R/original paper (unbiased squared distance covariance): 8.52346420288086e-05
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): 8.585070210065382e-05
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): 0.000211508104872705

current n is: 50000
Implementation according to R/original paper (unbiased squared distance covariance): -7.987022399902344e-06
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): -8.067143002499222e-06
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): -1.9942861479842804e-05

current n is: 70000
Implementation according to R/original paper (unbiased squared distance covariance): -6.318092346191406e-06
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): 4.23422066658965
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): 0.9132199516986869

current n is: 100000
Implementation according to R/original paper (unbiased squared distance covariance): 1.430511474609375e-06
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): 15.134514035440011
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): 0.9742237257727073

Version information

  • OS: Any
  • Python Version: Any
  • Package Version 0.3.2 and 0.4.0
@loremarchi loremarchi added the bug Something isn't working label Feb 16, 2023
@vibhorpro
Copy link

vibhorpro commented Jul 1, 2023

hello @loremarchi I reviewed your code and error and found out may be it is because of Implementation Differences or Statistical Power . To resolve this you may use another library or TRY THIS CODE:


import numpy as np
from scipy.spatial.distance import squareform, pdist

def _r_distance_corr(X, Y, mode="squared_cov", unbiased=True):
X = X.flatten()
Y = Y.flatten()
n = len(X)

a = squareform(pdist(X[:, np.newaxis]))
b = squareform(pdist(Y[:, np.newaxis]))

A = a - a.mean(axis=0)[np.newaxis, :] - a.mean(axis=1)[:, np.newaxis] + a.mean()
B = b - b.mean(axis=0)[np.newaxis, :] - b.mean(axis=1)[:, np.newaxis] + b.mean()

if mode == "squared_cov":
    cov = (A * B).mean()
    if unbiased:
        var_X = np.var(X, ddof=1)
        var_Y = np.var(Y, ddof=1)
        return cov / np.sqrt(var_X * var_Y)
    else:
        return cov

elif mode == "cov":
    cov = (A * B).mean()
    return cov

elif mode == "corr":
    corr = (A * B).mean() / np.sqrt((A ** 2).mean() * (B ** 2).mean())
    return corr

t = hyppo.independence.Dcorr()
t.is_fast = True

n_samples = [100, 1000, 10000, 50000, 70000, 100000]
for n in n_samples:
U1 = np.random.rand(n, 1)
U2 = np.random.rand(n, 1)
S1 = np.sqrt(-2 * np.log(U1)) * np.cos(2 * np.pi * U2)
S2 = np.sqrt(-2 * np.log(U1)) * np.sin(2 * np.pi * U2)
print(f'current n is: {n}')
print(f'Implementation according to R/original paper (unbiased squared distance covariance): {_r_distance_corr(S1, S2, mode="squared_cov", unbiased=True)}')
print(f'Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): {t.statistic(S1, S2)[1]}')
print(f'Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): {t.statistic(S1, S2)[0]}\n')


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants