[Feature Request]: Nan Values for correlation and cross correlation #14414

Closed
Jeanselme opened this issue Sep 3, 2019 · 3 comments

@Jeanselme

Would it be possible to automatically ignore NaN values when computing np.corrcoef or np.correlate? We could create a function like np.nan_correlate.

In the case of corrcoef it is straightforward: drop every position where either array has a NaN. In the convolution setting of np.correlate it is less clear, because the two series are shifted against each other, so a NaN lines up with a different element at each lag, and simply deleting it would change the alignment and create unwanted results.
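For the corrcoef case, the "ignore the NaN values of both arrays" idea can be sketched as below. This is only an illustration, not an existing NumPy function: the name `corrcoef_pairwise_complete` is hypothetical, and the sketch assumes two 1-D arrays of equal length.

```python
import numpy as np

def corrcoef_pairwise_complete(x, y):
    """Pearson correlation over positions where both inputs are non-NaN.

    Hypothetical helper for illustration; not part of NumPy.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Keep only the positions that are valid in both series.
    keep = ~(np.isnan(x) | np.isnan(y))
    return np.corrcoef(x[keep], y[keep])[0, 1]

a = np.array([0, 1, 2, 3, 4])
b = np.array([0, 1, np.nan, 3, 4])
r = corrcoef_pairwise_complete(a, b)
print(r)  # the remaining pairs are identical, so r is 1 (up to rounding)
```

Dropping positions is safe here because corrcoef only ever compares elements at the same index; the same trick does not carry over to np.correlate, where dropping an element would shift everything after it.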

Current Behavior

import numpy as np

a = np.array([0, 1, 2, 3, 4])
b = np.array([0, 1, np.nan, 3, 4])

np.correlate(a, b, 'full')

Returns array([ 0., 4., nan, nan, nan, nan, nan, 4., 0.])

In some cases it would be useful to return instead:
array([ 0., 4., 11., 18., 26., 14., 3., 4., 0.])

That is, any product involving a NaN is simply left out of the summation.
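One way to get this behavior today, as a minimal sketch: replace each NaN with 0 before calling np.correlate, so that every product involving a NaN contributes nothing to the sum. The helper name `correlate_ignoring_nan` is hypothetical (it is not a NumPy function), and the sketch assumes all non-NaN values are finite.

```python
import numpy as np

def correlate_ignoring_nan(x, y, mode="full"):
    """Cross-correlation where products involving NaN are skipped.

    Hypothetical helper for illustration; not part of NumPy.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # A zero contributes nothing to the sum, so zero-filling is
    # equivalent to leaving the NaN terms out of the summation.
    x0 = np.where(np.isnan(x), 0.0, x)
    y0 = np.where(np.isnan(y), 0.0, y)
    return np.correlate(x0, y0, mode)

a = np.array([0, 1, 2, 3, 4])
b = np.array([0, 1, np.nan, 3, 4])
result = correlate_ignoring_nan(a, b)
print(result)  # [ 0.  4. 11. 18. 26. 14.  3.  4.  0.]
```

With these inputs the result is not symmetric around the zero-lag term: the NaN sits only in b, so it removes a different product at each lag.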

@Harry-Kwon

I'd like to work on adding an optional argument to those functions to ignore NaN values.

Harry-Kwon added a commit to Harry-Kwon/numpy that referenced this issue Oct 26, 2019
Added a new user facing function nancov, which calculates the covariance
of variables while ignoring nan values. Partially addresses features
requested in issue numpy#14414 and improves upon PR numpy#14688.
@aleksejs-fomins

First of all, I would appreciate this functionality as a user. Thank you for bringing this up.

Secondly, please be careful when implementing this. For many applications, the interesting quantity is not the correlation itself, but the mean correlation corr(x, y) / len(x). So far, users have had to remove NaNs manually before processing, which is hard, but correct. If NaNs in the data are instead dropped implicitly by the correlate function, a user might unsuspectingly go on to divide the correlation by len(x), whereas they should divide only by the number of non-NaN terms in the sum. I therefore suggest two things: correlate should emit a warning if there are NaNs in the arguments, and it may make sense to implement a mean correlation function (e.g. correlate_mean(x, y, ...)) that divides each overlap sum by its non-NaN length.
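The normalization described above could look roughly like the sketch below. `correlate_mean` is the name suggested in this comment and does not exist in NumPy; returning the per-lag counts alongside the means is an assumption made here for illustration. The count of valid products at each lag is obtained by correlating the two validity masks.

```python
import numpy as np

def correlate_mean(x, y, mode="full"):
    """Mean of the non-NaN products at each lag, plus the per-lag counts.

    Hypothetical helper for illustration; not part of NumPy.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_ok = ~np.isnan(x)
    y_ok = ~np.isnan(y)
    # Sum of products with NaN terms zeroed out.
    sums = np.correlate(np.where(x_ok, x, 0.0), np.where(y_ok, y, 0.0), mode)
    # Number of products at each lag where both elements are valid.
    counts = np.correlate(x_ok.astype(float), y_ok.astype(float), mode)
    # Divide only where at least one valid product exists; NaN elsewhere.
    means = np.full_like(sums, np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)
    return means, counts

a = np.array([0, 1, 2, 3, 4])
b = np.array([0, 1, np.nan, 3, 4])
means, counts = correlate_mean(a, b)
print(counts)  # [1. 2. 2. 3. 4. 3. 2. 2. 1.]
print(means)
```

At zero lag, for example, the sum is 26 over 4 valid products, giving 6.5, whereas dividing by len(x) = 5 would give 5.2.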

@rossbar changed the title from "Nan Values for correlation and cross correlation" to "[Feature Request]: Nan Values for correlation and cross correlation" on Jul 23, 2020
@rossbar
Contributor

rossbar commented Jul 23, 2020

The addition of more nan* functions has been discussed and the current consensus is against doing so, so I will close this for now. If you are interested in pursuing the feature request, consider bringing it up on the mailing list (you can link to the issues for context).

@rossbar closed this as completed on Jul 23, 2020