[Feature Request]: Nan Values for correlation and cross correlation #14414
Comments
I'd like to work on adding an optional argument to those functions to ignore NaN values.
Added a new user-facing function, nancov, which calculates the covariance of variables while ignoring NaN values. Partially addresses the features requested in issue numpy#14414 and improves upon PR numpy#14688.
First of all, I would appreciate this functionality as a user. Thank you for bringing this up.

Secondly, please be careful when implementing this. For many applications, the interesting quantity is not the correlation coefficient itself but the mean correlation coefficient, corr(x, y) / len(x). So far, users have removed NaNs manually before processing, which is tedious but correct. If a user has NaNs in the data and the correlate function drops them implicitly, the user might unsuspectingly divide the correlation by len(x), whereas they should divide only by the number of non-NaN terms in the sum.

So, firstly, I suggest that correlate should emit a warning if there are NaNs in the arguments. Secondly, perhaps it makes sense to implement a mean correlation function (e.g. correlate_mean(x, y, ...)), which would divide the overlap sum by its non-NaN length.
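The normalization concern above can be illustrated with a minimal sketch. The name `nan_correlate_mean` is hypothetical and not part of NumPy; the sketch assumes that a NaN term should contribute nothing to the sum and that each lag should be divided by the number of pairs in which both samples are valid. That count is obtained by correlating the two validity masks.

```python
import numpy as np

def nan_correlate_mean(a, v, mode="full"):
    """Hypothetical helper (not a NumPy API): mean cross-correlation per lag,
    dividing each lag's sum by its count of non-NaN pairs."""
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    a_ok = ~np.isnan(a)
    v_ok = ~np.isnan(v)

    # Sum over valid pairs only: NaN entries are replaced by 0 so they add nothing.
    sums = np.correlate(np.where(a_ok, a, 0.0), np.where(v_ok, v, 0.0), mode=mode)

    # Correlating the 0/1 validity masks counts, for each lag, the pairs
    # in which both samples are non-NaN.
    counts = np.correlate(a_ok.astype(float), v_ok.astype(float), mode=mode)

    # Lags with no valid pair are reported as NaN rather than dividing by zero.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)

x = np.array([1.0, 2.0, np.nan, 4.0])
mean_corr = nan_correlate_mean(x, x)
```

Dividing by `len(x)` instead of the per-lag count would understate every lag whose overlap contains a NaN, which is the mistake the comment warns about.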
The addition of more |
Would it be possible to automatically ignore NaN values when computing np.corrcoef or np.correlate? We could create a function such as np.nan_correlate.
In the case of corrcoef this is straightforward: it can be solved by ignoring every position where either array is NaN. In the convolution setting, however, the NaN values fall at different lags in the two series, so simply dropping them would misalign the data and create unwanted results.
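For the corrcoef case, a minimal sketch of the pairwise-deletion idea follows. The helper name `nan_corrcoef_pair` is hypothetical and not a NumPy function; it assumes two 1-D arrays of equal length and keeps only the positions where both values are present.

```python
import numpy as np

def nan_corrcoef_pair(x, y):
    """Hypothetical helper (not a NumPy API): Pearson correlation of two
    equal-length 1-D arrays, using only positions where neither is NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))  # True where both samples are valid
    return np.corrcoef(x[keep], y[keep])[0, 1]

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, np.nan, 10.0])

r_plain = np.corrcoef(x, y)[0, 1]   # NaN propagates through the means
r_masked = nan_corrcoef_pair(x, y)  # uses the three complete pairs only
```

The same masking would not carry over to np.correlate, because deleting elements changes the position of every later sample and therefore the lag at which pairs are compared.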
Current Behavior
Returns array([ 0., 4., nan, nan, nan, nan, nan, 4., 0.])
It would be useful in some cases to return:
Returns array([ 0., 4., 11., 18., 26., 18., 11., 4., 0.])
This simply ignores any NaN in the summation.
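A minimal sketch of the requested behaviour, assuming that "ignoring" a NaN means treating any product that involves it as zero. The name `nan_correlate` and the input array are illustrative assumptions; they are not the arrays that produced the outputs quoted above, and no such function exists in NumPy.

```python
import numpy as np

def nan_correlate(a, v, mode="full"):
    """Hypothetical helper (not a NumPy API): cross-correlation in which
    any term involving a NaN is skipped, i.e. contributes 0 to the sum."""
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    # Replace NaN by 0 so the affected products vanish from each lag's sum.
    a0 = np.where(np.isnan(a), 0.0, a)
    v0 = np.where(np.isnan(v), 0.0, v)
    return np.correlate(a0, v0, mode=mode)

x = np.array([1.0, 2.0, np.nan, 4.0])  # illustrative input, chosen for this sketch

plain = np.correlate(x, x, mode="full")  # NaN at every lag whose overlap touches x[2]
ignored = nan_correlate(x, x)            # finite at every lag
```

Because the alignment of the two series is left untouched, the lags keep their meaning; only the number of terms contributing to each lag changes.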