
ENH: Add optional parameter to ignore nan values in np.cov and np.corrcoef #14688

Closed
wants to merge 4 commits

Conversation

Harry-Kwon

Addresses one of the features suggested in issue #14414.

Adds an optional parameter ignore_nan=False to np.cov and np.corrcoef. When ignore_nan=True, any observation in which any variable is np.nan is ignored.

Example case:

a = np.array([0, 1, np.nan, 0])
b = np.array([1, 0, 1, 1])
np.corrcoef(a, b, ignore_nan=True)

Previous output (without ignore_nan):

array([[nan, nan],
       [nan,  1.]])

Added behavior (with ignore_nan=True):

array([[ 1., -1.],
       [-1.,  1.]])
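
For reference, the same result can be reproduced with the current NumPy release by dropping nan observations by hand before calling np.corrcoef; a minimal sketch, not part of this PR:

import numpy as np

a = np.array([0, 1, np.nan, 0])
b = np.array([1, 0, 1, 1])

# keep only the observations where both variables are non-nan
mask = ~np.isnan(a) & ~np.isnan(b)
np.corrcoef(a[mask], b[mask])
# array([[ 1., -1.],
#        [-1.,  1.]])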

@eric-wieser
Member

eric-wieser commented Oct 13, 2019

Are you aware of np.nancov?

Perhaps this feature should be called nancov

@Harry-Kwon
Author

> Are you aware of np.nancov?
>
> Perhaps this feature should be called nancov

I'll work on adding an np.nancov function with an option to either ignore entire observations with a nan value or to use, for each pair of variables, only the rows with no nan values in that pair.

@bashtage
Contributor

nancov also seems clearer to me.

The hard part of designing multivariate nan-aware APIs is deciding how nans should be treated when they are not uniform across variables. If I have three variables

x = np.array([
  [1, np.nan, 2],
  [np.nan, 3, 1],
  [5, 2, np.nan],
  [9, 4, 2],
  [3, 8, 9]
])

and I call nancorr(x.T), what should I get? Should I get an array of 1s, since there are only 2 rows that have all non-missing values? Or should I get

  1.00000 -0.500000 -0.277350
 -0.50000  1.000000  0.997176
 -0.27735  0.997176  1.000000

which is the pairwise nan-dropped correlation? This is what pandas returns, for example. Similarly, should nancov operate element by element, producing nanvar(axis=0) along the diagonal and the pairwise covariance estimates on the off-diagonals (also the pandas behavior)?
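
For reference, the pairwise matrix above can be reproduced with pandas, whose DataFrame.corr drops nans pair by pair rather than row by row; a minimal sketch, assuming pandas is available:

import numpy as np
import pandas as pd

x = np.array([
  [1, np.nan, 2],
  [np.nan, 3, 1],
  [5, 2, np.nan],
  [9, 4, 2],
  [3, 8, 9]
])

# each entry is computed from the rows where both columns are non-nan
pd.DataFrame(x).corr()
#           0         1         2
# 0  1.000000 -0.500000 -0.277350
# 1 -0.500000  1.000000  0.997176
# 2 -0.277350  0.997176  1.000000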

@Harry-Kwon
Author

@bashtage I was thinking of making the default behavior use only rows with all non-missing values, and adding an optional pairwise parameter to operate element by element, which is what I think MATLAB's nancov does.

So given

x = np.array([
  [1, np.nan, 2],
  [np.nan, 3, 1],
  [5, 2, np.nan],
  [9, 4, 2],
  [3, 8, 9]
])

Calling nancov(x.T) would be the same as calling

nancov(np.array([
  [9, 4, 2],
  [3, 8, 9]
]).T)

Whereas nancov(x.T, pairwise=True) would give the pairwise covariance between each pair of variables.
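
A minimal sketch of that default (complete-row) behavior in terms of the current np.cov, assuming rows of x are observations as in the example above:

import numpy as np

x = np.array([
  [1, np.nan, 2],
  [np.nan, 3, 1],
  [5, 2, np.nan],
  [9, 4, 2],
  [3, 8, 9]
])

# drop every observation (row of x) that contains a nan, then proceed as usual
mask = ~np.isnan(x).any(axis=1)
np.cov(x[mask].T)   # equivalent to the proposed nancov(x.T)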

non-nan values

If all rows in x have at least one nan value, np.cov(x, ignore_nan=True)
will return an n-by-n array of nans, where n is the number of
rows (variables) in x.
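
As an illustration of that edge case, the same row-dropping emulation leaves no observations, and np.cov of the empty selection produces an all-nan matrix, along with RuntimeWarnings such as "Degrees of freedom <= 0"; a minimal sketch, not part of this PR:

import numpy as np

x = np.array([[1.0, np.nan],
              [np.nan, 2.0]])      # rows are variables; every observation (column) contains a nan

mask = ~np.isnan(x).any(axis=0)    # observations with no nan -> none remain
np.cov(x[:, mask])                 # 2-by-2 array of nan, with RuntimeWarnings
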
@Harry-Kwon
Author

I've pushed my version of nancov to this branch. I would greatly appreciate some input on the documentation and default behavior from someone who understands the subject and use cases better than I do. (Feel free to fork the branch and make a new PR if needed.)

Harry-Kwon added a commit to Harry-Kwon/numpy that referenced this pull request Oct 26, 2019
Added a new user-facing function nancov, which calculates the covariance
of variables while ignoring nan values. Partially addresses features
requested in issue numpy#14414 and improves upon PR numpy#14688.
@Harry-Kwon Harry-Kwon closed this Oct 26, 2019