
ENH: Add optional parameter to ignore nan values in np.cov and np.corrcoef #14688

Closed
wants to merge 4 commits

Conversation

Harry-Kwon

Addresses one of the features suggested in issue #14414.

Adds an optional parameter ignore_nan=False to np.cov and np.corrcoef. When ignore_nan=True, any observation in which any variable is np.nan is ignored.

Example case:

a = np.array([0, 1, np.nan, 0])
b = np.array([1, 0, 1, 1])
np.corrcoef(a, b, ignore_nan=True)

Previous output (without ignore_nan):

array([[nan, nan],
       [nan,  1.]])

Added behavior (with ignore_nan=True):

array([[ 1., -1.],
       [-1.,  1.]])
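
For reference, the same result can be reproduced with the current NumPy release by dropping nan observations by hand before calling np.corrcoef; a minimal sketch, not part of this PR:

import numpy as np

a = np.array([0, 1, np.nan, 0])
b = np.array([1, 0, 1, 1])

# keep only the observations where both variables are non-nan
mask = ~np.isnan(a) & ~np.isnan(b)
np.corrcoef(a[mask], b[mask])
# array([[ 1., -1.],
#        [-1.,  1.]])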

@eric-wieser
Member

eric-wieser commented Oct 13, 2019

Are you aware of np.nancov?

Perhaps this feature should be called nancov

@Harry-Kwon
Author

> Are you aware of np.nancov?
>
> Perhaps this feature should be called nancov

I'll work on adding an np.nancov function with an option to either ignore entire observations with a nan value or to use, for each pair of variables, only the rows with no nan values in that pair.

@bashtage
Contributor

nancov also seems clearer to me.

The hard part of designing multivariate nan-aware APIs is deciding how nans should be treated when they are not uniform across variables. If I have three variables

x = np.array([
  [1, np.nan, 2],
  [np.nan, 3, 1],
  [5, 2, np.nan],
  [9, 4, 2],
  [3, 8, 9]
])

and I call nancorr(x.T), what should I get? Should I get an array of 1s, since there are only 2 rows that have all non-missing values? Or should I get

  1.00000 -0.500000 -0.277350
 -0.50000  1.000000  0.997176
 -0.27735  0.997176  1.000000

which is the pairwise nan-dropped correlation? This is what pandas returns, for example. Similarly, should nancov operate element by element, producing nanvar(axis=0) along the diagonal and the pairwise covariance estimates on the off-diagonals (also the pandas behavior)?
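
For reference, the pairwise matrix above can be reproduced with pandas, whose DataFrame.corr drops nans pair by pair rather than row by row; a minimal sketch, assuming pandas is available:

import numpy as np
import pandas as pd

x = np.array([
  [1, np.nan, 2],
  [np.nan, 3, 1],
  [5, 2, np.nan],
  [9, 4, 2],
  [3, 8, 9]
])

# each entry is computed from the rows where both columns are non-nan
pd.DataFrame(x).corr()
#           0         1         2
# 0  1.000000 -0.500000 -0.277350
# 1 -0.500000  1.000000  0.997176
# 2 -0.277350  0.997176  1.000000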

@Harry-Kwon
Author

@bashtage I was thinking of making the default behavior use only rows with all non-missing values, and adding an optional pairwise parameter to operate element by element, which is what I think MATLAB's nancov does.

So given

x = np.array([
  [1, np.nan, 2],
  [np.nan, 3, 1],
  [5, 2, np.nan],
  [9, 4, 2],
  [3, 8, 9]
])

Calling nancov(x.T) would be the same as calling

nancov(np.array([
  [9, 4, 2],
  [3, 8, 9]
]).T)

Whereas nancov(x.T, pairwise=True) would give the pairwise covariance between each pair of variables.
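
A minimal sketch of that default (complete-row) behavior in terms of the current np.cov, assuming rows of x are observations as in the example above:

import numpy as np

x = np.array([
  [1, np.nan, 2],
  [np.nan, 3, 1],
  [5, 2, np.nan],
  [9, 4, 2],
  [3, 8, 9]
])

# drop every observation (row of x) that contains a nan, then proceed as usual
mask = ~np.isnan(x).any(axis=1)
np.cov(x[mask].T)   # equivalent to the proposed nancov(x.T)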

non-nan values

If all rows in x have at least one nan value, np.cov(x, ignore_nan=True)
will return an n-by-n array of nans, where n is the number of
rows (variables) in x.
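
As an illustration of that edge case, the same row-dropping emulation leaves no observations, and np.cov of the empty selection produces an all-nan matrix, along with RuntimeWarnings such as "Degrees of freedom <= 0"; a minimal sketch, not part of this PR:

import numpy as np

x = np.array([[1.0, np.nan],
              [np.nan, 2.0]])      # rows are variables; every observation (column) contains a nan

mask = ~np.isnan(x).any(axis=0)    # observations with no nan -> none remain
np.cov(x[:, mask])                 # 2-by-2 array of nan, with RuntimeWarnings
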
@Harry-Kwon
Author

I've pushed my version of nancov to this branch. I would greatly appreciate some input on the documentation and default behavior from someone who understands the subject and use cases better than I do. (Feel free to fork the branch and make a new PR if needed.)

Harry-Kwon added a commit to Harry-Kwon/numpy that referenced this pull request Oct 26, 2019
Added a new user-facing function nancov, which calculates the covariance
of variables while ignoring nan values. Partially addresses features
requested in issue numpy#14414 and improves upon PR numpy#14688.
@Harry-Kwon Harry-Kwon closed this Oct 26, 2019