groupBy method ignores the rejectNA option #205

Open
eepstein opened this issue Jul 2, 2013 · 2 comments

eepstein commented Jul 2, 2013

This seems to be a problem with how sum(), and in turn mean() and possibly other methods, are implemented. They don't seem to detect non-numeric values except when one is the very first element of an array.

Use case: grouping across rows where some rows have null (or NaN) values for certain columns. The average should be taken over the non-null, numeric values only.
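To make the expected behavior concrete, here is a minimal sketch in plain JavaScript. The helper names (`isNumeric`, `sumNumeric`, `meanNumeric`) are hypothetical and not part of the library's API. The point is only that every element gets checked, not just the first one.

```javascript
// Hypothetical helpers for illustration; not the library's implementation.

// True only for finite numbers, so null, undefined, NaN and strings are rejected.
function isNumeric(value) {
  return typeof value === "number" && Number.isFinite(value);
}

// Sum that tests every element rather than only the first.
function sumNumeric(values) {
  let total = 0;
  for (const value of values) {
    if (isNumeric(value)) total += value;
  }
  return total;
}

// Mean over the numeric values only; null when nothing numeric is present.
function meanNumeric(values) {
  const numeric = values.filter(isNumeric);
  if (numeric.length === 0) return null;
  return sumNumeric(numeric) / numeric.length;
}

const column = [null, 4, NaN, 8, undefined, 12];
console.log(sumNumeric(column));  // 24
console.log(meanNumeric(column)); // 8
```

Returning null for an all-missing column is an assumption on my part; NaN would be just as defensible.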

The docs suggest that this is a supported feature, but the code seems to indicate otherwise.

Member

iros commented Oct 1, 2013

Part of the problem is deciding what should happen in this situation. How do you sum rows that have NAs in them? Is it still valid to sum across rows when some of them have no value? We cannot assume that those values should be counted as zeroes. Do you have a use case that you can suggest?

protobi commented Oct 11, 2013

An example would be average systolic blood pressure reading across multiple patient visits. It might not be measured every time, but the patient presumably still had one that was simply unobserved.
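As a rough sketch of that use case in plain JavaScript: the `groupMean` function, the column names, and the readings below are all made up for illustration, and none of it is the library's groupBy API. Each patient's mean is taken over the visits where a reading was actually recorded.

```javascript
// Hypothetical helper: group rows by keyColumn and average valueColumn,
// counting only rows where the value is a finite number.
function groupMean(rows, keyColumn, valueColumn) {
  const groups = new Map();
  for (const row of rows) {
    const key = row[keyColumn];
    if (!groups.has(key)) groups.set(key, { sum: 0, count: 0 });
    const value = row[valueColumn];
    if (typeof value === "number" && Number.isFinite(value)) {
      const group = groups.get(key);
      group.sum += value;
      group.count += 1;
    }
  }
  const result = {};
  for (const [key, { sum, count }] of groups) {
    // A patient with no observed readings gets null, not 0.
    result[key] = count > 0 ? sum / count : null;
  }
  return result;
}

// Illustrative data only.
const visits = [
  { patient: "A", systolic: 120 },
  { patient: "A", systolic: null },
  { patient: "A", systolic: 130 },
  { patient: "B", systolic: null },
  { patient: "B", systolic: 110 },
  { patient: "C", systolic: null },
];

console.log(groupMean(visits, "patient", "systolic"));
// { A: 125, B: 110, C: null }
```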

In R, it's handled this way:

  • mean(c(0, 5, NULL, 10, NULL, 15)) -> 7.5
  • sum(c(0, 5, NULL, 10, NULL, 15)) -> 30

R also differentiates NULL from NA (analogous to null and undefined):

  • mean(c(0, 5, NA, 10, NA, 15)) -> NA
  • sum(c(0, 5, NA, 10, NA, 15)) -> NA

Surveys can have cases where you might want missing values to be treated as zero in a mean, such as average wait time in a survey with skip patterns, e.g.

  • "Q1. Did you have to wait for the representative? [yes, no]" If Q1 = 'yes', then ask:
  • "Q2. How many minutes did you wait?"

But in that case the analyst would be expected to explicitly recode the missing values as zero, and would not expect a second kind of parameter for handling NA in the operand.
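A sketch of what that explicit recoding step could look like, with hypothetical helper names (`recodeMissing`, `meanOfObserved`) and made-up wait times. The mean itself only ever skips missing values; treating a skipped question as zero minutes is a separate, visible step the analyst takes beforehand.

```javascript
// Assumption: null, undefined and NaN all count as missing.
function isMissing(value) {
  return value === null || value === undefined || Number.isNaN(value);
}

// Hypothetical helper: replace every missing entry with an explicit value.
function recodeMissing(values, replacement) {
  return values.map((value) => (isMissing(value) ? replacement : value));
}

// Mean over the observed values only; null when nothing was observed.
function meanOfObserved(values) {
  const observed = values.filter((value) => !isMissing(value));
  if (observed.length === 0) return null;
  return observed.reduce((total, value) => total + value, 0) / observed.length;
}

// Illustrative Q2 answers; null means Q2 was skipped because Q1 was 'no'.
const waitMinutes = [10, null, 20, null];

console.log(meanOfObserved(waitMinutes));                   // 15
console.log(meanOfObserved(recodeMissing(waitMinutes, 0))); // 7.5
```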
