Suggestion of new function: describe_missing() #454

Open
rempsyc opened this issue Sep 2, 2023 · 1 comment

Comments

@rempsyc
Sponsor Member

rempsyc commented Sep 2, 2023

When writing (psychology) scientific papers, great care must be taken in reporting the state of item-level missing data for each psychological questionnaire. For example, Parent (2013) writes:

I recommend that authors (a) state their tolerance level for missing data by scale or subscale (e.g., “We calculated means for all subscales on which participants gave at least 75% complete data”) and then (b) report the individual missingness rates by scale per data point (i.e., the number of missing values out of all data points on that scale for all participants) and the maximum by participant (e.g., “For Attachment Anxiety, a total of 4 missing data points out of 100 were observed, with no participant missing more than a single data point”).

In order to comply with this recommendation, I have developed the function nice_na(), which summarizes NA values according to those guidelines. For each specified group of columns, it reports missing values both as counts and as percentages, and scales can be specified through regex. Reprex:

library(rempsyc)

# If the questionnaire items start with the same name, e.g.,
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  open_1 = fun(), open_2 = fun(), open_3 = fun(),
  extrovert_1 = fun(), extrovert_2 = fun(), extrovert_3 = fun(),
  agreeable_1 = fun(), agreeable_2 = fun(), agreeable_3 = fun()
)

head(df, 3)
#>     ID open_1 open_2 open_3 extrovert_1 extrovert_2 extrovert_3 agreeable_1
#> 1  idz      4     NA      1           5           6           1           7
#> 2 <NA>      9      4      3           1          10          NA           7
#> 3  idz      1      4      1           9           2          NA           8
#>   agreeable_2 agreeable_3
#> 1           7           9
#> 2           7           2
#> 3           7           8

# One can list the scale names directly:
nice_na(df, scales = c("ID", "open", "extrovert", "agreeable"))
#>                       var items na cells na_percent na_max na_max_percent
#> 1                   ID:ID     1  7    14      50.00      1            100
#> 2           open_1:open_3     3 11    42      26.19      3            100
#> 3 extrovert_1:extrovert_3     3 17    42      40.48      3            100
#> 4 agreeable_1:agreeable_3     3 10    42      23.81      3            100
#> 5                   Total    10 45   140      32.14     10            100
#>   all_na
#> 1      7
#> 2      3
#> 3      3
#> 4      3
#> 5      2

Created on 2023-09-02 with reprex v2.0.2
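
For readers who want to see how the columns in that table relate to each other, here is a minimal base R sketch that computes the same kinds of quantities for one scale. It is not the rempsyc implementation: summarise_missing() and the toy data frame are hypothetical names made up for illustration, and the meaning of each column is inferred from the output above.

```r
# Hypothetical helper (NOT the rempsyc implementation): summarise item-level
# missingness for the columns whose names match a regular expression.
summarise_missing <- function(data, pattern) {
  # Keep only the items belonging to the scale
  items <- data[, grepl(pattern, names(data)), drop = FALSE]
  # Number of missing items for each participant (row)
  na_per_row <- rowSums(is.na(items))
  n_cells <- nrow(items) * ncol(items)
  data.frame(
    items = ncol(items),                                   # items in the scale
    na = sum(na_per_row),                                  # missing cells
    cells = n_cells,                                       # total cells
    na_percent = round(100 * sum(na_per_row) / n_cells, 2),
    na_max = max(na_per_row),                              # worst participant
    na_max_percent = round(100 * max(na_per_row) / ncol(items), 2),
    all_na = sum(na_per_row == ncol(items))                # fully missing rows
  )
}

# Toy data (made up): four participants, three items of one scale
toy <- data.frame(
  open_1 = c(1, NA, 3, NA),
  open_2 = c(2, 5, NA, NA),
  open_3 = c(4, 1, 2, NA)
)

res <- summarise_missing(toy, "^open_")
print(res)
```

On the toy data this gives 5 missing values out of 12 cells (41.67%), a maximum of 3 missing items for a single participant (100% of the scale), and 1 participant with every item missing. The sketch does not handle a pattern that matches no columns.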


Would you like this function to migrate from rempsyc to datawizard?

For the name, I was thinking data_missing_items or just data_missing since it also works without scale items and it is similar to our other data_ functions like data_duplicated. It could also be describe_missing in line with describe_distribution (actually that one makes more sense I think).

@DominiqueMakowski
Member

describe_missing() is good I think. + a report() method in report to have a text version would be neat
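
To make the idea of a text version concrete, here is a rough sketch of what such output could look like, modelled on the Parent (2013) example sentence quoted in the issue. The helper missing_sentence() is hypothetical and does not exist in report or datawizard; an actual report() method might be designed quite differently.

```r
# Hypothetical helper (not part of report or datawizard): turn missingness
# counts for one scale into a sentence in the style of Parent (2013).
missing_sentence <- function(scale, na, cells, na_max) {
  sprintf(
    paste0(
      "For %s, a total of %d missing data points out of %d were observed, ",
      "with no participant missing more than %d data point%s."
    ),
    scale, na, cells, na_max,
    if (na_max == 1) "" else "s"  # pluralise "point" when needed
  )
}

# Values taken from the example sentence in the quoted recommendation
out <- missing_sentence("Attachment Anxiety", 4L, 100L, 1L)
cat(out, "\n")
```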
