Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FR: Option to load missing reasons / codes as separate columns #1525

Open
khusmann opened this issue Nov 27, 2023 · 1 comment
Open

FR: Option to load missing reasons / codes as separate columns #1525

khusmann opened this issue Nov 27, 2023 · 1 comment

Comments

@khusmann
Copy link

In many datasets, missing values are interlaced with data as codes or strings. read_delim() et al., presently have an option to replace these values with NA, but do not have an easy way to do anything else with these values. For example:

na_strings_csv <- "
person_id,age,favorite_color
1,20,BLUE
1,_DECLINED_ANSWER_,BLUE
2,21,RED
3,30,_TECHNICAL_ERROR_
4,31,_DECLINED_ANSWER_
5,41,RED
6,50,_TECHNICAL_ERROR_
"

To load these data we'd run read_csv(na_strings_csv, na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"))

But if we want to load the missing reasons as its own dataframe, or load the missing reasons in separate columns with a suffix, it requires loading the entire dataframe as text, filtering for the missing reasons and joining to the original dataframe. Given the extensive number of datasets that interlace values and missing reasons, it would be really nice to have an extra option on read_* to make this common task more ergonomic. I propose having a channels arg having the following functionality:

  1. channels="values" gives the default behavior, but channels="missing" loads the missing reasons.
read_csv(
  na_strings_csv,
  na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"),
  channels="missing"
)

## # A tibble: 7 × 3
##   person_id age               favorite_color   
##   <chr>     <chr>             <chr>            
## 1 <NA>      <NA>              <NA>             
## 2 <NA>      _DECLINED_ANSWER_ <NA>             
## 3 <NA>      <NA>              <NA>             
## 4 <NA>      <NA>              _TECHNICAL_ERROR_
## 5 <NA>      <NA>              _DECLINED_ANSWER_
## 6 <NA>      <NA>              <NA>             
## 7 <NA>      <NA>              _TECHNICAL_ERROR_
  1. Both values and missingness can be loaded simultaneously as separate columns by passing channels=c("values", "missing")
read_csv(
  na_strings_csv,
  na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"),
  channels=c("values", "missing"),
)

## # A tibble: 7 × 6
##   person_id_values age_values favorite_color_values person_id_missing
##              <int>      <dbl> <chr>                 <chr>            
## 1                1         20 BLUE                  <NA>             
## 2                1         NA BLUE                  <NA>             
## 3                2         21 RED                   <NA>             
## 4                3         30 <NA>                  <NA>             
## 5                4         31 <NA>                  <NA>             
## 6                5         41 RED                   <NA>             
## 7                6         50 <NA>                  <NA>             
## # ℹ 2 more variables: age_missing <chr>, favorite_color_missing <chr>
  1. Using a named vector for channels controls the suffixes of the columns
read_csv(
  na_strings_csv,
  na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"),
  channels=c(values="", missing="_missing"),
)
## # A tibble: 7 × 6
##   person_id   age favorite_color person_id_missing age_missing      
##       <int> <dbl> <chr>          <chr>             <chr>            
## 1         1    20 BLUE           <NA>              <NA>             
## 2         1    NA BLUE           <NA>              _DECLINED_ANSWER_
## 3         2    21 RED            <NA>              <NA>             
## 4         3    30 <NA>           <NA>              <NA>             
## 5         4    31 <NA>           <NA>              <NA>             
## 6         5    41 RED            <NA>              <NA>             
## 7         6    50 <NA>           <NA>              <NA>             
## # ℹ 1 more variable: favorite_color_missing <chr>

Loading values and missing reasons in separate columns like this greatly facilitates manipulation with tidyverse aggregation & filtering functions. For example, to find the average age of the of individuals that had technical errors reporting their favorite color:

read_csv(
  na_strings_csv,
  na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"),
  channels=c(values="", missing="_missing"),
) %>%
  filter(favorite_color_missing == "_TECHNICAL_ERROR_") %>%
  summarize(
    age = mean(age, na.rm=T)
  )

I provide some more examples here here (I also include an example / naive implementation of the above api).

Please let me know if this idea is of interest, and I'd be happy to work on a PR. Cheers!

@khusmann
Copy link
Author

I've spun this idea into its own package here: https://kylehusmann.com/interlacer/

The main limitation of the package is around column-level missing values. It forces me to load files as character vectors and then use type_convert() which doesn't have complete parity with the vroom read_* functions.

So even though this package works as a sort of stop-gap, I still think this kind of functionality would still be better built-in to readr / vroom.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant