Add support for loading missing values in resources as additional cols #161

Open
wants to merge 8 commits into
base: main

khusmann
Contributor

@khusmann khusmann commented Nov 8, 2023

Presently, all missing values loaded by read_resource become NAs in the resulting tibble. This means that when missing values encode reasons for missingness (e.g. "Participant refused item", "Participant absent"), these reasons are lost. In a lot of applications, we want access to these missing reasons because of the important contextual info they provide.

This pull request adds the ability to include missing reasons as separate columns when loading resources by adding an argument in read_resource to select which data "channel" the user wants to load: values, missing, or both. Resulting columns can be named via values_channel_suffix and missing_channel_suffix.

I'm using the term channel here because I think it's a powerful way of conceptualizing missing value data & metadata that can generalize to other types of data & formats. In the same way a color image has multiple channels for red, green, and blue pixel values, we can think of tabular data with missing reasons as having a channel for "values" and a channel for "missing reasons". What's nice about thinking of values and missingness as separate channels is that it enables us to work with them as separate types: when we interlace them as most formats do, everything becomes a string and we lose that useful type info.
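To make the channel idea concrete, here is a minimal base R sketch of splitting one interlaced column into a typed values channel and a separate missing-reasons channel. The data, the `missing_codes` vector, and the `age__values` / `age__missing` column names are illustrative assumptions, not output from read_resource.

```r
# Hypothetical raw column where values and missing reasons are interlaced,
# so everything arrives as a string.
raw <- c("23", "REFUSED", "41", "ABSENT", "35")

# Assumed set of missing-reason codes (illustrative only).
missing_codes <- c("REFUSED", "ABSENT")

is_missing <- raw %in% missing_codes

# Values channel: numeric, NA wherever a missing reason was recorded.
age_values <- as.numeric(ifelse(is_missing, NA, raw))

# Missing channel: factor of reasons, NA wherever a real value exists.
age_missing <- factor(ifelse(is_missing, raw, NA), levels = missing_codes)

df <- data.frame(age__values = age_values, age__missing = age_missing)

# Each channel keeps its own type, so numeric summaries work directly.
mean(df$age__values, na.rm = TRUE)
table(df$age__missing)
```

With the channels separated, the values stay numeric and the reasons stay categorical, rather than both living in a single character column.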

Unfortunately, no packages in R exist yet (that I'm aware of) to work with values & missingness as multichannel structures, and frictionless doesn't have support for multichannel tabular formats (yet). So this is why I add them as separate columns here. The closest we have to this ability in R is the tagged_na and labelled classes in haven, but these are limited to the particular ways SPSS / Stata / SAS encode their missing values, rather than enabling support for arbitrary missing reasons, as frictionless is able to represent.

In the long run it'd be nice to have an R package that could provide full support for multichannel missingness, but I think extra columns are the best we can do for now. In the meantime, we might also consider at some point adding support for converting to the tagged_na and haven_labelled types, in the special cases when missing reasons conform to the peculiarities of SPSS / Stata / SAS formats.

@khusmann
Contributor Author

khusmann commented Nov 8, 2023

I just thought of another way we might handle the API that could be pretty slick: taking unnamed and named vectors for channel selection & renaming.

Then we could select channels via:

channels = c("values")
channels = c("values", "missing")

and add column suffixes via:

channels = c(values = "", missing = "__missing")

This would generalize better to other multichannel formats down the road, e.g.:

channels = c("r", "g", "b")
channels = c(r = "__red", g = "__green", b = "__blue")

Thoughts? Other ideas?

@khusmann
Contributor Author

khusmann commented Nov 8, 2023

Just implemented the new API idea above, so read_resource now uses only the added channels arg.

channels = c("values") -> load values (normal, default behavior)
channels = c("missing") -> load missing reasons

channels = c("values", "missing") -> load values AND missing reasons. Append "__values" and "__missing" to columns respectively
channels = c(values="__value_suffix", missing="__missing_suffix") -> same as above, but with custom suffixes.

I think what's nice about appending both __values and __missing to columns when loading both values and missingness is that it makes pattern-based pivoting & other data wrangling a little easier by default (e.g. via ends_with("__values") and ends_with("__missing") dplyr selectors).
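As a rough sketch of how an unnamed or named `channels` vector could be resolved into column suffixes, consider the base R helper below. `resolve_channel_suffixes()` is a hypothetical name and not the function used in this branch; it also assumes that a single unnamed channel keeps the original column names.

```r
# Hypothetical helper: turn a `channels` argument into a named vector of
# suffixes (names = selected channels, values = suffix for each channel).
resolve_channel_suffixes <- function(channels = c("values")) {
  known <- c("values", "missing")
  if (is.null(names(channels))) {
    # Unnamed vector: the elements are the channel names.
    selected <- channels
    # Assumption: one channel keeps original column names;
    # several channels get "__<channel>" suffixes.
    suffixes <- if (length(selected) == 1) "" else paste0("__", selected)
  } else {
    # Named vector: names are channels, elements are custom suffixes.
    selected <- names(channels)
    suffixes <- unname(channels)
  }
  stopifnot(all(selected %in% known), !anyDuplicated(selected))
  stats::setNames(suffixes, selected)
}

suffixes <- resolve_channel_suffixes(c("values", "missing"))

# Column names that would result for two hypothetical fields.
cols <- as.vector(outer(c("age", "height"), suffixes, paste0))

# Base R equivalent of the ends_with("__missing") selector.
missing_cols <- cols[endsWith(cols, "__missing")]
```

The same suffix convention is what makes pattern-based selection work: every missing-reason column can be picked out by its ending alone.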

@peterdesmet
Member

@khusmann interesting use case

  1. Can you provide a small example dataset that can be used to test different approaches?

  2. Can you provide a reproducible example using the example above? I learn a lot from just seeing what the output looks like. :-)

  3. I use readr as a barometer for what functionality to consider when reading data. This use case gets us quite far from that. So I'm reluctant to implement (and maintain) something complicated that isn't adopted in other packages. 😅

  4. That said, I do understand that improved missing value interpretation would be very useful. I haven't thought about this as long as you have, but what about the following approach:

  • Have an argument in read_resource() to include (rather than convert) missing values
  • Missing values are included with a prefix: missing:NA, missing:Participant refused item
  • Columns get converted to string
  • User can manipulate data further
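For illustration, a column loaded under this prefix approach might look like the sketch below, along with the kind of further manipulation a user would do afterwards. The data and the exact `missing:` tag handling are assumptions based on the bullets above, not an implemented behavior of read_resource().

```r
# Hypothetical column as it might look with prefixed missing values;
# the whole column is character.
age <- c("23", "missing:Participant refused item", "41", "missing:NA")

prefix <- "missing:"
is_missing <- startsWith(age, prefix)

# Recover numeric values; prefixed entries become NA.
age_num <- rep(NA_real_, length(age))
age_num[!is_missing] <- as.numeric(age[!is_missing])

# Recover the reasons by stripping the prefix.
reasons <- ifelse(is_missing,
                  substring(age, nchar(prefix) + 1),
                  NA_character_)
```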

@peterdesmet peterdesmet added the enhancement and function:read_resource labels Nov 13, 2023
@khusmann
Contributor Author

Thanks for the feedback!

  1. Sure! I think this approach is deserving of a full vignette. I'll put one together... :)

  2. ^^

  3. For this feature, I'm thinking more along the lines of an analogy of this package to the haven package, in how it captures SPSS/Stata/SAS missing values with custom types / attributes. Until now I've been thinking of readr as more of a lower-level lib in this context, but you make a good point that this functionality may be a candidate for inclusion into readr proper instead of here... I'll have to think about that.

  4. My hesitation with that approach is how it loses type information (by converting everything to string). So subsequent manipulations end up relying on a lot of string operations with "magic" tags (like "missing:") and type conversions rather than working with the pure data, which adds a lot of brittle boilerplate to common manipulation tasks. I can show some of the pros / cons in the aforementioned vignette, once I get it together...

@khusmann
Contributor Author

Hi again! I've put together a vignette outlining my thoughts / justifications for this approach, framed as a proposal for addition to read_delim in the tidyverse. (I think you're right, it would be most ideal for it to be supported there, if possible). Any thoughts / feedback / other perspectives on this would be greatly appreciated! :)

@peterdesmet
Member

Hi @khusmann, nice work on the vignette! I would add a chunk at the beginning to load the packages you use (I think readr, stringr, dplyr), so it becomes repeatable for others.

Since we both agree this is a better feature for readr, I suggest you propose and clarify it as a feature there: https://github.com/tidyverse/readr/issues. The vignette will be useful.

@khusmann
Contributor Author

Thanks! Just updated my vignette with your suggestion. I'll make a post to readr with my vignette after the (USA) holiday to hopefully get more eyes on it.

Also updated this branch to move the channel selection logic into read_delim_ext in utils.R. This way, if readr eventually does implement this feature, it'll be a drop-in replacement here. My plan is also to use this as the basis for my implementation of value / missing labels (#148). One of the key features important to me in the value / missing label implementation is the ability to keep the value and missing labels separate. Otherwise, you get factor levels polluted with a bunch of missing reasons, and again rely on brittle string manipulations (& type conversions) to distinguish them. Keeping them separate gives the user much more flexibility -- in general, combining/interlacing channels is always trivial, but separating already interlaced channels requires context & gymnastics.
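A small base R sketch of the "combining is trivial" direction, using made-up data: given separate value and reason vectors, interlacing them takes one line and needs no extra context.

```r
# Hypothetical separate channels for one field.
values  <- c(23, NA, 41, NA)
reasons <- c(NA, "REFUSED", NA, "ABSENT")

# Interlace: take the reason wherever the value is missing.
# The result is necessarily character, so the numeric type is lost.
interlaced <- ifelse(is.na(values), reasons, as.character(values))
```

Going the other way from `interlaced` would require knowing which strings are reason codes and what type the remaining values should be parsed as, which is the context that the interlaced column no longer carries.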

@khusmann
Contributor Author

I mentioned this on slack, but putting here for reference: I've created a package for reading interlaced values & missing reasons that might be useful here: https://kylehusmann.com/interlacer/

Instead of appending "__values" and "__missing" like I did above, value columns retain their original names, and missing columns are surrounded by dots (e.g. .name.).

It wraps & extends readr's read_* functions and col_* types, so it'd be really easy to incorporate into read_resource(). I'm imagining we could add a flag deinterlace = TRUE that would load the missing values in a deinterlaced data frame, whereas deinterlace = FALSE (the default) would keep the original behavior.

It also handles field-level missing values via the extended icol_* collector types.

Anyway, the package is still in its infancy so it's not ready to be dropped in just yet -- but would appreciate any thoughts & feedback on the approach!

@peterdesmet peterdesmet added this to the 1.2.0 milestone Mar 27, 2024

2 participants