Feature suggestion: most() and assert_count_true() #539

Open
billdenney opened this issue Apr 27, 2023 · 6 comments

Comments

@billdenney
Collaborator

I sometimes get dirty data with multiple conflicting values, from which I need to choose one. In a recent example, I received a dataset where an individual had multiple values for their sex (both male and female, when they definitely did not undergo gender reassignment between the measurements).

To work with these types of issues, I think that two different types of functions can help:

most(x) is a companion to any() and all() from base R. It takes a logical vector, x, and returns TRUE if more than half of its values are TRUE.

assert_count_true(x, n) takes a logical vector x and the expected number of TRUE values, n. If sum(x) == n, it returns x. If a different number of values are TRUE, it raises an error indicating the mismatch in count.
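As a rough illustration of the first idea, here is a minimal sketch of what most() might look like. It is not an existing janitor function; the na.rm argument and the choice to return NA for empty or NA-containing input are assumptions made for this example.

```r
# Hypothetical sketch of most(); not part of janitor or base R.
most <- function(x, na.rm = FALSE) {
  stopifnot(is.logical(x))
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  # mean() of a logical vector is the proportion of TRUE values.
  # It is NA when x contains NA and NaN when x is empty, so the
  # comparison below yields NA in both of those cases.
  mean(x) > 0.5
}

most(c(TRUE, TRUE, FALSE))          # TRUE: 2 of 3 values are TRUE
most(c(TRUE, FALSE))                # FALSE: exactly half is not "most"
most(c(TRUE, NA), na.rm = TRUE)     # TRUE: the NA is dropped first
```

Whether an empty vector should give NA, FALSE, or an error is a design question this sketch does not settle.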

@matanhakim
Contributor

Is this suggestion still relevant?
If so, I might take a crack at most().

@billdenney
Collaborator Author

These are different from some of the typical janitor functions, so I'd like @sfirke to weigh in on whether they feel like a good fit.

@sfirke
Owner

sfirke commented Jan 30, 2024

I'm fine with adding most(). In your example, might you use it like:

dat %>%
  group_by(id) %>%
  filter(most(gender == "male"))

To get the data for all participants for whom most of their gender values are male? Just checking my understanding.
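For readers without dplyr at hand, the same grouped filter can be sketched in base R with ave(). The data frame and the one-line most() below are made up for illustration; most() is the proposed function, not an existing one.

```r
# Hypothetical one-line version of the proposed most().
most <- function(x) mean(x) > 0.5

# Made-up example data: participant 1 is mostly "male",
# participant 2 is mostly "female".
dat <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2),
  gender = c("male", "male", "female", "female", "female", "male")
)

# ave() applies most() within each id and recycles the single
# result across that group's rows, giving one logical per row.
keep <- ave(dat$gender == "male", dat$id, FUN = most)

# Keep all rows for participants whose values are mostly "male".
result <- dat[keep, ]
```

Here result holds the three rows for id 1, including the one recorded as "female", which matches the grouped filter() behaviour described above.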

Should it take an optional cutoff value that defaults to 0.5? Then maybe we would be talking about calling it at_least() ...

Not trying to muddy the waters, just want to get precise on design and use cases.

@sfirke
Owner

sfirke commented Jan 30, 2024

Could you share an example of using assert_count_true()? With mtcars or similar? I don't quite grasp how I would use it. There has been some talk about assertive checks in janitor. I can't remember what approach we landed on, but in general I support them. janitor::compare_df_cols_same is made for assertion.

@billdenney
Collaborator Author

billdenney commented Jan 30, 2024

I like at_least(x, fraction = 0.5) more than most() as it covers a more general case with no more user effort.
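A minimal sketch of how at_least(x, fraction = 0.5) could work is shown below. Nothing here exists in janitor yet: the na.rm argument and the input checks are assumptions, and so is the use of >= (which the name suggests), even though most() was described as "more than half". That boundary behaviour would need to be decided.

```r
# Hypothetical sketch of at_least(); not part of janitor.
at_least <- function(x, fraction = 0.5, na.rm = FALSE) {
  stopifnot(
    is.logical(x),
    is.numeric(fraction), length(fraction) == 1,
    fraction >= 0, fraction <= 1
  )
  # The proportion of TRUE values must reach the cutoff.
  # Assumption: ">=" is used, so exactly half passes at 0.5.
  mean(x, na.rm = na.rm) >= fraction
}

at_least(c(TRUE, FALSE))                                # TRUE: 0.5 >= 0.5
at_least(c(TRUE, TRUE, TRUE, FALSE), fraction = 0.75)   # TRUE: 0.75 >= 0.75
at_least(c(TRUE, FALSE, FALSE, FALSE), fraction = 0.5)  # FALSE: 0.25 < 0.5
```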

For assert_count_true(), I often use something like it in my data cleaning routines. I have data where I know that one particular row is bad. I want to make sure that I only match that one row and no more or fewer (or maybe 5 rows or...). My use case looks like:

cleaned_data <-
  data |>
  mutate(
    age =
      case_when(
        assert_count_true(Person == "Bill" & age == 40, n = 1) ~ 29, # The fountain of youth :)
        TRUE ~ age
      )
  )

My implementation looks something like the following. (I typed it directly into the issue, so it is not tested, and I may not be using deparse() correctly. The idea is to tell the user the actual called value for x.)

assert_count_true <- function(x, n = 1) {
  stopifnot(is.logical(x))
  # substitute() captures the expression the caller passed for x
  x_label <- deparse(substitute(x))
  if (any(is.na(x))) {
    stop(x_label, " has NA values")
  }
  if (sum(x) != n) {
    stop(x_label, " expected ", n, " TRUE values, but ", sum(x), " were found")
  }
  x
}

@billdenney
Collaborator Author

billdenney commented Jan 31, 2024

Here's some better, working code for assert_count_true() with more helpful and grammatically correct error messages:

assert_count_true <- function(x, n = 1) {
  stopifnot(is.logical(x))
  if (any(is.na(x))) {
    stop(deparse(substitute(x)), " has NA values")
  }
  if (sum(x) != n) {
    stop_message <-
      sprintf(
        "`%s` expected %g `TRUE` %s but %g %s found.",
        deparse(substitute(x)),
        n,
        ngettext(n, "value", "values"),
        sum(x),
        ngettext(sum(x), "was", "were")
      )
    stop(stop_message)
  }
  x
}

foo <- c(TRUE, TRUE, FALSE)
assert_count_true(foo, n = 1)
#> Error in assert_count_true(foo, n = 1): `foo` expected 1 `TRUE` value but 2 were found.

bar <- c("Bill", "Sam", "Matan")
assert_count_true(bar == "Bill", n = 1)
#> [1]  TRUE FALSE FALSE

bar <- c("Bill", "Sam", "Matan")
assert_count_true(bar == "Bill", n = 2)
#> Error in assert_count_true(bar == "Bill", n = 2): `bar == "Bill"` expected 2 `TRUE` values but 1 was found.

Created on 2024-01-31 with reprex v2.0.2
