[Feature]: validate column name existence #87

Open
1 task done
nick-youngblut opened this issue Sep 15, 2023 · 0 comments
Labels
enhancement New feature or request

Guidelines

  • I agree to follow this project's Contributing Guidelines.

Description

A modified version of the README example:

library(magrittr)
library(data.validator)

report <- data_validation_report()

validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  validate_cols(in_set(c(0, 2)), WRONG_COLUMN_NAME, vs, am, description = "vs and am values equal 0 or 2 only") %>%
  validate_cols(within_n_sds(1), mpg, description = "mpg within 1 sds") %>%
  validate_rows(num_row_NAs, within_bounds(0, 2), vs, am, mpg, description = "not too many NAs in rows") %>%
  validate_rows(maha_dist, within_n_mads(10), everything(), description = "maha dist within 10 mads") %>%
  add_results(report)

print(report)

The error:

> validate(mtcars, name = "Verifying cars dataset") %>%
+   validate_if(drat > 0, description = "Column drat has only positive values") %>%
+   validate_cols(in_set(c(0, 2)), WRONG_COLUMN_NAME, vs, am, description = "vs and am values equal 0 or 2 only") %>%
+   validate_cols(within_n_sds(1), mpg, description = "mpg within 1 sds") %>%
+   validate_rows(num_row_NAs, within_bounds(0, 2), vs, am, mpg, description = "not too many NAs in rows") %>%
+   validate_rows(maha_dist, within_n_mads(10), everything(), description = "maha dist within 10 mads") %>%
+   add_results(report)
Error in `dplyr::select()` at assertr/R/assertions.R:102:2:
! Can't subset columns that don't exist.
✖ Column `WRONG_COLUMN_NAME` doesn't exist.

As far as I can tell, if the user provides a table in which a validated column doesn't exist, then the validate workflow throws an error instead of producing a report stating validation failed due to missing required columns.

Problem

data.validator does not check that the columns being validated exist in the provided data.frame.
As a result, the column-existence check has to be done outside of the validation report workflow.
The feedback to the user is then split into at least two parts: 1) a check for the required columns and 2) the validation report, instead of a single all-encompassing validation report.

Proposed Solution

Include assertr::has_all_names in the validation report, or if that is already possible, provide an example in the package README.
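
For reference, a minimal sketch of how has_all_names is used with plain assertr, outside of data.validator. It assumes assertr's verify() and error_append() behave as documented. Whether the same check can be routed into a data.validator report is not confirmed here. The variable name `checked` is only illustrative.

```r
library(assertr)

# Check column existence with assertr alone. error_append is used so that
# a failed check is recorded on the returned data instead of stopping.
checked <- verify(
  mtcars,
  has_all_names("WRONG_COLUMN_NAME", "vs", "am"),
  error_fun = error_append
)

# Failures recorded by error_append are stored in this attribute.
failures <- attr(checked, "assertr_errors")
print(length(failures))
```

The point of the sketch is that a missing column becomes a recorded failure rather than an error, which is the behavior I'd like to see in the validation report.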

Alternatives Considered

I'm currently validating the existence of the required columns prior to using data.validator, and providing user feedback on column existence via shiny::showNotification().
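
A rough sketch of that kind of pre-check in base R. The helper `missing_columns()` and the `required` vector are hypothetical names made up for this example. They are not part of data.validator or assertr.

```r
# Hypothetical helper: returns the required column names
# that are absent from `data`.
missing_columns <- function(data, required) {
  setdiff(required, names(data))
}

required <- c("WRONG_COLUMN_NAME", "vs", "am")
missing <- missing_columns(mtcars, required)

if (length(missing) > 0) {
  # In a Shiny app, this text could be passed to shiny::showNotification()
  message("Missing required columns: ", paste(missing, collapse = ", "))
}
```

The data.validator pipeline would then only be run when `missing` is empty, which is exactly the two-step feedback described under Problem.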

@nick-youngblut nick-youngblut added the enhancement New feature or request label Sep 15, 2023