[Feature]: make flexible validations #76

Open
1 task done
nbbn opened this issue Jul 27, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@nbbn
nbbn commented Jul 27, 2023

Guidelines

  • I agree to follow this project's Contributing Guidelines.

Description

data.validator strongly follows the idea of a table and validations that run on that table.
IMO this doesn't fit most use cases.

E.g. I do:

validate(data.frame(), name = "Comparing testing vs postgres data") |>
  validate_if(
    identical(
      names(get_cols(...)),
      names(get_cols(...))
    ),
    description = "Column names are the same in 1 table"
  ) |>
  validate_if(
    identical(
      as.vector(get_cols(...)),
      as.vector(get_cols(...))
    ),
    description = "Column types are the same in 1 table"
  ) |>
  add_results(report)

As you can see, I have to pass an empty data frame to validate(), even though I don't use it.

Then, when I run print(report), I see:

|table_name                   |description                          |type    | total_violations|
|:----------------------------|:------------------------------------|:-------|----------------:|
|Comparing testing vs ci data |Column names are the same in 1 table |success |               NA|
|Comparing testing vs ci data |Column names are the same in 1 table |success |               NA|

The column name table_name doesn't make sense to me in this situation. Maybe it should be Group?

Also, Violated data doesn't work with this flexible approach.

Another example from practice

We used data.validator to show rows that are returned by queries. The queries were built so that they return only invalid rows, and nothing is returned if there is no invalid data. More documentation on how to adapt data.validator to cases like this would be nice.
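
To make this use case concrete, here is a minimal base R sketch of the kind of summary we wanted. It does not use data.validator at all: `summarise_invalid()`, the `group` column, and the "success"/"error" labels are hypothetical names chosen for illustration, and the two small data frames stand in for real query results.

```r
# Hypothetical helper (not part of data.validator): each element of
# `invalid_rows` is the result of one query that returns only invalid rows.
summarise_invalid <- function(invalid_rows, group) {
  # Number of returned rows per query; zero rows means the check passed.
  counts <- unname(vapply(invalid_rows, nrow, integer(1)))
  data.frame(
    group = group,
    description = names(invalid_rows),
    type = ifelse(counts == 0L, "success", "error"),
    total_violations = counts,
    stringsAsFactors = FALSE
  )
}

# Stand-ins for query results; a real setup would fetch these from a database.
query_results <- list(
  "No negative prices" = data.frame(id = integer(0), price = numeric(0)),
  "No missing customer ids" = data.frame(
    order_id = c(7L, 9L),
    customer_id = c(NA, NA)
  )
)

report <- summarise_invalid(query_results, group = "Orders checks")
print(report)
```

The first check yields no rows and is marked as a success; the second returns two rows and is counted as two violations. The returned rows themselves are still available in `query_results`, which is what we would expect a Violated data view to show.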

Problem

My use of this package doesn't fit its standard usage. I think the package should be more flexible and allow validations based on multiple data frames without specifying them explicitly in the validate() call.

Proposed Solution

  1. Change the column names in the report object.
  2. Remove the requirement to pass a data frame to validate().
  3. Update the docs with examples of more advanced and customized use cases.

Alternatives Considered

Stick to what you have, and state explicitly in the docs that the package is dedicated to working with data frames.

@nbbn nbbn added the enhancement New feature or request label Jul 27, 2023
@D3SL
D3SL commented Aug 2, 2023

It's not just that data.validator works only with data frames; it works only with columns and rows. I may have missed an obvious solution, but from what I've seen there's no way to do something like your names() check, which operates at the data frame level.
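
For reference, this is roughly what a check at the data frame level looks like in plain base R, outside of data.validator. It is only a sketch: `compare_structure()` is a hypothetical helper, and the two data frames are made-up stand-ins for the testing and postgres tables from the original example.

```r
# Hypothetical helper (not part of data.validator): compares two data frames
# at the table level and returns one named logical per check.
compare_structure <- function(a, b) {
  # First class of each column, as a character vector.
  col_types <- function(df) {
    unname(vapply(df, function(col) class(col)[1], character(1)))
  }
  c(
    same_names = identical(names(a), names(b)),
    same_types = identical(col_types(a), col_types(b))
  )
}

# Made-up example tables: `id` is integer in one and double in the other.
testing  <- data.frame(id = 1:3, label = c("a", "b", "c"))
postgres <- data.frame(id = c(1, 2, 3), label = c("a", "b", "c"))

checks <- compare_structure(testing, postgres)
print(checks)
```

Here the names check passes and the types check fails, because `id` is an integer column in one table and a double column in the other. Neither result is tied to a particular row or column, which is why it doesn't map cleanly onto row-wise or column-wise validations.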
