I agree to follow this project's Contributing Guidelines.
Description
data.validator is built firmly around the idea of a single table and validations that run on that table.
IMO this doesn't fit most use cases.
For example, I do:
validate(data.frame(), name = "Comparing testing vs postgres data") |>
  validate_if(
    identical(
      names(get_cols(...)),
      names(get_cols(...))
    ),
    description = "Column names are the same in 1 table"
  ) |>
  validate_if(
    identical(
      as.vector(get_cols(...)),
      as.vector(get_cols(...))
    ),
    description = "Column types are the same in 1 table"
  ) |>
  add_results(report)
As you can see, I have to pass an empty data frame to validate() even though I never use it.
Then, when I run print(report), I see:
|table_name |description |type | total_violations|
|:----------------------------------|:-------------------------------------------------|:-------|----------------:|
|Comparing testing vs ci data |Column names are the same in 1 table |success | NA|
|Comparing testing vs ci data |Column names are the same in 1 table |success | NA|
The column name table_name doesn't make sense to me in this situation. Maybe it should be Group?
Also, Violated data doesn't work with this flexible approach.
Another example from practice
We used data.validator to show rows returned by queries. The queries were built so that they return only invalid rows, and nothing is returned when there is no invalid data. More documentation on how to adapt data.validator to cases like this would be nice.
Problem
My use of this package doesn't fit its standard usage. I think the package should be more flexible and allow validations based on multiple data frames without specifying them explicitly in the validate() call.
Proposed Solution
Change the column names in the report object.
Remove the requirement to pass a data frame to validate().
Update the docs with examples of more advanced and customized use cases.
Alternatives Considered
Stick to what you have, and state explicitly in the docs that the package is dedicated to working with data frames.
It's not just that data.validator works only with data frames; it works only with columns and rows. I may have missed an obvious solution, but from what I've seen there's no way to do something like your names() check, which operates at the data frame level.