Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] expect_column_most_common_value_to_be_in_set handling of ties #258

Open
2 tasks done
lookslikeitsnot opened this issue May 17, 2023 · 0 comments
Open
2 tasks done

Comments

@lookslikeitsnot
Copy link
Contributor

Is this a new bug in dbt-expectations?

  • I believe this is a new bug in dbt-expectations
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When most common values are non-unique, running expect_column_most_common_value_to_be_in_set only works for 1 of the most common values.

Expected Behavior

expect_column_most_common_value_to_be_in_set should match all most common values

Steps To Reproduce

in dbt-expectations/integration_tests/models/schema_tests/schema.yml, add following column test for model data_test column col_string_b.:

          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['ab']
              top_n: 1

Although all values are distinct in that model column (i.e. all column values are "the most common"), any value in value_set different from 'b' (the first result naturally ordered) will fail.

Relevant log output

with value_counts as (
    select
        col_string_b as value_field,
        count(*) as value_count
    from dbt_expectations_integration_tests.data_test
    group by col_string_b
),
value_counts_ranked as (
    select
        *,
        row_number() over(order by value_count desc) as value_count_rank
    from
        value_counts
),
value_count_top_n as (
    select
        value_field
    from
        value_counts_ranked
    where
        value_count_rank = 1
),
set_values as (
    select
        'ab' as value_field
),
unique_set_values as (
    select distinct value_field
    from
        set_values
),
validation_errors as (
    -- values from the model that are not in the set
    select
        value_field
    from
        value_count_top_n
    where
        value_field not in (select value_field from unique_set_values)
)
select *
from validation_errors

Environment

- OS: Manjaro Linux 22.0.2
- Python: 3.10.9
- dbt: 1.5.0
- dbt-expectations: 0.8.5

Which database adapter are you using with dbt?

Postgres

Note: dbt-expectations currently does not support database adapters other than the ones listed below.

  • Postgres
  • Snowflake
  • BigQuery

Additional Context

Great expectations has an additional ties_okay argument for partial matches. We could also add support for that.

So following tests would succeed

            # Expect error if not all most common values are in the set
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['b']
              top_n: 1
              config:
                error_if: "=0"
                warn_if: "<3"
            # Expect success if not all most common values are in the set but ties_okay is set
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['b']
              top_n: 1
              ties_okay: true
            # Expect error if none of the most common values are in the set and ties_okay is set
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['invalid_value']
              top_n: 1
              ties_okay: true
              config:
                error_if: "=0"
                warn_if: "<4"
            # Expect success if not all most common values are in the set but ties_okay is set
            # and the set contains extra values 
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['b', 'invalid_value']
              top_n: 1
              ties_okay: true
            # Expect success if not all most common values are in the set but ties_okay is set
            # and value is not first one of the column naturally ordered
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['ab']
              top_n: 1
              ties_okay: true
            # Expect success if all most common values are in the set
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['b', 'ab', 'abc', 'abcd']
              top_n: 1
            # Expect success if all most common values are in the set 
            # and the set contains extra values 
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['b', 'ab', 'abc', 'abcd', 'invalid_value']
              top_n: 1
            # Expect error if none of the most common values are in the set 
            # and the set contains extra values 
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['invalid_value1', 'invalid_value2', 'invalid_value3', 'invalid_value4', 'invalid_value5']
              top_n: 1
              config:
                error_if: "=0"
                warn_if: "<4"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant