Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL for samples query mishandles OR condition in filter #2021

Open
abargar opened this issue Feb 14, 2024 · 4 comments
Open

SQL for samples query mishandles OR condition in filter #2021

abargar opened this issue Feb 14, 2024 · 4 comments

Comments

@abargar
Copy link

abargar commented Feb 14, 2024

When adding a SodaCL missing_count check with an OR condition in the filter, we get inaccurate results for the samples. In particular, we discovered that this OR condition can override the dataset filter SQL.

We found this stems from how the samples SQL is generated. The SQL to get the number of failed rows is not affected because it uses COUNT(CASE WHEN... rather than placing all filters on a single line.

Example Soda check:

filter db.table [dataset_filter]:
    where: db_filter = '${variable}'

checks for db.table [dataset_filter]:
    - missing_count(value_column) = 0:
            filter: column_a IS NOT NULL OR column_b IS NOT NULL

Example samples query:

SELECT * FROM db.table 
 WHERE db_filter = ‘variable’ AND value_column IS NULL AND column_a IS NOT NULL OR column_b IS NOT NULL 
 LIMIT 5

This leads to us getting results where column_b IS NOT NULL is true regardless of the other conditions.

The accurate query is as follows:

SELECT * FROM db.table 
 WHERE db_filter = ‘variable’ AND value_column IS NULL AND (column_a IS NOT NULL OR column_b IS NOT NULL)
 LIMIT 5

The correct SQL can be generated by adding parentheses to the check filter, as so:

checks for db.table [dataset_filter]:
    - missing_count(value_column) = 0:
            filter: (column_a IS NOT NULL OR column_b IS NOT NULL)

However, this is an unexpected behavior and ideally the parentheses would not be required.

@tools-soda
Copy link

SAS-2863

@denisdallinga
Copy link

denisdallinga commented Feb 14, 2024

from pyspark.sql import SparkSession
from soda.scan import Scan
from soda.sampler.sampler import Sampler, SampleContext

# custom sampler to print the samples
class CountRowsSampler(Sampler):
    def store_sample(self, sample_context: SampleContext):
        print(f'samples: {sample_context.sample.get_rows()}')

spark_session = SparkSession.builder.config('spark.app.name', 'pytest-spark').getOrCreate()

data = [[ 'banana', 10, -1 ]] * 3 + [[ 'apples', 8, -1 ]] * 3
df = spark_session.createDataFrame(data=data, schema=['fruit', 'sweetness_score', 'umami_score'])
df.createOrReplaceTempView('dataframe')

# setup the scan
scan = Scan()
scan.add_spark_session(spark_session=spark_session)
scan.set_data_source_name('spark_df')

check = """
filter dataframe [bananas_only]:
    where: fruit = 'banana'

checks for dataframe [bananas_only]:
    - missing_count(umami_score) = 0:
        missing values: [-1]
        filter: sweetness_score = 8 OR sweetness_score = 10
"""
scan.add_sodacl_yaml_str(check)
scan.sampler = CountRowsSampler()

scan.execute()

output:

samples: (['banana', 10, -1], ['banana', 10, -1], ['banana', 10, -1], ['apples', 8, -1], ['apples', 8, -1], ['apples', 8, -1])

You can see here the sampler includes apples where the dataset wide filter should filter for just bananas

@m1n0
Copy link
Contributor

m1n0 commented Feb 14, 2024

Thanks for reporting @abargar ! This indeed seems to be easily fixable by putting the check filter into parentheses, would you be open to contribute a fix here? It should be very straightforward - add parentheses to all check types when filter is applied + add a test for each test type.

@abargar
Copy link
Author

abargar commented Feb 15, 2024

Hi, sure I'd be happy to! I'm more likely to have availability for this next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants