SQL for samples query mishandles OR condition in filter #2021

abargar · 2024-02-14T13:20:40Z

When adding a SodaCL missing_count check with an OR condition in the filter, we get inaccurate results for the samples. In particular, we discovered that this OR condition can override the dataset filter SQL.

We found this stems from how the samples SQL is generated. The SQL to get the number of failed rows is not affected because it uses COUNT(CASE WHEN... rather than placing all filters on a single line.

Example Soda check:

filter db.table [dataset_filter]:
    where: db_filter = '${variable}'

checks for db.table [dataset_filter]:
    - missing_count(value_column) = 0:
            filter: column_a IS NOT NULL OR column_b IS NOT NULL

Example samples query:

SELECT * FROM db.table 
 WHERE db_filter = ‘variable’ AND value_column IS NULL AND column_a IS NOT NULL OR column_b IS NOT NULL 
 LIMIT 5

This leads to us getting results where column_b IS NOT NULL is true regardless of the other conditions.

The accurate query is as follows:

SELECT * FROM db.table 
 WHERE db_filter = ‘variable’ AND value_column IS NULL AND (column_a IS NOT NULL OR column_b IS NOT NULL)
 LIMIT 5

The correct SQL can be generated by adding parentheses to the check filter, as so:

checks for db.table [dataset_filter]:
    - missing_count(value_column) = 0:
            filter: (column_a IS NOT NULL OR column_b IS NOT NULL)

However, this is an unexpected behavior and ideally the parentheses would not be required.

The text was updated successfully, but these errors were encountered:

tools-soda · 2024-02-14T13:21:14Z

SAS-2863

denisdallinga · 2024-02-14T13:30:16Z

from pyspark.sql import SparkSession
from soda.scan import Scan
from soda.sampler.sampler import Sampler, SampleContext

# custom sampler to print the samples
class CountRowsSampler(Sampler):
    def store_sample(self, sample_context: SampleContext):
        print(f'samples: {sample_context.sample.get_rows()}')

spark_session = SparkSession.builder.config('spark.app.name', 'pytest-spark').getOrCreate()

data = [[ 'banana', 10, -1 ]] * 3 + [[ 'apples', 8, -1 ]] * 3
df = spark_session.createDataFrame(data=data, schema=['fruit', 'sweetness_score', 'umami_score'])
df.createOrReplaceTempView('dataframe')

# setup the scan
scan = Scan()
scan.add_spark_session(spark_session=spark_session)
scan.set_data_source_name('spark_df')

check = """
filter dataframe [bananas_only]:
    where: fruit = 'banana'

checks for dataframe [bananas_only]:
    - missing_count(umami_score) = 0:
        missing values: [-1]
        filter: sweetness_score = 8 OR sweetness_score = 10
"""
scan.add_sodacl_yaml_str(check)
scan.sampler = CountRowsSampler()

scan.execute()

output:

samples: (['banana', 10, -1], ['banana', 10, -1], ['banana', 10, -1], ['apples', 8, -1], ['apples', 8, -1], ['apples', 8, -1])

You can see here the sampler includes apples where the dataset wide filter should filter for just bananas

m1n0 · 2024-02-14T18:10:20Z

Thanks for reporting @abargar ! This indeed seems to be easily fixable by putting the check filter into parentheses, would you be open to contribute a fix here? It should be very straightforward - add parentheses to all check types when filter is applied + add a test for each test type.

abargar · 2024-02-15T14:49:28Z

Hi, sure I'd be happy to! I'm more likely to have availability for this next week.

abargar mentioned this issue Mar 12, 2024

Add parenthesis to check filter #2033

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SQL for samples query mishandles OR condition in filter #2021

SQL for samples query mishandles OR condition in filter #2021

abargar commented Feb 14, 2024 •

edited

tools-soda commented Feb 14, 2024

denisdallinga commented Feb 14, 2024 •

edited

m1n0 commented Feb 14, 2024

abargar commented Feb 15, 2024

SQL for samples query mishandles OR condition in filter #2021

SQL for samples query mishandles OR condition in filter #2021

Comments

abargar commented Feb 14, 2024 • edited

tools-soda commented Feb 14, 2024

denisdallinga commented Feb 14, 2024 • edited

m1n0 commented Feb 14, 2024

abargar commented Feb 15, 2024

abargar commented Feb 14, 2024 •

edited

denisdallinga commented Feb 14, 2024 •

edited