New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL for samples query mishandles OR condition in filter #2021
Comments
SAS-2863 |
from pyspark.sql import SparkSession
from soda.scan import Scan
from soda.sampler.sampler import Sampler, SampleContext
# custom sampler to print the samples
class CountRowsSampler(Sampler):
def store_sample(self, sample_context: SampleContext):
print(f'samples: {sample_context.sample.get_rows()}')
spark_session = SparkSession.builder.config('spark.app.name', 'pytest-spark').getOrCreate()
data = [[ 'banana', 10, -1 ]] * 3 + [[ 'apples', 8, -1 ]] * 3
df = spark_session.createDataFrame(data=data, schema=['fruit', 'sweetness_score', 'umami_score'])
df.createOrReplaceTempView('dataframe')
# setup the scan
scan = Scan()
scan.add_spark_session(spark_session=spark_session)
scan.set_data_source_name('spark_df')
check = """
filter dataframe [bananas_only]:
where: fruit = 'banana'
checks for dataframe [bananas_only]:
- missing_count(umami_score) = 0:
missing values: [-1]
filter: sweetness_score = 8 OR sweetness_score = 10
"""
scan.add_sodacl_yaml_str(check)
scan.sampler = CountRowsSampler()
scan.execute() output:
You can see here the sampler includes apples where the dataset wide filter should filter for just bananas |
Thanks for reporting @abargar ! This indeed seems to be easily fixable by putting the check filter into parentheses, would you be open to contribute a fix here? It should be very straightforward - add parentheses to all check types when filter is applied + add a test for each test type. |
Hi, sure I'd be happy to! I'm more likely to have availability for this next week. |
When adding a SodaCL
missing_count
check with an OR condition in the filter, we get inaccurate results for the samples. In particular, we discovered that this OR condition can override the dataset filter SQL.We found this stems from how the samples SQL is generated. The SQL to get the number of failed rows is not affected because it uses
COUNT(CASE WHEN...
rather than placing all filters on a single line.Example Soda check:
Example samples query:
This leads to us getting results where
column_b IS NOT NULL
is true regardless of the other conditions.The accurate query is as follows:
The correct SQL can be generated by adding parentheses to the check filter, as so:
However, this is an unexpected behavior and ideally the parentheses would not be required.
The text was updated successfully, but these errors were encountered: