QualityReport with CorrelationSimilarity to a column that contains only nans generates a ValueError #351

Open
pvk-developer opened this issue May 25, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@pvk-developer
Member

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDMetrics version: 0.10 and below
  • Python version: Any
  • Operating System: Any

Error Description

The quality report is expected to be fault tolerant: if a single metric crashes during computation, the report should catch the error, report NaN for that metric, and continue with the remaining metrics. However, when a column contains only NaN or null values, CorrelationSimilarity raises the following error:

ValueError: x and y must have length at least 2.

The quality report does not capture this ValueError, because its handler only catches IncomputableMetricError:

try:
    self._metric_results[metric.__name__] = metric.compute_breakdown(
        real_data, synthetic_data, metadata)
except IncomputableMetricError:
    # Metric is not compatible with this dataset.
    self._metric_results[metric.__name__] = {}
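
To illustrate why the handler above lets the error through, here is a minimal, self-contained sketch. It does not import sdmetrics: `IncomputableMetricError` is a local stand-in (assumed to be a plain `Exception` subclass), and `failing_metric`, `run_narrow`, and `run_broad` are hypothetical names used only for this example.

```python
class IncomputableMetricError(Exception):
    """Local stand-in for the sdmetrics exception (assumed to subclass Exception)."""


def failing_metric():
    # Mimics the ValueError raised inside the correlation function.
    raise ValueError('x and y must have length at least 2.')


def run_narrow(metric):
    """Mirror the report's handler: only IncomputableMetricError is caught."""
    try:
        return metric()
    except IncomputableMetricError:
        return {}


def run_broad(metric):
    """Hypothetical fault-tolerant variant: any failure is recorded instead of raised."""
    try:
        return metric()
    except Exception as error:
        return {'error': f'{type(error).__name__}: {error}'}


# With the narrow handler, the ValueError propagates to the caller.
try:
    run_narrow(failing_metric)
    escaped = False
except ValueError:
    escaped = True

# With the broad handler, the failure is captured in the result.
broad_result = run_broad(failing_metric)

print(escaped)
print(broad_result)
```

Whether the report should catch every exception or only a wider, explicit set of exception types is a design choice this sketch does not settle.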

Steps to reproduce

import pandas as pd
import numpy as np

real = pd.DataFrame({'a': [np.nan, np.nan, np.nan], 'b': [1, 2, 3]})
synth = pd.DataFrame({'a': [0, 1, 2], 'b': [1, 2, 3]})

from sdmetrics.reports.single_table import QualityReport
report = QualityReport()
metadata = {'columns': {'a': {'sdtype': 'numerical'}, 'b': {'sdtype': 'numerical'}}}

report.generate(real, synth, metadata)
Creating report:  50%|████████████████████████████████████████████████████████████████████████████████████████████▌                                                                                            | 2/4 [00:00<00:00, 329.25it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

----> 1 report.generate(real, synth, metadata)

File ~/Projects/SDV/SDMetrics/sdmetrics/reports/single_table/quality_report.py:77, in QualityReport.generate(self, real_data, synthetic_data, metadata, verbose)
     75 for metric in tqdm.tqdm(metrics, desc='Creating report', disable=(not verbose)):
     76     try:
---> 77         self._metric_results[metric.__name__] = metric.compute_breakdown(
     78             real_data, synthetic_data, metadata)
     79     except IncomputableMetricError:
     80         # Metric is not compatible with this dataset.
     81         self._metric_results[metric.__name__] = {}

File ~/Projects/SDV/SDMetrics/sdmetrics/single_table/multi_column_pairs.py:129, in MultiColumnPairsMetric.compute_breakdown(cls, real_data, synthetic_data, metadata, **kwargs)
    127     real = real_data[list(sorted_columns)]
    128     synthetic = synthetic_data[list(sorted_columns)]
--> 129     breakdown[sorted_columns] = cls.column_pairs_metric.compute_breakdown(
    130         real, synthetic, **kwargs)
    132 return breakdown

File ~/Projects/SDV/SDMetrics/sdmetrics/column_pairs/statistical/correlation_similarity.py:103, in CorrelationSimilarity.compute_breakdown(cls, real_data, synthetic_data, coefficient)
     99 else:
    100     raise ValueError(f'requested coefficient {coefficient} is not valid. '
    101                      'Please choose either Pearson or Spearman.')
--> 103 correlation_real, _ = correlation_fn(real_data[column1], real_data[column2])
    104 correlation_synthetic, _ = correlation_fn(synthetic_data[column1], synthetic_data[column2])
    106 if np.isnan(correlation_real) or np.isnan(correlation_synthetic):

File ~/.virtualenvs/SDMetrics/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4411, in pearsonr(x, y, alternative)
   4408     raise ValueError('x and y must have the same length.')
   4410 if n < 2:
-> 4411     raise ValueError('x and y must have length at least 2.')
   4413 x = np.asarray(x)
   4414 y = np.asarray(y)

ValueError: x and y must have length at least 2.
@pvk-developer added the bug and new labels and removed the new label on May 25, 2023
@npatki
Contributor

npatki commented Jun 7, 2023

Requirements:

  1. The base CorrelationSimilarity metric should raise an error when a column contains only NaN values, because the correlation is undefined in that case.
  2. The Quality Report should catch such errors more reliably, potentially surfacing them as warnings and then moving on to the other metrics. The report should not crash.

I believe (2) will be taken care of by the updated Column Pair Trends property, as described in issue #356 (single table) and #358 (multi table).
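
The two requirements above could be sketched roughly as follows. This is only an illustration using the standard library, not the sdmetrics implementation: `check_not_all_nan`, `guarded_pair_metric`, `column_count`, and `generate` are hypothetical names, and the choice to record NaN for a failed metric is an assumption taken from the issue description.

```python
import math
import warnings


def check_not_all_nan(values, column_name):
    """Hypothetical guard for requirement 1: reject columns that hold only NaN.

    An empty column also trips the guard, since all() of nothing is True.
    """
    if all(isinstance(v, float) and math.isnan(v) for v in values):
        raise ValueError(
            f"Column '{column_name}' contains only NaN values; "
            'the correlation is undefined.')


def guarded_pair_metric(data):
    """Stand-in for a column-pair metric that validates its inputs first."""
    for name, values in data.items():
        check_not_all_nan(values, name)
    return 1.0  # Placeholder score; no real correlation is computed here.


def column_count(data):
    """Stand-in for an unrelated metric that always succeeds."""
    return len(data)


def generate(metrics, data):
    """Hypothetical report loop for requirement 2: warn, record NaN, continue."""
    results = {}
    for name, metric in metrics.items():
        try:
            results[name] = metric(data)
        except Exception as error:
            warnings.warn(f'{name} could not be computed: {error}')
            results[name] = math.nan
    return results


metrics = {
    'guarded_pair_metric': guarded_pair_metric,
    'column_count': column_count,
}
data = {'a': [math.nan, math.nan, math.nan], 'b': [1.0, 2.0, 3.0]}

# Record warnings so the example can inspect them instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    results = generate(metrics, data)

print(results)
print([str(w.message) for w in caught])
```

In this sketch the failing metric ends up as NaN in the results and produces one warning, while the remaining metric is still computed.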
