Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Wilson Score Confidence Interval Strategy #567

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

zeotuan
Copy link
Contributor

@zeotuan zeotuan commented May 6, 2024

Fixes #563

Description of changes:

  • Update RetainCompletenessRules and FractionalCategoricalRangeRule to accept and configure ConfidenceIntervalStrategy parameter
  • Add Wilson Score Confidence Interval Strategy and Wald Interval Strategy (current default)
  • Make Wilson Score Confidence Interval Strategy the new default method

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@zeotuan zeotuan changed the title Tpm/interval strategy Adding Wilson Score Confidence Interval Strategy May 6, 2024
@zeotuan
Copy link
Contributor Author

zeotuan commented May 13, 2024

@rdsharma26 Hi can you help review this PR.

object ConfidenceIntervalStrategy {
val defaultConfidence = 0.95

case class ConfidenceInterval(lowerBound: Double, upperBound: Double)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently also calculate upperBound for these ConfidenceInterval. At the moment we don't actually make use of the upperBound though

@rdsharma26
Copy link
Contributor

Thank you for the PR! @zeotuan
The changes look good to me. Can you fix the failing build ?

One point I would like to discuss is making the Wilson Score Confidence Interval Strategy the new default. Could this potentially break backwards compatibility in terms of behavior? If so, should we stick to the Wald Interval Strategy as the default and update the documentation so that users of Deequ can choose which one they want?

@zeotuan
Copy link
Contributor Author

zeotuan commented May 20, 2024

Thank you for the PR! @zeotuan The changes look good to me. Can you fix the failing build ?

One point I would like to discuss is making the Wilson Score Confidence Interval Strategy the new default. Could this potentially break backwards compatibility in terms of behavior? If so, should we stick to the Wald Interval Strategy as the default and update the documentation so that users of Deequ can choose which one they want?

I think making Wilson Score Confidence Interval default right now might introduce some surprising changes to existing data quality pipelines. I will add example usage documentation for this.
We can potentially introduce plan to change the default in a major version update and add "deprecation" message for now so user migrate themselves.

@rdsharma26
Copy link
Contributor

rdsharma26 commented May 21, 2024

We can potentially introduce plan to change the default in a major version update and add "deprecation" message for now so user migrate themselves.

This seems like a safe approach. Could we configure the default to be the Wald strategy in the following line?

 private val defaultIntervalStrategy: ConfidenceIntervalStrategy = WilsonScoreIntervalStrategy()

The build also failed due to:

error file=/home/runner/work/deequ/deequ/src/test/scala/com/amazon/deequ/suggestions/rules/interval/IntervalStrategyTest.scala message=expected start of definition, but was Token(VAL,val,1285,val)

@@ -23,16 +23,17 @@ import com.amazon.deequ.metrics.DistributionValue
import com.amazon.deequ.profiles.ColumnProfile
import com.amazon.deequ.suggestions.ConstraintSuggestion
import com.amazon.deequ.suggestions.ConstraintSuggestionWithValue
import com.amazon.deequ.suggestions.rules.FractionalCategoricalRangeRule.defaultIntervalStrategy
import com.amazon.deequ.suggestions.rules.interval.{ConfidenceIntervalStrategy, WilsonScoreIntervalStrategy}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we avoid grouped imports and use one import per line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, do we prefer separate import or single import but with each on a single line

import com.amazon.deequ.suggestions.rules.interval.ConfidenceIntervalStrategy
import com.amazon.deequ.suggestions.rules.interval.WilsonScoreIntervalStrategy

or

import com.amazon.deequ.suggestions.rules.interval{
  ConfidenceIntervalStrategy,
  WilsonScoreIntervalStrategy
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The former. It helps with automatic resolution of merge conflicts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@rdsharma26
Copy link
Contributor

Left a comment and a unit test needs fixing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support Wilson Score Interval for RetainCompletenessRule
2 participants