Configurable RetainCompletenessRule by zeotuan · Pull Request #564 · awslabs/deequ

zeotuan · 2024-04-19T01:07:20Z

Close #340
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

zeotuan · 2024-04-19T01:20:22Z

src/main/scala/com/amazon/deequ/suggestions/rules/RetainCompletenessRule.scala

  */
-case class RetainCompletenessRule() extends ConstraintRule[ColumnProfile] {
-
+case class RetainCompletenessRule(minCompleteness: Double = 0.2, maxCompleteness: Double = 1.0) extends ConstraintRule[ColumnProfile] {


I decided not to parameterize the z-value as in the original implementation, because it is tied to one specific interval calculation technique. If possible, we can look into parameterizing the strategy used to calculate the interval (#563).
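For readers unfamiliar with the z-value under discussion, here is a minimal sketch of the lower bound of a Wald (normal-approximation) interval for an observed completeness ratio. It assumes the conventional z = 1.96 for a 95% confidence level. The object and method names are hypothetical and not part of Deequ, and the actual rule may round or post-process the value differently.

```scala
// Self-contained sketch; it does not depend on Deequ. All names are illustrative.
object WaldLowerBoundSketch {
  // Assumed z-value for a two-sided 95% confidence level, kept fixed rather than configurable.
  val Z: Double = 1.96

  /** Lower end of the Wald interval for an observed proportion p over n rows. */
  def waldLowerBound(p: Double, n: Long): Double = {
    require(n > 0, "n must be positive")
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    // Point estimate minus z times the estimated standard error of the proportion.
    p - Z * math.sqrt(p * (1.0 - p) / n)
  }
}
```

Because the z-value only makes sense together with a particular interval formula, exposing it as a standalone parameter would tie the rule's public signature to the Wald formula, which is the concern raised in the comment.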

rdsharma26 · 2024-04-30

Thanks @zeotuan
Can you trim this line to below 120 characters? It is failing checkstyle and failing the build.

Can we also store the values 0.2 and 1.0 as constants?
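The two review requests (a shorter line and named constants) could be addressed along the following lines. This is only a sketch under stated assumptions: `Bounds` is a hypothetical stand-in that does not extend Deequ's `ConstraintRule`, the constant names are invented for illustration, and the strict inequalities in `shouldBeApplied` are an assumption about the rule's semantics.

```scala
// Hypothetical stand-in for the rule in the diff; it does not depend on Deequ.
object CompletenessBoundsSketch {
  // Named defaults replace the literals 0.2 and 1.0 (constant names are assumptions).
  val DefaultMinCompleteness: Double = 0.2
  val DefaultMaxCompleteness: Double = 1.0

  // One parameter per line keeps the declaration well under 120 characters.
  case class Bounds(
      minCompleteness: Double = DefaultMinCompleteness,
      maxCompleteness: Double = DefaultMaxCompleteness) {
    // Assumed semantics: the rule applies only strictly inside the bounds.
    def shouldBeApplied(completeness: Double): Boolean =
      completeness > minCompleteness && completeness < maxCompleteness
  }
}
```

With the defaults, `Bounds()` behaves like the previous parameterless form, while a caller can tighten the range with, for example, `Bounds(minCompleteness = 0.6)`.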

zeotuan · 2024-04-22T05:05:54Z

@rdsharma26 Hi, please help review this PR.

rdsharma26

Thank you for addressing the feedback. LGTM.

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

Four commits that reference this pull request share the following squashed message:

* Configurable RetainCompletenessRule (#564)
  * Configurable RetainCompletenessRule
  * Add doc string
  * Add default completeness const
* Optional specification of instance name in CustomSQL analyzer metric. (#569)
  Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>
* Adding Wilson Score Confidence Interval Strategy (#567)
  * Configurable RetainCompletenessRule
  * Add doc string
  * Add default completeness const
  * Add ConfidenceIntervalStrategy
  * Add Separate Wilson and Wald Interval Test
  * Add License information, Fix formatting
  * Add License information
  * formatting fix
  * Update documentation
  * Make WaldInterval the default strategy for now
  * Formatting import to per line
  * Separate group import to per line import
* CustomAggregator (#572)
  * Add support for EntityTypes dqdl rule
  * Add support for Conditional Aggregation Analyzer
  Co-authored-by: Joshua Zexter <jzexter@amazon.com>
* fix typo (#574)
* Fix performance of building row-level results (#577)
  * Generate row-level results with withColumns
    Iteratively using withColumn (singular) causes performance issues when iterating over a large sequence of columns.
  * Add back UNIQUENESS_ID
* Replace 'withColumns' with 'select' (#582)
  'withColumns' was introduced in Spark 3.3, so it won't work for Deequ's <3.3 builds.
* Replace rdd with dataframe functions in Histogram analyzer (#586)
  Co-authored-by: Shriya Vanvari <svanvari@amazon.com>

Each message then ends with a build-specific entry:

* Updated version in pom.xml to 2.0.8-spark-3.4
* Match Breeze version with spark 3.3 (#562), followed by: Updated version in pom.xml to 2.0.8-spark-3.3
* Updated version in pom.xml to 2.0.8-spark-3.2
* Updated version in pom.xml to 2.0.8-spark-3.1

Co-authored-by: zeotuan <48720253+zeotuan@users.noreply.github.com>
Co-authored-by: tylermcdaniel0 <144386264+tylermcdaniel0@users.noreply.github.com>
Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>
Co-authored-by: Joshua Zexter <67130377+joshuazexter@users.noreply.github.com>
Co-authored-by: Joshua Zexter <jzexter@amazon.com>
Co-authored-by: bojackli <478378663@qq.com>
Co-authored-by: Josh <5685731+marcantony@users.noreply.github.com>
Co-authored-by: Shriya Vanvari <vanvari.shriya@gmail.com>
Co-authored-by: Shriya Vanvari <svanvari@amazon.com>

zeotuan added 2 commits April 19, 2024 11:06

Configurable RetainCompletenessRule

3b41e4c

Add doc string

ac337ea

Add default completeness const

db9b764

zeotuan requested a review from rdsharma26 May 1, 2024 08:21

rdsharma26 approved these changes May 6, 2024

rdsharma26 merged commit 49e970c into awslabs:master May 6, 2024

zeotuan deleted the TPM/RetainCompleteness branch May 9, 2024 01:26

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

Configurable RetainCompletenessRule (awslabs#564)

e935e18

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

Configurable RetainCompletenessRule (awslabs#564)

b474a37

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

Configurable RetainCompletenessRule (awslabs#564)

56053d9

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

Configurable RetainCompletenessRule (awslabs#564)

9a6713b

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

arsenalgunnershubert777 pushed a commit to arsenalgunnershubert777/deequ that referenced this pull request Nov 8, 2024

Configurable RetainCompletenessRule (awslabs#564)

572d776

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

rdsharma26 pushed a commit that referenced this pull request Dec 18, 2024

Configurable RetainCompletenessRule (#564)

6bf48da

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

SamPom100 pushed a commit that referenced this pull request Jan 16, 2025

Configurable RetainCompletenessRule (#564)

be1eb56

* Configurable RetainCompletenessRule * Add doc string * Add default completeness const

rdsharma26 merged 3 commits into awslabs:master from zeotuan:TPM/RetainCompleteness
