Configurable RetainCompletenessRule #564

Merged
rdsharma26 merged 3 commits into awslabs:master from zeotuan:TPM/RetainCompleteness
May 6, 2024

@zeotuan
Contributor

@zeotuan zeotuan commented Apr 19, 2024

Close #340
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

 */
-case class RetainCompletenessRule() extends ConstraintRule[ColumnProfile] {
+case class RetainCompletenessRule(minCompleteness: Double = 0.2, maxCompleteness: Double = 1.0) extends ConstraintRule[ColumnProfile] {
Contributor Author

I decided not to parameterize the z-value as in the original implementation, because it is tied to one specific interval calculation technique. If possible, we can look into parameterizing the strategy used to calculate the interval in #563.
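For readers following the thread, here is a minimal sketch of the idea under discussion: the completeness bounds become constructor parameters while the z-value stays a fixed constant. It is not the merged Deequ source. `ProfileStub` and `RetainCompletenessSketch` are hypothetical stand-ins, and the strict comparisons and the Wald-style lower bound rounded down to two decimals are assumptions made for illustration.

```scala
// Hypothetical stand-in for a column profile; only the fields used here.
final case class ProfileStub(column: String, completeness: Double)

// Sketch of a rule with configurable bounds (assumed shape, not Deequ's class).
final case class RetainCompletenessSketch(
    minCompleteness: Double = 0.2,
    maxCompleteness: Double = 1.0) {

  // Kept as a fixed constant on purpose: it belongs to the interval technique,
  // not to the rule's public configuration. 1.96 is the usual 95% normal quantile.
  private val z = 1.96

  // Assumption: the rule applies only strictly inside the configured bounds.
  def shouldBeApplied(profile: ProfileStub): Boolean =
    profile.completeness > minCompleteness && profile.completeness < maxCompleteness

  // Lower end of a Wald interval, p - z * sqrt(p * (1 - p) / n),
  // rounded down to two decimal places.
  def targetCompleteness(profile: ProfileStub, numRecords: Long): Double = {
    val p = profile.completeness
    val lower = p - z * math.sqrt(p * (1 - p) / numRecords)
    BigDecimal(lower).setScale(2, BigDecimal.RoundingMode.DOWN).toDouble
  }
}
```

With this shape, a caller who wants suggestions only for columns that are at least 60% complete could write `RetainCompletenessSketch(minCompleteness = 0.6)` and leave the upper bound at its default.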

Contributor

Thanks @zeotuan
Can you trim this line to below 120 characters? It is failing checkstyle and failing the build.

Contributor

Can we also store the values 0.2 and 1.0 as constants?
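One way to act on both review comments at once is sketched below: the defaults move into named constants, and the parameter list is split across lines so the declaration stays well under 120 characters. The names `CompletenessDefaults`, `DefaultMinCompleteness`, `DefaultMaxCompleteness`, and `BoundedCompletenessRule` are hypothetical and chosen for illustration; the identifiers in the merged commit may differ.

```scala
// Hypothetical holder for the default bounds, so the literals 0.2 and 1.0
// appear in exactly one place.
object CompletenessDefaults {
  val DefaultMinCompleteness: Double = 0.2
  val DefaultMaxCompleteness: Double = 1.0
}

// Illustrative rule signature: one parameter per line keeps each line short
// enough for a 120-character style check.
final case class BoundedCompletenessRule(
    minCompleteness: Double = CompletenessDefaults.DefaultMinCompleteness,
    maxCompleteness: Double = CompletenessDefaults.DefaultMaxCompleteness)
```

Callers that rely on the defaults are unaffected by this change, since `BoundedCompletenessRule()` still resolves to the same bounds.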

@zeotuan
Contributor Author

zeotuan commented Apr 22, 2024

Hi @rdsharma26, please help review this PR.

@zeotuan zeotuan requested a review from rdsharma26 May 1, 2024 08:21
Contributor

@rdsharma26 rdsharma26 left a comment

Thank you for addressing the feedback. LGTM.

@rdsharma26 rdsharma26 merged commit 49e970c into awslabs:master May 6, 2024
@zeotuan zeotuan deleted the TPM/RetainCompleteness branch May 9, 2024 01:26
eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <jzexter@amazon.com>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <svanvari@amazon.com>

* Updated version in pom.xml to 2.0.8-spark-3.4

---------

Co-authored-by: zeotuan <48720253+zeotuan@users.noreply.github.com>
Co-authored-by: tylermcdaniel0 <144386264+tylermcdaniel0@users.noreply.github.com>
Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>
Co-authored-by: Joshua Zexter <67130377+joshuazexter@users.noreply.github.com>
Co-authored-by: Joshua Zexter <jzexter@amazon.com>
Co-authored-by: bojackli <478378663@qq.com>
Co-authored-by: Josh <5685731+marcantony@users.noreply.github.com>
Co-authored-by: Shriya Vanvari <vanvari.shriya@gmail.com>
Co-authored-by: Shriya Vanvari <svanvari@amazon.com>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <jzexter@amazon.com>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <svanvari@amazon.com>

* Match Breeze version with spark 3.3 (#562)

* Updated version in pom.xml to 2.0.8-spark-3.3

---------

Co-authored-by: zeotuan <48720253+zeotuan@users.noreply.github.com>
Co-authored-by: tylermcdaniel0 <144386264+tylermcdaniel0@users.noreply.github.com>
Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>
Co-authored-by: Joshua Zexter <67130377+joshuazexter@users.noreply.github.com>
Co-authored-by: Joshua Zexter <jzexter@amazon.com>
Co-authored-by: bojackli <478378663@qq.com>
Co-authored-by: Josh <5685731+marcantony@users.noreply.github.com>
Co-authored-by: Shriya Vanvari <vanvari.shriya@gmail.com>
Co-authored-by: Shriya Vanvari <svanvari@amazon.com>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <jzexter@amazon.com>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <svanvari@amazon.com>

* Updated version in pom.xml to 2.0.8-spark-3.2

---------

Co-authored-by: zeotuan <48720253+zeotuan@users.noreply.github.com>
Co-authored-by: tylermcdaniel0 <144386264+tylermcdaniel0@users.noreply.github.com>
Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>
Co-authored-by: Joshua Zexter <67130377+joshuazexter@users.noreply.github.com>
Co-authored-by: Joshua Zexter <jzexter@amazon.com>
Co-authored-by: bojackli <478378663@qq.com>
Co-authored-by: Josh <5685731+marcantony@users.noreply.github.com>
Co-authored-by: Shriya Vanvari <vanvari.shriya@gmail.com>
Co-authored-by: Shriya Vanvari <svanvari@amazon.com>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <jzexter@amazon.com>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <svanvari@amazon.com>

* Updated version in pom.xml to 2.0.8-spark-3.1

---------

Co-authored-by: zeotuan <48720253+zeotuan@users.noreply.github.com>
Co-authored-by: tylermcdaniel0 <144386264+tylermcdaniel0@users.noreply.github.com>
Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>
Co-authored-by: Joshua Zexter <67130377+joshuazexter@users.noreply.github.com>
Co-authored-by: Joshua Zexter <jzexter@amazon.com>
Co-authored-by: bojackli <478378663@qq.com>
Co-authored-by: Josh <5685731+marcantony@users.noreply.github.com>
Co-authored-by: Shriya Vanvari <vanvari.shriya@gmail.com>
Co-authored-by: Shriya Vanvari <svanvari@amazon.com>
arsenalgunnershubert777 pushed a commit to arsenalgunnershubert777/deequ that referenced this pull request Nov 8, 2024
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const
rdsharma26 pushed a commit that referenced this pull request Dec 18, 2024
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const
SamPom100 pushed a commit that referenced this pull request Jan 16, 2025
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const