Conversation
Hi Valentin, great to see you working on Deequ! We have something like this already in the profiler, but it does not use the Aggregator API; maybe you want to include it in your PR as well. Here is the corresponding place in the code:
Hi Sebastian :-) Thanks for the pointer! This would fit in well, I think. Ideally, we could support any of the existing metrics on a per-group basis. I'll play around to see if I can integrate this with the existing Analyzer API. The other option is to move all current code onto the Aggregator API, but that would be a bigger change and I'm not sure whether there are any downsides. Do you have thoughts on this? Also, one would need a user API to specify e.g. constraints per group, similar to the
I don't think we should run all aggregations via the Aggregator API, because some aggregations might run on high-cardinality columns (e.g. testing whether a key column contains no duplicates).
Thanks. Just to clarify: we can use the Aggregator API without any grouping, and it will work exactly like the existing metrics that we compute. The problem I mentioned only occurs if you use it to compute a metric specifically for each group. Anyway, I'll iterate on this a bit.
Description of changes:
This is a small work-in-progress POC to test how we can compute metrics per group in one pass, without doing an actual groupBy (see e.g. #149).
The idea is to use an Aggregator that keeps track of aggregations per group of a column. For this example I used the Spark Aggregator API. It could perhaps also be implemented using the existing Analyzer with some changes to the current code (one has to be able to access the underlying aggregation function, the state, etc. in the wrapper that aggregates per group). One issue could be that the aggregator state may become large if there are many distinct values in the groupBy column, but I think the use case is low-cardinality columns anyway.
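To make the idea concrete, here is a minimal sketch (not the actual PR code) of the per-group state such an Aggregator could carry, using completeness as the example metric. The names `NumMatchesAndCount` and `GroupedCompleteness` are hypothetical; the `reduce`/`merge` functions mirror what Spark's `Aggregator.reduce` and `Aggregator.merge` would do, but Spark itself is omitted so the logic stands alone:

```scala
// Hypothetical per-group completeness state: rows where the target column
// was non-null vs. total rows seen, mirroring Deequ's completeness metric.
case class NumMatchesAndCount(matches: Long, count: Long) {
  def +(other: NumMatchesAndCount): NumMatchesAndCount =
    NumMatchesAndCount(matches + other.matches, count + other.count)
  // Completeness for this group, NaN if no rows were seen.
  def metric: Double = if (count == 0) Double.NaN else matches.toDouble / count
}

object GroupedCompleteness {
  // The aggregation buffer: one sub-state per distinct group value.
  // This is the state that grows with the cardinality of the groupBy column.
  type State = Map[String, NumMatchesAndCount]

  // reduce: fold one row (group value, possibly-null target value) into the state.
  def reduce(state: State, group: String, value: Option[String]): State = {
    val update = NumMatchesAndCount(if (value.isDefined) 1 else 0, 1)
    state.updated(group, state.getOrElse(group, NumMatchesAndCount(0, 0)) + update)
  }

  // merge: combine partial states computed on different partitions.
  def merge(a: State, b: State): State =
    b.foldLeft(a) { case (acc, (g, s)) =>
      acc.updated(g, acc.getOrElse(g, NumMatchesAndCount(0, 0)) + s)
    }
}
```

Because the whole map is a single aggregation buffer, one pass over the data yields a metric per group, which is exactly why the state can grow for high-cardinality groupBy columns.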
Let me know your thoughts.
Running the example gives:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.