Fix iceberg column pruning by sudsali · Pull Request #685 · awslabs/deequ

sudsali · 2026-03-25T17:51:51Z

Issue #, if available:

Description of changes:

AnalysisRunner.runScanningAnalyzers runs .agg() on the full DataFrame without column pruning. V2 DataSource connectors (Iceberg, Delta Lake) make scan-planning decisions before Spark's optimizer simplifies the plan, causing full table reads on wide tables.

- Add columnsReferenced() method to the Analyzer trait (default None: safe fallback)
- Override it in all scanning analyzers to declare their column dependencies
- Add pruneColumns() in AnalysisRunner, which selects only the needed columns before .agg()
- Pruning is automatically disabled when any analyzer has a WHERE clause or a free-form SQL predicate
- All 1018 tests pass, including new tests for column pruning, WHERE clause fallback, multi-column analyzers, and duplicate column handling
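
The pruning decision described in the list above can be sketched in plain Scala. This is a simplified illustration, not the code in this PR: the Analyzer trait here is a stand-in for deequ's, the Option[Seq[String]] return type, the where field, and the three example analyzers are assumptions made for the sketch, and pruneColumns returns a column list instead of a DataFrame so that it runs without Spark.

```scala
// Simplified stand-in for deequ's Analyzer trait (hypothetical shape).
trait Analyzer {
  // None means "dependencies unknown"; the safe fallback is to skip pruning.
  def columnsReferenced(): Option[Seq[String]] = None
  // Optional row filter; in this sketch any filter disables pruning.
  def where: Option[String] = None
}

// Single-column analyzer that declares its dependency.
case class Completeness(column: String, override val where: Option[String] = None)
    extends Analyzer {
  override def columnsReferenced(): Option[Seq[String]] = Some(Seq(column))
}

// Multi-column analyzer.
case class Correlation(first: String, second: String) extends Analyzer {
  override def columnsReferenced(): Option[Seq[String]] = Some(Seq(first, second))
}

// Free-form SQL predicate: columns are not declared, so the default None applies.
case class Compliance(predicate: String) extends Analyzer

object ColumnPruning {
  // Returns Some(columns to select) when pruning is safe, None to keep the full DataFrame.
  def pruneColumns(analyzers: Seq[Analyzer]): Option[Seq[String]] = {
    val hasFilter = analyzers.exists(_.where.isDefined)
    val declared = analyzers.map(_.columnsReferenced())
    // Assumption of this sketch: an empty analyzer list also skips pruning.
    if (analyzers.isEmpty || hasFilter || declared.exists(_.isEmpty)) None
    else Some(declared.flatMap(_.getOrElse(Seq.empty)).distinct) // drop duplicate columns
  }
}
```

In a Spark setting, a Some(cols) result would presumably feed something like data.select(cols.map(col): _*) ahead of the .agg() call, so that a V2 connector sees only the required columns at scan-planning time; a None result leaves the DataFrame untouched.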

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

fix: Add column pruning in AnalysisRunner for V2 DataSource connectors

ec5fbd6

sudsali force-pushed the fix-iceberg-column-pruning branch from a097ae7 to ec5fbd6 on March 25, 2026 17:55

sudsali wants to merge 1 commit into awslabs:master from sudsali:fix-iceberg-column-pruning
