Fix iceberg column pruning #685

Open
sudsali wants to merge 1 commit into awslabs:master from sudsali:fix-iceberg-column-pruning
@sudsali (Contributor) commented Mar 25, 2026

Issue #, if available:

#667

Description of changes:

AnalysisRunner.runScanningAnalyzers runs .agg() on the full DataFrame without column pruning. V2 DataSource connectors (Iceberg, Delta Lake) make scan-planning decisions before Spark's optimizer simplifies the plan, causing full table reads on wide tables.

  • Add columnsReferenced() method to Analyzer trait (default None, a safe fallback that leaves pruning off)
  • Override in all scanning analyzers to declare their column dependencies
  • Add pruneColumns() in AnalysisRunner that selects only needed columns before .agg()
  • Pruning is automatically disabled when any analyzer has a WHERE clause or free-form SQL predicate
  • All 1018 tests pass including new tests for column pruning, where clause fallback, multi-column analyzers, and duplicate column handling
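
The pruning decision described in the list above can be sketched without Spark. This is a minimal, hypothetical model, not the code in this PR: the simplified `Analyzer` trait, the three case classes, and the `PruningSketch.columnsToSelect` helper are illustrative names, and the rule that a WHERE clause makes an analyzer return `None` is an assumption about how the fallback is wired.

```scala
// Hypothetical, Spark-free model of the pruning decision (not the PR's actual code).
trait Analyzer {
  // None means "columns unknown", so the caller must not prune.
  def columnsReferenced: Option[Set[String]] = None
}

final case class Completeness(column: String, where: Option[String] = None)
    extends Analyzer {
  // Assumption: a WHERE filter may touch arbitrary columns, so fall back to None.
  override def columnsReferenced: Option[Set[String]] =
    if (where.isEmpty) Some(Set(column)) else None
}

final case class Correlation(first: String, second: String) extends Analyzer {
  // Multi-column analyzer: declares both inputs.
  override def columnsReferenced: Option[Set[String]] = Some(Set(first, second))
}

// Free-form SQL predicate: keeps the trait default (None).
final case class Compliance(name: String, predicate: String) extends Analyzer

object PruningSketch {
  // Some(sorted, de-duplicated columns) only if every analyzer declared its
  // columns; None as soon as one analyzer cannot, or if there is nothing to select.
  def columnsToSelect(analyzers: Seq[Analyzer]): Option[Seq[String]] =
    analyzers
      .foldLeft(Option(Set.empty[String])) { (acc, analyzer) =>
        for (cols <- acc; more <- analyzer.columnsReferenced) yield cols ++ more
      }
      .filter(_.nonEmpty)
      .map(_.toSeq.sorted)
}
```

In a runner, a `Some(cols)` result would presumably be applied as `data.select(cols.map(col): _*)` before the `.agg()` call, while `None` would leave the DataFrame untouched. Using a `Set` for the union is what makes repeated column names across analyzers harmless.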

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sudsali force-pushed the fix-iceberg-column-pruning branch from a097ae7 to ec5fbd6 on March 25, 2026 17:55