OptimizeProjections: prune struct-only UNNEST when outputs are unused and ancestors are multiplicity-insensitive #20668

Draft
kosiew wants to merge 7 commits into apache:main from kosiew:logical-prune-20118

kosiew commented Mar 3, 2026

Which issue does this PR close?


Rationale for this change

Logical UNNEST can sometimes be proven unnecessary when it does not contribute any required output columns and when removing it cannot change query results.

However, UNNEST is not always safe to remove:

  • List/array unnest can change row cardinality (e.g., empty lists / nulls) even if the unnested column is not referenced.
  • Some operators/expressions (e.g., count(*), certain window functions) are multiplicity-sensitive, meaning their results depend on how many rows they receive.
  • Volatile expressions (non-deterministic / time-dependent) must not have their evaluation frequency changed by cardinality rewrites.

This PR adds strict, logical-level gating so UNNEST is eliminated only when these semantic hazards are ruled out.
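To make the cardinality hazard concrete, here is a minimal toy model (not DataFusion code) of how many output rows one input row produces under each kind of unnest. The type `ListCell` and the functions `list_unnest_rows`, `struct_unnest_rows`, and `total_list_rows` are hypothetical names invented for this sketch. It assumes the documented behavior of DataFusion's `preserve_nulls` option: empty lists produce no rows, and null lists produce one row only when nulls are preserved.

```rust
/// Toy model of a list-typed cell: `None` is SQL NULL, `Some(v)` is a list.
/// Hypothetical type for illustration only.
type ListCell = Option<Vec<i32>>;

/// Rows produced by a list unnest for one input row in this model.
/// An empty list yields zero rows; a NULL list yields one row only
/// when `preserve_nulls` is true.
fn list_unnest_rows(cell: &ListCell, preserve_nulls: bool) -> usize {
    match cell {
        None => usize::from(preserve_nulls),
        Some(items) => items.len(),
    }
}

/// A struct unnest only spreads fields into columns: one row in, one row out.
fn struct_unnest_rows() -> usize {
    1
}

/// Total output rows of a list unnest over a whole column.
fn total_list_rows(cells: &[ListCell], preserve_nulls: bool) -> usize {
    cells
        .iter()
        .map(|cell| list_unnest_rows(cell, preserve_nulls))
        .sum()
}
```

In this model a row holding an empty list disappears from the output even if no ancestor reads the unnested column, so a `GROUP BY` on another column could lose a group if the unnest were removed or kept incorrectly. The struct case never changes the row count, which is why it is the only case considered for elimination.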


What changes are included in this PR?

  • Track multiplicity/volatility context through projection pruning

    • Extend RequiredIndices with:

      • multiplicity_sensitive: whether ancestor operators can observe row-multiplicity changes.
      • has_volatile_ancestor: whether any ancestor expression is volatile.
    • Propagate volatility presence from each plan node (plan.expressions().iter().any(Expr::is_volatile)) into child requirements.

    • Mark children as multiplicity-sensitive/insensitive depending on the operator context (e.g., aggregate/window/join paths).

  • Logical pruning of LogicalPlan::Unnest when safe

    • Add can_eliminate_unnest + helpers:

      • Require not multiplicity-sensitive and no volatile ancestors.
      • Disallow elimination when list/array unnest is present (conservative due to empty-list/null row-dropping semantics).
      • Ensure all required output indices are passthrough mappings to input columns via dependency_indices and qualified-field equality.
    • If eligible, replace Unnest with its input and compute child requirements using passthrough dependencies.

    • Otherwise, keep existing behavior and make the child requirement explicitly multiplicity-sensitive.

  • Tests

    • Unit tests:

      • Eliminate struct unnest when only group keys are required.
      • Keep list unnest even when only group keys are required.
      • Keep unnest when aggregates depend on multiplicity (count(1) / count(*)).
      • Keep unnest when preserve_nulls is disabled.
    • SQLLogicTest coverage (optimizer_unnest_prune.slt):

      • EXPLAIN asserts Unnest is removed only for safe struct case.
      • EXPLAIN asserts Unnest remains for list case and for multiplicity-sensitive count.
      • Correctness checks validate empty-list/null behavior and multiplicity-sensitive counts.
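The gating described under "Logical pruning of LogicalPlan::Unnest when safe" can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: `Requirements` and `UnnestSummary` are hypothetical stand-ins for the extended `RequiredIndices` and for the information derived from the `Unnest` node, and only the field names `multiplicity_sensitive` and `has_volatile_ancestor` come from the description above. The sketch does not model the `preserve_nulls` option or the qualified-field equality check.

```rust
/// Hypothetical stand-in for the PR's extended `RequiredIndices`.
#[derive(Debug, Clone)]
struct Requirements {
    /// Output column indices that ancestors actually need.
    indices: Vec<usize>,
    /// True if an ancestor can observe row-multiplicity changes.
    multiplicity_sensitive: bool,
    /// True if any ancestor expression is volatile.
    has_volatile_ancestor: bool,
}

/// Hypothetical summary of an Unnest node for this sketch.
struct UnnestSummary {
    /// True if any unnested column is a list/array.
    has_list_unnest: bool,
    /// For each output column: `Some(i)` if it is input column `i`
    /// passed through unchanged, `None` if produced by unnesting.
    passthrough: Vec<Option<usize>>,
}

/// Returns the input indices the child must provide if the Unnest can be
/// replaced by its input, or `None` if it must be kept.
fn can_eliminate_unnest(req: &Requirements, unnest: &UnnestSummary) -> Option<Vec<usize>> {
    // Any one of these hazards forces the conservative path.
    if req.multiplicity_sensitive || req.has_volatile_ancestor || unnest.has_list_unnest {
        return None;
    }
    // Every required output must map straight back to an input column;
    // a single non-passthrough (or out-of-range) index yields `None`.
    req.indices
        .iter()
        .map(|&out| unnest.passthrough.get(out).copied().flatten())
        .collect()
}
```

Returning the mapped input indices rather than a plain boolean mirrors the step where, once eligible, the child requirements are computed from the passthrough dependencies.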

Are these changes tested?

Yes.

  • Added unit tests in the optimizer module covering both positive (safe elimination) and negative (must keep) scenarios.

  • Added SQLLogicTests validating:

    • Explain-plan rewrites (Unnest absent only in the safe struct case).
    • Result correctness for empty-list/null semantics and multiplicity-sensitive aggregates.

Are there any user-facing changes?

Yes (query-plan / performance behavior).

  • In eligible queries, EXPLAIN will no longer show Unnest and execution may be faster due to avoiding unnecessary expansion.
  • No SQL syntax or API changes.
  • Semantics are preserved by construction via strict safety checks; list/array cases remain conservative.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 7 commits March 3, 2026 20:02
- Added handling for volatile expressions to impact the optimization process within the `optimize_projections` function.
- Introduced checks for volatile expressions in both plan and ancestor nodes to adjust required indices accordingly.
- Updated `RequiredIndices` struct to track whether it encounters volatile expressions and to handle multiplicity sensitivity.
- Implemented new utility functions to streamline the processing of child requirements and eliminate unnecessary unnesting when certain conditions are met.
- Added unit tests to validate the new functionality related to unnesting and aggregation on volatile expression scenarios.
- Allow elimination of unnest operation for empty lists while preserving nulls.
- Modify the `eliminate_unnest_when_only_group_keys_are_required` test case to specify struct unnest conditions.
- Introduce a new test case `keep_list_unnest_when_group_keys_are_only_required_outputs` to verify unnest behavior when only group keys are required.
- Ensure that the optimization logic correctly handles different unnest scenarios based on list and struct types.
- Introduced new SQL Logic Tests to validate unnest pruning behavior in DataFusion.
- Tests include scenarios with empty lists and null values to ensure correct handling of cardinality-sensitive cases.
- Added explanations for expected logical plans for both aggregation and selection queries.
…projections

- Removed repetitive code for handling volatile ancestors across different input plans.
- Introduced a new helper function `with_volatile_if_needed` to encapsulate the logic of conditionally adding a volatile ancestor.
- Improved code readability and maintainability by reducing duplication in `optimize_projections` method.
…city and volatility

- Introduced methods `for_multiplicity_sensitive_child` and `for_multiplicity_insensitive_child` for better handling of child requirements in `RequiredIndices`.
- Replaced usage of `with_volatile_if_needed` with `with_plan_volatile` and `with_volatile_ancestor_if` for clearer logic when managing volatile context.
- Updated `optimize_projections` function to use new methods, improving code readability and maintainability.
…ts for unnest pruning

- Updated the `rewrite_projection_given_requirements` function to enhance handling of projection requirements based on additional conditions such as projected benefit, multiplicity sensitivity, and volatile ancestors.
- Added a new SQL logic test to validate the pruning of struct unnest in cases where it is cardinality-preserving and outputs are irrelevant.
- Improved comments for clarity on unnest semantics regarding null preservation.
…te_projection_given_requirements function

This change simplifies the logic in the rewrite_projection_given_requirements function by removing the check for projection benefit, which was deemed unnecessary. This helps streamline the code and improve readability.
github-actions bot added the optimizer (Optimizer rules) and sqllogictest (SQL Logic Tests (.slt)) labels on Mar 3, 2026