Propagate table constraints through physical plans to optimize sort operations #14111

Merged
berkaysynnada merged 43 commits into apache:main from
gokselk:feature/physical-planner-functional-dependence
Jan 16, 2025

@gokselk
Contributor

@gokselk gokselk commented Jan 13, 2025

Which issue does this PR close?

Closes #14110.

Rationale for this change

This PR extends the physical planner to propagate table constraints (PRIMARY KEY and UNIQUE) through the query plan. This allows us to optimize sort operations by recognizing when ordering requirements are already satisfied by existing constraints.
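The idea can be sketched roughly as follows. This is a simplified model rather than the PR's implementation: `ordering_satisfied` and its parameters are hypothetical names, columns are plain indices, every ordering is ascending, and nullability of UNIQUE columns is ignored. It only illustrates why a key inside the shared ordering prefix makes the remaining sort columns redundant.

```rust
/// Hypothetical, simplified check. Columns are identified by index and
/// all orderings are ascending; sort direction and NULL handling are ignored.
fn ordering_satisfied(
    existing: &[usize],         // columns the input is already sorted by
    required: &[usize],         // columns the query asks to sort by
    unique_keys: &[Vec<usize>], // each entry is a PRIMARY KEY / UNIQUE column set
) -> bool {
    // Length of the common prefix of the two orderings.
    let common = existing
        .iter()
        .zip(required)
        .take_while(|(e, r)| e == r)
        .count();

    // Case 1: the requirement is a prefix of the existing ordering.
    if common == required.len() {
        return true;
    }

    // Case 2: the shared prefix already contains every column of some key.
    // No two rows tie on that prefix, so later sort columns cannot matter.
    let prefix = &required[..common];
    unique_keys
        .iter()
        .any(|key| !key.is_empty() && key.iter().all(|col| prefix.contains(col)))
}

fn main() {
    // Table (a, b, c) with PRIMARY KEY (a), stored sorted by a.
    let keys = vec![vec![0]];
    // ORDER BY a, b: the key on `a` makes `b` redundant.
    assert!(ordering_satisfied(&[0], &[0, 1], &keys));
    // Same query without the key: a sort is still required.
    assert!(!ordering_satisfied(&[0], &[0, 1], &[]));
    // Input sorted by b only: the key does not help.
    assert!(!ordering_satisfied(&[1], &[0, 1], &keys));
}
```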

What changes are included in this PR?

  • Added Constraints propagation through physical plans including:
    • File scan executors (CSV, Parquet, Arrow, Avro, JSON)
    • Memory table executor
    • Aggregate executor
  • Added constraint projection logic
  • Updated EquivalenceProperties to consider constraints when evaluating sort requirements
  • Added tests for constraint propagation and sort optimization
  • Updated protobuf definitions to include constraints in physical plan serialization
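
For readers unfamiliar with constraint projection, here is a minimal sketch of what it could look like. The `Constraint` enum, `remap`, and `project_constraint` below are illustrative stand-ins written for this example, not DataFusion's actual types or signatures. The assumption is that a constraint survives a projection only if every one of its columns is still present in the output.

```rust
/// Simplified stand-in for a table constraint; column positions are indices
/// into the schema. Illustrative only, not DataFusion's own type.
#[derive(Debug, Clone, PartialEq)]
enum Constraint {
    PrimaryKey(Vec<usize>),
    Unique(Vec<usize>),
}

/// Map each key column to its first position in the projection.
/// Collecting into `Option<Vec<_>>` yields `None` if any column is missing.
fn remap(cols: &[usize], proj: &[usize]) -> Option<Vec<usize>> {
    cols.iter()
        .map(|col| proj.iter().position(|p| p == col))
        .collect()
}

/// Re-express `constraint` against the projected schema, where `proj[i]`
/// is the source column that output column `i` reads from.
fn project_constraint(constraint: &Constraint, proj: &[usize]) -> Option<Constraint> {
    match constraint {
        Constraint::PrimaryKey(cols) => remap(cols, proj).map(Constraint::PrimaryKey),
        Constraint::Unique(cols) => remap(cols, proj).map(Constraint::Unique),
    }
}

fn main() {
    let pk = Constraint::PrimaryKey(vec![0, 2]);
    // Projection [c, a]: both key columns survive at new positions.
    assert_eq!(
        project_constraint(&pk, &[2, 0]),
        Some(Constraint::PrimaryKey(vec![1, 0]))
    );
    // Projection [a, b]: column 2 is dropped, so the key is discarded.
    assert_eq!(project_constraint(&pk, &[0, 1]), None);
}
```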

Are these changes tested?

Yes, the changes include:

  • Tests for constraint projection and validation
  • Tests for sort optimization with constraints
  • sqllogictests verifying correct plan optimization

Are there any user-facing changes?

The changes are mostly internal optimizations, but users will see:

  • Improved query plans that eliminate redundant sorts
  • Updated EXPLAIN output that shows constraints in physical plans
  • More efficient execution of queries with ORDER BY on primary key columns

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jan 13, 2025
@github-actions github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate labels Jan 13, 2025
@gokselk
Contributor Author

gokselk commented Jan 13, 2025

cc: @berkaysynnada @ozankabak

@ozankabak
Contributor

I left my reviews here: synnada-ai#53 (review)

Contributor

@ozankabak ozankabak left a comment

We are very close to the finish line. Let's iterate over my comments.

Comment thread datafusion/core/src/datasource/physical_plan/file_scan_config.rs Outdated
Comment thread datafusion/physical-expr/src/equivalence/properties.rs Outdated
Comment thread datafusion/physical-expr/src/equivalence/properties.rs Outdated
match constraint {
    Constraint::PrimaryKey(indices) => {
        let new_indices =
            update_elements_with_matching_indices(indices, proj_indices);
Contributor

If you refactor the update_elements_with_matching_indices function to take two impl Iterators (you will probably need to swap the loop order to do that), this function can also accept proj_indices as an impl Iterator.

Contributor Author

The update_elements_with_matching_indices function uses .position() on proj_indices, which makes it necessary to clone it if we take it as an impl Iterator. I think this defeats the whole purpose of this refactoring.

Contributor

What I had in mind was to swap the loop order (iterate over proj_indices in the outer loop). That may enable us to use an impl Iterator for proj_indices. We will probably need to keep the type of entries as a slice because it does not have an ordering (though we can enforce that in a future PR). Had entries been ordered, I think we could have also taken it in as an impl Iterator -- but let's leave the latter for a future PR.

Contributor

We looked into this with @berkaysynnada and it seems to have some intricacies. Let's leave this to another PR.
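
For context, the loop-order swap discussed in this thread might look something like the sketch below. Both functions are hypothetical reconstructions based only on the description above (the `.position()` lookup and the "outer loop over proj_indices" idea). The body of update_elements_with_matching_indices is not shown in this conversation, so the baseline is an assumption too. Note the swapped version trades the repeated scan for a per-call slot vector.

```rust
/// Assumed baseline: for each entry, find its first position in
/// `proj_indices`. The projection is scanned once per entry, so it must be
/// a slice (or a cloneable iterator).
fn remap_entries_outer(entries: &[usize], proj_indices: &[usize]) -> Vec<usize> {
    entries
        .iter()
        .filter_map(|e| proj_indices.iter().position(|p| p == e))
        .collect()
}

/// Swapped loop order: one pass over `proj_indices`, which can now be any
/// iterator. One slot per entry records the first matching position, so the
/// output keeps the order of `entries`.
fn remap_proj_outer(
    entries: &[usize],
    proj_indices: impl IntoIterator<Item = usize>,
) -> Vec<usize> {
    let mut slots: Vec<Option<usize>> = vec![None; entries.len()];
    for (pos, src) in proj_indices.into_iter().enumerate() {
        for (slot, entry) in slots.iter_mut().zip(entries) {
            if *entry == src && slot.is_none() {
                *slot = Some(pos);
            }
        }
    }
    // Entries that never matched are dropped, as in the baseline.
    slots.into_iter().flatten().collect()
}

fn main() {
    let entries = [3, 1];
    let proj = [1, 5, 3, 1];
    // Column 3 first appears at position 2, column 1 at position 0.
    let expected = vec![2, 0];
    assert_eq!(remap_entries_outer(&entries, &proj), expected);
    assert_eq!(remap_proj_outer(&entries, proj.iter().copied()), expected);
}
```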

    .map(|col| col.index())
    .collect::<Vec<_>>();
debug_assert_eq!(mapping.map.len(), indices.len());
self.constraints.project(&indices)
Contributor

You will not need to collect and materialize indices if you refactor project to accept an iterator. See my comment in functional_dependencies.rs.

Comment thread datafusion/physical-plan/src/memory.rs Outdated
Contributor

@ozankabak ozankabak left a comment

I went through this carefully (twice) and it LGTM

@berkaysynnada
Contributor

I'll merge this PR once the main branch is all green

@berkaysynnada berkaysynnada merged commit 3cd31af into apache:main Jan 16, 2025
@berkaysynnada
Contributor

Great efforts! Thank you @gokselk

Development

Successfully merging this pull request may close these issues.

Leverage common prefix ordering constraints in physical planner

3 participants