Skip to content

Conversation

@gene-bordegaray
Copy link
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Feb 9, 2026
Copy link
Contributor

@NGA-TRAN NGA-TRAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR makes sense. My main comments:

  1. You may add document about pruning and filtering and why we need partition-index if not yet documented somewhere
  2. I think there is a typo in test data
  3. I am unclear whether recursively searching RepartitionExec is the best strategy and whether it will introduce new bug. Maybe some examples and explanation will help.
  4. I would ask someone that know dynamic filtering well to review this. Adrian and Lia? Maybe they will help explain the recursive walk, too

/// Determines whether partition-index routing should be used instead of CASE hash routing.
///
/// Partition-index routing is enabled when:
/// 1. The join is in `Partitioned` mode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether adding a comment here saying we do not have problem with CollectLeft with mamy partitions on the probe side is useful or not. It is because there is only one hash table and it will be used for pruning and filtering to all partitions of the probe side

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to add this, will do 👀

}
}
false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In wonder if this is the right walk. Should we only check that there is RepartitionExec right before the join? Would we introduce bugs here? Maybe drawing some examples on paper will help you know whether this is correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I don't believe so because there can be many operators between the join and the DataSourceExec. All that matters is seeing if there is some RepartitionExec in between (we then want to use CASE) or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree there can be many operators between the join and the DataSourceExec but it is always the case that if we find any RepartitionExec, it will be for the join? What happens if that repartition is for group-by not the join?

I do not know the details of how dynamic filtering is implemented but this recursive walk worries me. Is there a way to identify that the RepartitionExec is for the join? E.g. it is repartitioned on the join key?

Copy link
Contributor

@LiaCastaneda LiaCastaneda Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a simpler way to know if we are preserving file partitioning, if we are preserving file partitioning I'd say we should store this optimizer decision in the HashJoinExec node instead of recursing through the plan, similar to how we store the PartitionMode in HashJoinExec to make decisions during execution. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can look into this, but at a furst glance I think this is a great suggestion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the logic in enforce_distribution.rs but it is a bit more involved than the decision using a parititoned hash join. I have added documentation with explanation of cases and the method used


// Probe side: each partition has matching and non-matching rows.
// Partition 0: ("aa","ba",10.0) matches p0, ("zz","zz",20.0) does not match p0
// Partition 1: ("zz","zz",30.0) matches p1, ("aa","ba",40.0) does not match p1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For your tests, I think these data is ok. However, it does not clear in the context of partitioned hash join. You may want to have data that clearly define partitions for both build and probe sides and make build side smaller and scatter so you can filter data from the probe side

@github-actions github-actions bot added documentation Improvements or additions to documentation optimizer Optimizer rules common Related to common crate proto Related to proto crate labels Feb 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation optimizer Optimizer rules physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants