-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: change Expr OuterReferenceColumn and Alias to Box type for reducing expr struct size #16771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhuqi-lucas -- this looks quite nice
I think it is an API change so I will mark the PR as such and I think it is a good improvement
However, given it is an API change I think we should
- Wait until DataFusion 50.0.0 (let's not make any more changes in DataFusion 49 that we are starting to prepare for)
- Update the upgrading guide in https://datafusion.apache.org/library-user-guide/upgrading.html to explain to people hw to change things
| 10)--------------Projection: generate_series(Int64(1), CAST(outer_ref(t1.t1_int) AS Int64)) AS __unnest_placeholder(generate_series(Int64(1),outer_ref(t1.t1_int))) | ||
| 11)----------------EmptyRelation | ||
| physical_plan_error This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(UInt32, Column { relation: Some(Bare { table: "t1" }), name: "t1_int" }) | ||
| physical_plan_error This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn((UInt32, Column { relation: Some(Bare { table: "t1" }), name: "t1_int" })) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know why the plan changes? I don't think the extra () really adds much -- can we restore the original?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to OuterReference in latest PR, so it will use OuterReference instead of this change now, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next step, we may can add debug/fmt for OuterReference to restore the original behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restore all slt in latest PR, no diff from main branch.
datafusion/expr/src/expr.rs
Outdated
| // `Box`ing the fields to make `Expr` smaller | ||
| // See https://github.com/apache/datafusion/issues/16199 for details | ||
| assert_eq!(size_of::<Expr>(), 128); | ||
| assert_eq!(size_of::<Expr>(), 112); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change saves 16 bytes per Expr. Nice.
I will run some planning benchmarks and see if we can see any difference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb , i am doing more amazing experiment, try to reduce from 128 to 80, so we can save 48 bytes per Expr!
| pub enum Expr { | ||
| /// An expression with a specific name. | ||
| Alias(Alias), | ||
| Alias(Box<Alias>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another change that might be a bit less impactful would be to box the fields of Alias instead
So Alias {
expr: Box
..
}
It may be just as bad / worse though
datafusion/expr/src/expr.rs
Outdated
| /// A placeholder which holds a reference to a qualified field | ||
| /// in the outer query, used for correlated sub queries. | ||
| OuterReferenceColumn(DataType, Column), | ||
| OuterReferenceColumn(Box<(DataType, Column)>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to make this change anyways, can we also pull this into a named struct rather than a unnamed tuple
like
struct OuterReference {
// fields here
}
enum Expr {
...
OuterReferenceColumn(OuterReference),
...
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb for this good suggestion, addressed in latest PR.
|
🤖 |
|
🤖: Benchmark completed Details
|
Thank you @alamb for review and i agree that we can do this for Datafusion 50.0.0, and since we have enough time, i want to do more improvement, try to reduce Expr more, from size 128 to 80, i am in progress now. |
|
Updated: Successfully changed the size from 128 to 80 in latest PR. |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
Thank you @alamb for benchmark, interesting, it seems the performance decrease from 112 size to 80, i will investigate it, if i can't find the root cause, i will revert to 112 bytes first! |
|
Strange, i can't reproduce this in my local: with_param_values_many_columns 1.00 129.8±0.74µs ? ?/sec 1.06 137.2±0.79µs |
kosiew
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Just some nit for your review.
| Expr::Alias(alias_box) => { | ||
| // alias_box: Box<Alias> | ||
| let Alias { expr, name, .. } = *alias_box; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - use boxed_alias instead, for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @kosiew , addressed in latest PR.
| if let Expr::Alias(alias_box) = expr { | ||
| // alias_box: &Box<Alias>, so alias_box.as_ref() is &Alias | ||
| let alias: &Alias = alias_box.as_ref(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above - boxed_alias
datafusion/expr/src/expr_schema.rs
Outdated
| Expr::Alias(alias_box) => { | ||
| let Alias { expr, .. } = alias_box.as_ref(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
boxed_alias
datafusion/expr/src/utils.rs
Outdated
| stack.push(*left); | ||
| } | ||
| Expr::Alias(Alias { expr, .. }) => stack.push(*expr), | ||
| Expr::Alias(alias_box) => stack.push(*alias_box.expr), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
boxed_alias
Thank you @kosiew for review and suggestion, addressed in latest PR. |
|
After opening the DF50.0.0 release issue, you can add it to the list |
I will rerun |
|
🤖 |
Box three large variants in the Expr enum to reduce its memory footprint: - Alias(Alias) -> Alias(Box<Alias>) - Column(Column) -> Column(Box<Column>) - OuterReferenceColumn(FieldRef, Column) -> OuterReferenceColumn(Box<OuterReference>) This reduces the Expr enum size from 112 bytes to 80 bytes (28% reduction), which improves cache locality and reduces memory usage across the query engine since Expr is one of the most frequently used types. Introduces a new OuterReference named struct to replace the tuple variant, improving readability and maintainability. Based on apache#16771, rebased and updated against the latest master. https://claude.ai/code/session_018FV5mepXRLnBVzzWv8ihgb
569079b to
b5716d8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR reduces the size of the Expr enum from 128 bytes to 80 bytes by boxing the Alias and OuterReferenceColumn variants. This is a continuation of ongoing efforts to optimize memory usage in DataFusion's expression system.
Changes:
- Changed
Expr::Alias(Alias)toExpr::Alias(Box<Alias>) - Changed
Expr::OuterReferenceColumn(FieldRef, Column)toExpr::OuterReferenceColumn(Box<OuterReference>)whereOuterReferenceis a new struct containing the field and column - Updated all pattern matching and construction sites throughout the codebase to work with the boxed types
- Updated test expectations for error messages that display
OuterReferenceColumn - Updated the size test to verify the new 80-byte size
Reviewed changes
Copilot reviewed 58 out of 58 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| datafusion/expr/src/expr.rs | Core type changes: boxed Alias and OuterReferenceColumn, added OuterReference struct, updated size test |
| datafusion/expr/src/lib.rs | Exported OuterReference type |
| datafusion/expr/src/*.rs | Updated pattern matching and expression construction for boxed types |
| datafusion/optimizer/src/*.rs | Updated optimizer rules to work with boxed types |
| datafusion/sql/src/*.rs | Updated SQL planning to construct boxed expressions |
| datafusion/substrait/src/*.rs | Updated Substrait conversion to handle boxed types |
| datafusion/proto/src/*.rs | Updated protobuf serialization/deserialization |
| datafusion/physical-expr/src/*.rs | Updated physical expression planning |
| datafusion/core/src/*.rs | Updated core dataframe and execution code |
| datafusion/functions*/src/*.rs | Updated function implementations |
| datafusion/**/tests/*.rs | Updated test code and test expectations |
| datafusion/sqllogictest/test_files/*.slt | Updated expected error messages for OuterReferenceColumn display format |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
run sql_planner |
|
run benchmark sql_planner |
|
🤖 |
|
I revive this PR by claude help. cc @alamb And let's see if benchmark will show improvement. |
|
🤖: Benchmark completed Details
|
|
Seems still no improvement in latest benchmark. |
edf1eaf to
3941aba
Compare
|
run benchmark sql_planner |
|
run benchmark sql_planner |
|
show benchmark queue |
|
🤖 Hi @alamb, you asked to view the benchmark queue (#16771 (comment)).
|
|
Thanks @alamb, i see, there are already 2 in queue.
|
👍 -- I just wanted to double check that the runner hadn't gotten stuck again (it sometimes does break / get restarted and needs manual intervention) |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
Which issue does this PR close?
Continue to reduce the Expr struct size.
Rationale for this change
Continue to reduce the Expr struct size.
What changes are included in this PR?
Continue to reduce the Expr struct size.
Are these changes tested?
Yes
The size reduce from 128 to 112
Updated:
Continue reduce to 80 now!
Are there any user-facing changes?
No