-
Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: improve performance of array_union/array_intersect with batched row conversion
#20243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| let r_start = r_offsets[i].as_usize(); | ||
| let r_end = r_offsets[i + 1].as_usize(); | ||
|
|
||
| let mut count = 0usize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count can be declared out of cycle and reused?
perhaps we can find a better name and clarify count of what the variable is storing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count can be determined solely from the seen size too I believe? instead of incrementing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, count is redundant here since we can just use seen size.
| seen.clear(); | ||
| r_set.clear(); | ||
|
|
||
| match set_op { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we get more performance gains by moving this match outside the hot loop? Or making it a const generic for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, using const generics could definitely give us a nice performance boost.
group optimized optimized_const_generic
----- --------- ----------
array_intersect/high_overlap/10 1.79 800.7±51.79µs ? ?/sec 1.00 446.9±15.17µs ? ?/sec
array_intersect/high_overlap/100 1.77 8.2±0.13ms ? ?/sec 1.00 4.6±0.08ms ? ?/sec
array_intersect/high_overlap/50 1.77 4.0±0.06ms ? ?/sec 1.00 2.3±0.07ms ? ?/sec
array_intersect/low_overlap/10 1.70 570.4±53.84µs ? ?/sec 1.00 335.3±4.74µs ? ?/sec
array_intersect/low_overlap/100 1.62 6.7±0.27ms ? ?/sec 1.00 4.2±0.07ms ? ?/sec
array_intersect/low_overlap/50 1.71 3.4±0.44ms ? ?/sec 1.00 1993.1±23.05µs ? ?/sec
array_union/high_overlap/10 1.62 548.4±30.79µs ? ?/sec 1.00 337.6±8.12µs ? ?/sec
array_union/high_overlap/100 2.06 7.5±2.17ms ? ?/sec 1.00 3.6±0.10ms ? ?/sec
array_union/high_overlap/50 1.53 2.8±0.06ms ? ?/sec 1.00 1805.2±72.23µs ? ?/sec
array_union/low_overlap/10 1.88 718.8±148.49µs ? ?/sec 1.00 382.7±15.45µs ? ?/sec
array_union/low_overlap/100 1.67 6.9±0.21ms ? ?/sec 1.00 4.1±0.13ms ? ?/sec
array_union/low_overlap/50 1.70 3.5±0.10ms ? ?/sec 1.00 2.0±0.06ms ? ?/sec
| let r_start = r_offsets[i].as_usize(); | ||
| let r_end = r_offsets[i + 1].as_usize(); | ||
|
|
||
| let mut count = 0usize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count can be determined solely from the seen size too I believe? instead of incrementing it
| ); | ||
| } | ||
|
|
||
| fn invoke_array_intersect( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function look exactly the same as invoke_array_union(). Maybe drop one of them ?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. thanks for the suggestion.
| ); | ||
| } | ||
|
|
||
| for &array_size in ARRAY_SIZES { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the two for &array_size in ARRAY_SIZES loops could be simplified into one by using another outer/inner loop: for (overlap_label, overlap_ratio) in &[("high_overlap", 0.8), ("low_overlap", 0.2)] { ... }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
|
||
| let mut result_offsets = Vec::with_capacity(l.len() + 1); | ||
| result_offsets.push(OffsetSize::usize_as(0)); | ||
| let mut final_rows = Vec::with_capacity(rows_l.num_rows()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for SetOp::Intercept the capacity could be optimised to min(rows_l.num_rows(), rows_r.num_rows())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the capacity for SetOp::Intersect to min(rows_l.num_rows(), rows_r.num_rows()).
| return internal_err!("{set_op}: failed to get array from rows"); | ||
| SetOp::Intersect => { | ||
| // Build hash set from right array for lookup table | ||
| // then iterator left array to find common elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // then iterator left array to find common elements. | |
| // then iterate left array to find common elements. |
| let overlap_positions = &positions[..overlap_count]; | ||
|
|
||
| for i in 0..array_size { | ||
| if overlap_positions.contains(&i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slice::contains() is O(n) (linear search). Using a HashSet would be O(1), but create_arrays_with_overlap() is called before group.bench_with_input(...), so maybe it is OK.
let overlap_positions: std::collections::HashSet<_> =
positions[..overlap_count].iter().copied().collect();There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| return internal_err!("{set_op}: failed to get array from rows"); | ||
| SetOp::Intersect => { | ||
| // Build hash set from right array for lookup table | ||
| // then iterator left array to find common elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be faster to create the HashSet from the shorter array and iterate over the longer one. This would minimise the memory usage for the hash set and can reduce the number of hash operations, especially when there's a significant size difference between the two arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now building the HashSet from the shorter array and iterating over the longer one, good catch.
| )] | ||
| #[derive(Debug, PartialEq, Eq, Hash)] | ||
| pub(super) struct ArrayIntersect { | ||
| pub struct ArrayIntersect { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now public to be able to use it in the benchmark test (a separate crate).
Maybe annotate it with #[doc(hidden)] to hide it from the end users, since it is not really supposed to be part of the public APIs ?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrayIntersect already has #[user_doc], and keeping it visible aligns with how other user-facing SQL functions are exposed?
|
Thanks everyone for the reviews. I've addressed all the feedback. Please let me know if anything else needs adjustment. |
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lyne7-sc looks like it is fine now
test changed because array_ doesnt preserve order like in https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_intersect.html?utm_source=chatgpt.com
However in future it could be a reason for flaky tests.
Thanks @martin-g and @Jefffrey for the review
Which issue does this PR close?
Rationale for this change
The current implementation of
array_unionandarray_intersectperformsRowConverter::convert_columns()on a per-row basis, which introduces avoidable overhead due to repeated conversions and intermediate allocations.This PR improves performance by:
sorted().dedup()pattern in favor of hash-based set operationsWhat changes are included in this PR?
Refactored the internal set operation implementation to use batch row conversion and a single-pass construction of result arrays.
Benchmarks
Are these changes tested?
Yes. Existing SQL logic tests updated to reflect new output order.
Are there any user-facing changes?
Yes. The output order may differ from the previous implementation.
Previously, results were implicitly sorted due to the use of
sorted().dedup(). The new implementation preserves the order of first appearance within each list.This is a user-visible behavioral change, but it is consistent with typical SQL set operation semantics, which do not guarantee a specific output order.