Skip to content

Conversation

@lyne7-sc
Copy link
Contributor

@lyne7-sc lyne7-sc commented Feb 9, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

The current implementation of array_union and array_intersect performs RowConverter::convert_columns() on a per-row basis, which introduces avoidable overhead due to repeated conversions and intermediate allocations.

This PR improves performance by:

  1. converting all list values to rows in a batch
  2. reusing hash sets across iterations
  3. removing the sorted().dedup() pattern in favor of hash-based set operations

What changes are included in this PR?

Refactored the internal set operation implementation to use batch row conversion and a single-pass construction of result arrays.

Benchmarks

group                               before                                  optimized
-----                               ------                                  ---------
array_intersect/high_overlap/10     2.99  1442.0±99.94µs        ? ?/sec     1.00   481.6±21.45µs        ? ?/sec
array_intersect/high_overlap/100    1.90      9.5±0.63ms        ? ?/sec     1.00      5.0±0.09ms        ? ?/sec
array_intersect/high_overlap/50     2.01      5.3±0.41ms        ? ?/sec     1.00      2.6±0.05ms        ? ?/sec
array_intersect/low_overlap/10      3.47  1288.1±72.39µs        ? ?/sec     1.00   371.4±14.08µs        ? ?/sec
array_intersect/low_overlap/100     2.35      9.2±0.43ms        ? ?/sec     1.00      3.9±0.08ms        ? ?/sec
array_intersect/low_overlap/50      2.45      5.1±0.41ms        ? ?/sec     1.00      2.1±0.07ms        ? ?/sec
array_union/high_overlap/10         4.01  1593.1±292.17µs        ? ?/sec    1.00   396.9±13.43µs        ? ?/sec
array_union/high_overlap/100        2.54      9.8±0.18ms        ? ?/sec     1.00      3.9±0.11ms        ? ?/sec
array_union/high_overlap/50         2.65      5.4±0.10ms        ? ?/sec     1.00      2.0±0.07ms        ? ?/sec
array_union/low_overlap/10          3.74  1622.7±96.50µs        ? ?/sec     1.00   434.3±17.87µs        ? ?/sec
array_union/low_overlap/100         2.39     10.3±0.92ms        ? ?/sec     1.00      4.3±0.11ms        ? ?/sec
array_union/low_overlap/50          2.63      5.8±0.27ms        ? ?/sec     1.00      2.2±0.11ms        ? ?/sec

Are these changes tested?

Yes. Existing SQL logic tests updated to reflect new output order.

Are there any user-facing changes?

Yes. The output order may differ from the previous implementation.

Previously, results were implicitly sorted due to the use of sorted().dedup(). The new implementation preserves the order of first appearance within each list.

This is a user-visible behavioral change, but it is consistent with typical SQL set operation semantics, which do not guarantee a specific output order.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 9, 2026
let r_start = r_offsets[i].as_usize();
let r_end = r_offsets[i + 1].as_usize();

let mut count = 0usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count can be declared out of cycle and reused?
perhaps we can find a better name and clarify count of what the variable is storing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count can be determined solely from the seen size too I believe? instead of incrementing it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, count is redundant here since we can just use seen size.

seen.clear();
r_set.clear();

match set_op {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get more performance gains by moving this match outside the hot loop? Or making it a const generic for example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, using const generics could definitely give us a nice performance boost.

group                               optimized                              optimized_const_generic
-----                               ---------                              ----------
array_intersect/high_overlap/10     1.79   800.7±51.79µs        ? ?/sec    1.00   446.9±15.17µs        ? ?/sec
array_intersect/high_overlap/100    1.77      8.2±0.13ms        ? ?/sec    1.00      4.6±0.08ms        ? ?/sec
array_intersect/high_overlap/50     1.77      4.0±0.06ms        ? ?/sec    1.00      2.3±0.07ms        ? ?/sec
array_intersect/low_overlap/10      1.70   570.4±53.84µs        ? ?/sec    1.00    335.3±4.74µs        ? ?/sec
array_intersect/low_overlap/100     1.62      6.7±0.27ms        ? ?/sec    1.00      4.2±0.07ms        ? ?/sec
array_intersect/low_overlap/50      1.71      3.4±0.44ms        ? ?/sec    1.00  1993.1±23.05µs        ? ?/sec
array_union/high_overlap/10         1.62   548.4±30.79µs        ? ?/sec    1.00    337.6±8.12µs        ? ?/sec
array_union/high_overlap/100        2.06      7.5±2.17ms        ? ?/sec    1.00      3.6±0.10ms        ? ?/sec
array_union/high_overlap/50         1.53      2.8±0.06ms        ? ?/sec    1.00  1805.2±72.23µs        ? ?/sec
array_union/low_overlap/10          1.88  718.8±148.49µs        ? ?/sec    1.00   382.7±15.45µs        ? ?/sec
array_union/low_overlap/100         1.67      6.9±0.21ms        ? ?/sec    1.00      4.1±0.13ms        ? ?/sec
array_union/low_overlap/50          1.70      3.5±0.10ms        ? ?/sec    1.00      2.0±0.06ms        ? ?/sec

let r_start = r_offsets[i].as_usize();
let r_end = r_offsets[i + 1].as_usize();

let mut count = 0usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count can be determined solely from the seen size too I believe? instead of incrementing it

);
}

fn invoke_array_intersect(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function look exactly the same as invoke_array_union(). Maybe drop one of them ?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. thanks for the suggestion.

);
}

for &array_size in ARRAY_SIZES {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the two for &array_size in ARRAY_SIZES loops could be simplified into one by using another outer/inner loop: for (overlap_label, overlap_ratio) in &[("high_overlap", 0.8), ("low_overlap", 0.2)] { ... }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


let mut result_offsets = Vec::with_capacity(l.len() + 1);
result_offsets.push(OffsetSize::usize_as(0));
let mut final_rows = Vec::with_capacity(rows_l.num_rows());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for SetOp::Intercept the capacity could be optimised to min(rows_l.num_rows(), rows_r.num_rows())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the capacity for SetOp::Intersect to min(rows_l.num_rows(), rows_r.num_rows()).

return internal_err!("{set_op}: failed to get array from rows");
SetOp::Intersect => {
// Build hash set from right array for lookup table
// then iterator left array to find common elements.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// then iterator left array to find common elements.
// then iterate left array to find common elements.

let overlap_positions = &positions[..overlap_count];

for i in 0..array_size {
if overlap_positions.contains(&i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slice::contains() is O(n) (linear search). Using a HashSet would be O(1), but create_arrays_with_overlap() is called before group.bench_with_input(...), so maybe it is OK.

let overlap_positions: std::collections::HashSet<_> =
            positions[..overlap_count].iter().copied().collect();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

return internal_err!("{set_op}: failed to get array from rows");
SetOp::Intersect => {
// Build hash set from right array for lookup table
// then iterator left array to find common elements.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be faster to create the HashSet from the shorter array and iterate over the longer one. This would minimise the memory usage for the hash set and can reduce the number of hash operations, especially when there's a significant size difference between the two arrays.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now building the HashSet from the shorter array and iterating over the longer one, good catch.

)]
#[derive(Debug, PartialEq, Eq, Hash)]
pub(super) struct ArrayIntersect {
pub struct ArrayIntersect {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now public to be able to use it in the benchmark test (a separate crate).
Maybe annotate it with #[doc(hidden)] to hide it from the end users, since it is not really supposed to be part of the public APIs ?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayIntersect already has #[user_doc], and keeping it visible aligns with how other user-facing SQL functions are exposed?

@lyne7-sc
Copy link
Contributor Author

Thanks everyone for the reviews. I've addressed all the feedback. Please let me know if anything else needs adjustment.

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lyne7-sc looks like it is fine now
test changed because array_ doesnt preserve order like in https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_intersect.html?utm_source=chatgpt.com

However in future it could be a reason for flaky tests.
Thanks @martin-g and @Jefffrey for the review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants