perf: Improve performance of native row-to-columnar transition used by JVM shuffle #3289
andygrove wants to merge 30 commits into apache:main
Conversation
Use bulk-append methods for primitive types in SparkUnsafeArray:
- Non-nullable path uses append_slice() for an optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp

Benchmark results (10K elements):

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | - | 0.66µs | ~9x |
| timestamp/no_nulls | - | 1.21µs | ~5x |

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
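A minimal sketch of the two paths described above, assuming arrow's `Int32Builder`; the function name `append_i32_elements` and the `is_null` closure are illustrative stand-ins for SparkUnsafeArray's actual accessors, not the PR's code.

```rust
use arrow::array::Int32Builder;

/// Appends `len` i32 elements starting at `base` to the builder.
///
/// SAFETY: the caller must guarantee that `base` points to at least `len`
/// packed i32 values that stay valid (JVM-owned) for the duration of the
/// call, and that `is_null(i)` reflects the array's null bitset.
unsafe fn append_i32_elements(
    builder: &mut Int32Builder,
    base: *const i32,
    len: usize,
    has_nulls: bool,
    is_null: impl Fn(usize) -> bool,
) {
    if !has_nulls {
        // Non-nullable path: a single memcpy-style bulk copy.
        builder.append_slice(std::slice::from_raw_parts(base, len));
    } else {
        // Nullable path: per-element pointer iteration with null checks.
        for i in 0..len {
            if is_null(i) {
                builder.append_null();
            } else {
                builder.append_value(base.add(i).read_unaligned());
            }
        }
    }
}
```

The bulk copy is what produces the large speedups in the non-nullable rows of the table above; the nullable path still has to branch per element.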
The #[inline] attribute on functions with loops iterating over thousands of elements provides no benefit: the function call overhead is negligible compared to loop body execution, and inlining large functions causes instruction cache pressure.

Keep #[inline] only on small helper functions:
- get_header_portion_in_bytes (tiny const fn)
- is_null_at (small, hot path)
- null_bitset_ptr (tiny accessor)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
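For illustration, the distinction looks roughly like this; both functions are hypothetical stand-ins, not the PR's code.

```rust
#[inline]
fn is_null_at(word: u64, index: usize) -> bool {
    // Tiny, hot-path helper: inlining avoids a call per element.
    ((word >> index) & 1) == 1
}

// No #[inline]: the loop body dominates the cost, and forcing inlining of a
// large function only adds instruction-cache pressure at every call site.
fn append_all(values: &[i64], out: &mut Vec<i64>) {
    for v in values {
        out.push(*v);
    }
}

fn main() {
    let mut out = Vec::new();
    append_all(&[1, 2, 3], &mut out);
    assert_eq!(out, vec![1, 2, 3]);
    assert!(is_null_at(0b10, 1));
}
```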
- Remove unused ArrayBuilder import
- Use div_ceil() instead of a manual implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
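For example, computing a null bitset size in 64-bit words; the 64-bit word size is an assumption made only for this illustration.

```rust
fn main() {
    let num_elements: usize = 100;
    let words_manual = (num_elements + 63) / 64; // manual round-up
    let words_std = num_elements.div_ceil(64); // std helper, same result
    assert_eq!(words_manual, words_std);
    println!("null bitset words: {words_std}");
}
```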
Optimize struct field processing in native shuffle by using field-major instead of row-major order. This moves type dispatch from O(rows × fields) to O(fields), eliminating per-row type matching overhead.

Previously, for each row we iterated over all fields and called `append_field()`, which did a type match for EVERY field in EVERY row. For a struct with N fields and M rows, that's N×M type matches.

The new approach:
1. First pass: loop over rows and build struct validity
2. Second pass: for each field, get the typed builder once, then process all rows for that field

This keeps type dispatch at O(fields) instead of O(rows × fields). Complex nested types (struct, list, map) fall back to the existing `append_field`, since they have their own recursive processing logic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
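A minimal sketch of the field-major idea using arrow's `StructBuilder`; the row representation (`Vec<Option<i64>>`) and the function name are placeholders for the PR's SparkUnsafeRow accessors, not its actual API.

```rust
use arrow::array::{Int64Builder, StructArray, StructBuilder};
use arrow::datatypes::{DataType, Field, Fields};

fn append_rows_field_major(rows: &[Vec<Option<i64>>]) -> StructArray {
    let fields = Fields::from(vec![
        Field::new("a", DataType::Int64, true),
        Field::new("b", DataType::Int64, true),
    ]);
    let mut builder = StructBuilder::from_fields(fields, rows.len());

    // Pass 1 (row-major): struct-level validity only.
    for _row in rows {
        builder.append(true);
    }

    // Pass 2 (field-major): downcast each field builder once, then fill all
    // rows for that field before moving to the next field.
    for field_idx in 0..2 {
        let field_builder = builder
            .field_builder::<Int64Builder>(field_idx)
            .expect("Int64Builder");
        for row in rows {
            field_builder.append_option(row[field_idx]);
        }
    }
    builder.finish()
}
```

The key point is that the `field_builder::<Int64Builder>()` downcast (the "type dispatch") happens once per field, while the inner loop over rows is monomorphic.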
Add a Criterion benchmark to measure the performance of struct column processing in native shuffle. Tests various struct sizes (5, 10, 20 fields) and row counts (1K, 10K rows). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
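The benchmark shape is roughly the following; `convert_struct_rows` is a placeholder workload standing in for the actual conversion entry point.

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn convert_struct_rows(num_fields: usize, num_rows: usize) -> usize {
    // Placeholder for the real struct-column conversion.
    num_fields * num_rows
}

fn bench_struct_conversion(c: &mut Criterion) {
    for &num_fields in &[5usize, 10, 20] {
        for &num_rows in &[1_000usize, 10_000] {
            let id = format!("struct_{num_fields}_fields_{num_rows}_rows");
            c.bench_function(&id, |b| {
                b.iter(|| convert_struct_rows(num_fields, num_rows))
            });
        }
    }
}

criterion_group!(benches, bench_struct_conversion);
criterion_main!(benches);
```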
This extends the field-major optimization (from commit 471fb2a) to recursively handle nested Struct fields. Previously, nested structs fell back to row-major processing via `append_field`, losing the benefit of field-major processing at each nesting level.

Changes:
- Add `append_nested_struct_fields_field_major` helper function that recursively processes nested struct fields using field-major order
- Update `append_struct_fields_field_major` to use field-major processing for nested Struct fields instead of falling back to `append_field`
- Add benchmarks for 2-level and 3-level nested structs

The optimization:
1. Gets the nested StructBuilder once per field
2. Builds nested struct validity in one pass
3. Recursively applies field-major processing to nested struct fields

List and Map fields continue to fall back to `append_field` since they have variable-length elements that are harder to optimize.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
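A rough sketch of the recursion, assuming arrow's `StructBuilder`; the signature is simplified and the primitive-field arm is elided (it would use the typed-builder loop from the earlier sketch).

```rust
use arrow::array::StructBuilder;
use arrow::datatypes::{DataType, Fields};

fn append_struct_fields_field_major(builder: &mut StructBuilder, fields: &Fields) {
    for (idx, field) in fields.iter().enumerate() {
        match field.data_type() {
            // Nested structs: fetch the child StructBuilder once, then recurse
            // so every nesting level gets field-major processing.
            DataType::Struct(child_fields) => {
                let child = builder
                    .field_builder::<StructBuilder>(idx)
                    .expect("StructBuilder");
                append_struct_fields_field_major(child, child_fields);
            }
            // Primitive fields would be filled with a typed builder here;
            // List/Map fall back to per-row processing in the real code.
            _ => {}
        }
    }
}
```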
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@ Coverage Diff @@
## main #3289 +/- ##
============================================
+ Coverage 56.12% 60.13% +4.00%
- Complexity 976 1468 +492
============================================
Files 119 175 +56
Lines 11743 16085 +4342
Branches 2251 2665 +414
============================================
+ Hits 6591 9672 +3081
- Misses 4012 5066 +1054
- Partials 1140 1347 +207

☔ View full report in Codecov by Sentry.
Add batched processing for List and Map columns that moves type dispatch outside the row loop, similar to the struct field-major optimization.

Changes:
- Add `append_list_column_batch`, which dispatches on the element type once, then processes all rows with the typed builder
- Add `append_map_column_batch`, which dispatches on key/value types once, with optimized paths for common combinations (Int64/Int64, Int32/Int32, etc.)
- Update `append_columns` to use the new batch functions
- Add benchmark for List<Int64> column conversion

The optimization:
- List columns: type dispatch goes from O(rows) to O(1) for primitive elements
- Map columns: type dispatch goes from O(rows × 2) to O(2) for primitive keys/values
- Complex element types fall back to per-row dispatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
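As a rough illustration of dispatching once on the element type outside the row loop for List<Int64>, using arrow's `ListBuilder`; the row representation and the fallback arm are simplified placeholders, not the PR's actual code.

```rust
use arrow::array::{ArrayBuilder, Int64Builder, ListBuilder};
use arrow::datatypes::DataType;

fn append_list_column_batch(
    builder: &mut ListBuilder<Int64Builder>,
    element_type: &DataType,
    rows: &[Option<Vec<i64>>],
) {
    match element_type {
        // Type dispatch happens once here, not once per row.
        DataType::Int64 => {
            for row in rows {
                match row {
                    Some(values) => {
                        builder.values().append_slice(values);
                        builder.append(true);
                    }
                    None => builder.append(false),
                }
            }
        }
        // Other element types would fall back to per-row dispatch.
        _ => unimplemented!("fallback to per-row append_field"),
    }
}

fn main() {
    let mut builder = ListBuilder::new(Int64Builder::new());
    let rows: Vec<Option<Vec<i64>>> = vec![Some(vec![1, 2, 3]), None, Some(vec![4])];
    append_list_column_batch(&mut builder, &DataType::Int64, &rows);
    assert_eq!(builder.len(), 3);
}
```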
Adds criterion benchmark for Map<Int64, Int64> conversion to ensure the batched map column processing optimization is covered by benchmarks. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename benchmark file to better reflect its scope (struct, list, map)
- Fix incorrect comment: "native shuffle" -> "JVM shuffle"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename complex_type_conversion.rs to jvm_shuffle.rs
- Rename array_conversion.rs to array_element_append.rs
- Merge the row_columnar.rs primitive benchmark into jvm_shuffle.rs
- Delete the redundant row_columnar.rs
- Update comments to clarify that these benchmark the JVM shuffle path

The jvm_shuffle benchmark now covers:
- Primitive types (100 Int64 columns)
- Struct (flat, nested, deeply nested)
- List
- Map

The array_element_append benchmark is a micro-benchmark for the inner loop of array element iteration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add macro `impl_append_to_builder` to generate bulk append methods, removing ~190 lines of duplicated unsafe code in list.rs
- Add comprehensive safety documentation to the SparkUnsafeObject trait explaining memory layout invariants and JVM ownership guarantees
- Add safety documentation to the append_columns function
- Add #[inline] annotations to trait accessor methods for better optimization
- Keep unsafe pointer iteration for performance (benchmarks show a 7-14% regression with the safe accessor approach for some types)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
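A hedged sketch of what such a macro could look like; the struct fields, method names, and the `SparkUnsafeArray` stand-in here are simplified for illustration and are not the real list.rs code.

```rust
use arrow::array::{Array, Float64Builder, Int32Builder, Int64Builder};

/// Simplified stand-in for the real SparkUnsafeArray.
struct SparkUnsafeArray {
    element_ptr: *const u8,
    num_elements: usize,
}

macro_rules! impl_append_to_builder {
    ($fn_name:ident, $native:ty, $builder:ty) => {
        impl SparkUnsafeArray {
            /// SAFETY: `element_ptr` must point to `num_elements` packed
            /// native values that the JVM keeps alive for the whole call.
            pub unsafe fn $fn_name(&self, builder: &mut $builder) {
                let values = std::slice::from_raw_parts(
                    self.element_ptr as *const $native,
                    self.num_elements,
                );
                builder.append_slice(values);
            }
        }
    };
}

impl_append_to_builder!(append_i32_to_builder, i32, Int32Builder);
impl_append_to_builder!(append_i64_to_builder, i64, Int64Builder);
impl_append_to_builder!(append_f64_to_builder, f64, Float64Builder);

fn main() {
    let data = vec![1i64, 2, 3];
    let arr = SparkUnsafeArray {
        element_ptr: data.as_ptr() as *const u8,
        num_elements: data.len(),
    };
    let mut builder = Int64Builder::new();
    // SAFETY: `data` holds `num_elements` i64 values and outlives the call.
    unsafe { arr.append_i64_to_builder(&mut builder) };
    assert_eq!(builder.finish().len(), 3);
}
```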
Replace .unwrap() and .expect() calls on builder downcasts with proper error handling that returns CometError::Internal with descriptive messages including:
- The expected type
- The actual type (via type_id, for downcast_builder_ref)
- The field index (for get_field_builder)

Added two macros:
- downcast_builder_ref!: returns Result with type mismatch details
- get_field_builder!: returns Result with field index and expected type

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
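An illustrative version of the downcast-macro idea, with CometError::Internal modeled as a plain String so the example is self-contained; the real macros also report the actual type_id and the field index as described above.

```rust
use arrow::array::{ArrayBuilder, Int64Builder};

type InternalError = String;

macro_rules! downcast_builder_ref {
    ($builder:expr, $ty:ty) => {
        $builder.as_any_mut().downcast_mut::<$ty>().ok_or_else(|| {
            format!(
                "Expected builder of type {}, but got a different type",
                stringify!($ty)
            )
        })
    };
}

fn use_builder(builder: &mut dyn ArrayBuilder) -> Result<(), InternalError> {
    // Returns a descriptive error instead of panicking on a bad downcast.
    let typed = downcast_builder_ref!(builder, Int64Builder)?;
    typed.append_value(42);
    Ok(())
}

fn main() {
    let mut builder = Int64Builder::new();
    use_builder(&mut builder).unwrap();
}
```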
Add SAFETY comments to:
- SparkUnsafeRow::is_null_at and set_not_null_at
- SparkUnsafeArray::new and is_null_at
- Batch processing functions (append_list_column_batch, append_map_column_batch, append_struct_fields_field_major)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace repeated unsafe pointer dereference patterns with a macro that encapsulates the safety invariants. This reduces code duplication and centralizes the safety documentation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
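One possible shape for such a macro, shown on a toy example; the name `read_element!` and the invariants in its comment are illustrative, while the PR's actual requirements are the ones documented in its SAFETY comments.

```rust
macro_rules! read_element {
    ($ptr:expr, $ty:ty, $idx:expr) => {
        // SAFETY: the caller guarantees the pointer addresses at least
        // `idx + 1` packed values of the element type, alive for the read.
        unsafe { ($ptr as *const $ty).add($idx).read_unaligned() }
    };
}

fn main() {
    let data: [i64; 3] = [10, 20, 30];
    let base = data.as_ptr() as *const u8;
    let second: i64 = read_element!(base, i64, 1);
    assert_eq!(second, 20);
}
```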
Use expect with the builder type name to provide better error messages if a downcast fails. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use if-let instead of is_some() + unwrap() in update_metrics - Use local root_op variable directly instead of re-reading from exec_context Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The benchmark was importing from `shuffle::list` but the module is at `shuffle::spark_unsafe::list`. Also make the list module public so benchmarks can access it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
// so we should always execute partition 0.
let stream = root_op.native_plan.execute(0, task_ctx)?;
exec_context.stream = Some(stream);
exec_context.root_op = Some(root_op);
It has been a month since I made this change, so I do not fully remember, but maybe just removing an unnecessary clone.
…perf

# Conflicts:
#	native/core/src/execution/jni_api.rs
/// The caller must ensure:
/// - `row_addresses_ptr` points to an array of at least `row_end` jlong values
/// - `row_sizes_ptr` points to an array of at least `row_end` jint values
/// - Each address in `row_addresses_ptr[row_start..row_end]` points to valid Spark UnsafeRow data
comphead left a comment
Thanks @andygrove, nice PR. There are many SAFETY requirements; are we sure we comply with them in all possible cases?
That's a good question. Some of the changes in this PR are adding SAFETY docs to existing unsafe code. I created #3603 to add SAFETY docs to existing unsafe code. I think it is good to question any new unsafe code being added in this PR.
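For readers unfamiliar with how such a contract reads in practice, here is a sketch of a function carrying the safety requirements quoted in the review thread above; the function name, body, and the i64/i32 stand-ins for JNI's jlong/jint are hypothetical.

```rust
/// # Safety
///
/// The caller must ensure:
/// - `row_addresses_ptr` points to an array of at least `row_end` i64 values
/// - `row_sizes_ptr` points to an array of at least `row_end` i32 values
/// - each address in `row_addresses_ptr[row_start..row_end]` points to valid
///   Spark UnsafeRow data of the corresponding size
pub unsafe fn process_rows(
    row_addresses_ptr: *const i64,
    row_sizes_ptr: *const i32,
    row_start: usize,
    row_end: usize,
) {
    for i in row_start..row_end {
        let addr = *row_addresses_ptr.add(i);
        let size = *row_sizes_ptr.add(i);
        // ... point a SparkUnsafeRow at (addr, size) and convert it here ...
        let _ = (addr, size);
    }
}
```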
This PR combines numerous related optimizations for the row-to-columnar conversion that happens in JVM shuffle for row-based inputs (Spark-native plans).
Benchmarks are in a separate PR: #3290
Struct Conversion Benchmarks
Flat Structs (Struct)
2-Level Nested Structs (Struct<Struct>)
3-Level Nested Structs (Struct<Struct<Struct>>)
Array Conversion Benchmarks (10K elements)
(Benchmark table not reproduced here; the non-nullable paths all use append_slice().)

Note: Array benchmarks don't exist on main, so the comparison is based on the original PR benchmarks showing a 5-9x speedup for non-nullable arrays.
Summary
Struct Processing Improvements
The improvement is more pronounced with:
Why the Improvement?
- Field-major order: type dispatch happens once per field instead of once per row per field
- Better cache locality: processing all rows for one field before moving to the next field keeps data in the CPU cache
- Recursive optimization: for nested structs, the field-major approach is applied at each nesting level, compounding the benefits