GH-48277: [C++][Parquet] unpack with shuffle algorithm #47994
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
Force-pushed from d2743d4 to 6e72467
Force-pushed from a7e4cd9 to 9efa59a
Force-pushed from d01fdba to b28ea9b
Force-pushed from f546ed9 to 4f9fbe1
@pitrou apart from R-lint, this is looking pretty good.
@ursabot please benchmark lang=C++

Benchmark runs are scheduled for commit a4bfe8a. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit a4bfe8a. There were 37 benchmark results indicating a performance regression:
The full Conbench report has more details.
@pitrou I'm running this locally, and I made an error when fixing the ASAN over-reading problem.
Force-pushed from a4bfe8a to dd3ec0d
@ursabot please benchmark lang=C++

Benchmark runs are scheduled for commit dd3ec0d. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit dd3ec0d. There were 19 benchmark results indicating a performance regression:
The full Conbench report has more details.
@ursabot please benchmark lang=C++

Benchmark runs are scheduled for commit 408ef04. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@ursabot please benchmark lang=C++

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 408ef04. None of the specified runs were found on the Conbench server. The full Conbench report has more details.
Co-authored-by: Antoine Pitrou <[email protected]>
Force-pushed from 9ba2fce to 714aec6
@pitrou ready again
pitrou left a comment:
Looks very good to me, just a couple minor comments!
template void unpack<uint16_t>(const uint8_t*, uint16_t*, int, int, int);
template void unpack<uint32_t>(const uint8_t*, uint32_t*, int, int, int);
template void unpack<uint64_t>(const uint8_t*, uint64_t*, int, int, int);
template void unpack<bool>(const uint8_t*, bool*, const UnpackOptions&);
I'm curious, why not put all the unpack-related APIs inside arrow::internal::bpacking as well? Does it cause too much code churn, or would it fail for other reasons?
No reason, anything works really. My reasoning was that unpack is a "library-public" utility function, so it lives in arrow::internal, while arrow::internal::bpacking is "private" to the unpack function. Does that make sense?
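To illustrate the split being described, here is a minimal sketch (not the actual headers; only unpack and UnpackOptions appear in the diff above, the rest is illustrative):

```cpp
#include <cstdint>

namespace arrow::internal {

struct UnpackOptions;  // the options type seen in the signatures above

// "Library-public" utility, callable from the rest of Arrow internals.
template <typename T>
void unpack(const uint8_t* in, T* out, const UnpackOptions& options);

namespace bpacking {
// Machinery "private" to unpack(): per-bit-width kernels, SIMD dispatch,
// swizzle and shift constant tables, etc.
}  // namespace bpacking

}  // namespace arrow::internal
```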
Kind of, though we might want to revisit later anyway. Not necessary for this PR in any case!
///
/// Resulting in the following swizzled register.
///
/// |AAABBBCC|CDDDEEEF|AAABBBCC|CDDDEEEF||AAABBBCC|CDDDEEEF|CDDDEEEF|FFGGGHHH|
I notice the two 32-bit halves are slightly different but I'm not sure why, since we only care about keeping AAA, CCC, BBB and DDD. Why the difference between CDDDEEEF (first) and FFGGGHHH (second), for example?
This is a limitation of the current algorithm. When we swizzle, we assume the worst-case byte spread (here 2) for all values. A swizzled slot does not care what is in its other bytes; each value only needs its own bytes present. This is why we are doing sizeof(uint32_t) / max_spread = 4 / 2 = 2 shifts per swizzle and not 5 (from your next comment).
So here are all three steps, one next to the other, that determine what goes where:

Val we want:      |    ~   |Val AAA |    ~   |Val CCC ||    ~   |Val BBB |    ~   |Val DDD |
Byte idx for val: |  Idx 0 |  Idx 1 |  Idx 0 |  Idx 1 ||  Idx 0 |  Idx 1 |  Idx 1 |  Idx 2 |
Result data:      |AAABBBCC|CDDDEEEF|AAABBBCC|CDDDEEEF||AAABBBCC|CDDDEEEF|CDDDEEEF|FFGGGHHH|

I'll add a comment to clarify that.
See the next comment for discussion of improvements to this.
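To make the swizzle-then-shift mechanics concrete, here is a minimal AVX2 sketch for the bit width 3 case. It is not the PR's kernel: the function name is hypothetical, and it assumes LSB-first bit packing (as Parquet's bit-packed encoding uses) with little-endian loads, whereas the ASCII diagrams above draw the bits MSB-first.

```cpp
#include <immintrin.h>

#include <cstdint>
#include <cstring>

// Unpack 8 3-bit values (LSB-first, packed into 3 bytes) into 8 uint32_t.
// Requires AVX2. Note it loads 4 bytes from `in`, i.e. one byte past the
// group: exactly the kind of over-read that must be guarded at the tail.
void Unpack8x3Avx2(const uint8_t* in, uint32_t* out) {
  uint32_t word;
  std::memcpy(&word, in, sizeof(word));  // bytes b0 b1 b2 (+ 1 extra)
  // Broadcast so both 128-bit halves hold the same bytes; shuffle_epi8
  // works within each half, so the same small indices are valid in both.
  __m256i v = _mm256_set1_epi32(static_cast<int>(word));

  // Swizzle: value i starts at bit 3*i, so its worst-case 2-byte spread
  // begins at byte (3*i)/8. Move that byte pair to the low bytes of
  // 32-bit lane i; -1 zeroes the unused upper bytes of each lane.
  const __m256i shuf = _mm256_setr_epi8(
      0, 1, -1, -1, 0, 1, -1, -1, 0, 1, -1, -1, 1, 2, -1, -1,   // lanes 0-3
      1, 2, -1, -1, 1, 2, -1, -1, 2, 3, -1, -1, 2, 3, -1, -1);  // lanes 4-7
  v = _mm256_shuffle_epi8(v, shuf);

  // One variable shift per lane, by (3*i) % 8, then mask to 3 bits.
  const __m256i shifts = _mm256_setr_epi32(0, 3, 6, 1, 4, 7, 2, 5);
  v = _mm256_srlv_epi32(v, shifts);
  v = _mm256_and_si256(v, _mm256_set1_epi32(0x7));

  _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), v);
}
```

Only the shuffle and shift constant tables depend on the bit width, which is what makes this shape of kernel a natural target for compile-time generation.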
/// write to memory.
/// Now we can do the same with C and D without having to go to another swizzle (doing
/// multiple shifts per swizzle).
/// The next value to unpack is E, which starts on bit 12 (not byte aligned).
B, C, D were not byte aligned either, so I don't understand what makes E special.
This is the beginning of a new iteration. The first iteration started with the pointer on A (byte-aligned) and extracted A, B, C, D.
However, we cannot just advance the pointer and repeat the kernel loop from here, because E is unaligned relative to A.
We need another loop inside the kernel that processes E, F, G, H with bit offsets. Then at I we are byte-aligned again, so we can exit and let the outer loop call the kernel again with the pointer on I; see the sketch below.
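A scalar sketch of that loop nesting, under the same assumptions as earlier (bit width 3, LSB-first packing); unpack4_at is a hypothetical stand-in for a swizzle+shift step:

```cpp
#include <cstdint>

// Hypothetical stand-in for one swizzle+shift step: extract four 3-bit
// values starting at a compile-time bit offset (LSB-first packing).
template <int kBitOffset>
inline void unpack4_at(const uint8_t* in, uint32_t* out) {
  for (int j = 0; j < 4; ++j) {
    const int bit = kBitOffset + j * 3;
    // Assume the worst-case 2-byte spread; this can touch one byte past
    // the group's 3 bytes (the over-read mentioned earlier in this PR).
    const uint32_t pair =
        in[bit / 8] | (static_cast<uint32_t>(in[bit / 8 + 1]) << 8);
    out[j] = (pair >> (bit % 8)) & 0x7;
  }
}

void Unpack3Bit(const uint8_t* in, uint32_t* out, int num_values) {
  // 8 values * 3 bits = 24 bits = 3 bytes, so byte alignment recurs every
  // 8 values and the outer loop advances by whole groups.
  for (int i = 0; i + 8 <= num_values; i += 8, in += 3) {
    unpack4_at<0>(in, out + i);       // A, B, C, D: byte-aligned start
    unpack4_at<12>(in, out + i + 4);  // E, F, G, H: extra 12-bit offset
  }
}
```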
/// account that we have an extra offset of 12 bits (this will change all the constants
/// used for swizzle and shift).
///
/// Note that in this example, we could have swizzled more than two values in each slot.
Hence my question above :)
/// Note that in this example, we could have swizzled more than two values in each slot.
/// In practice, there may be situations where a more complex algorithm could fit more
/// shifts per swizzle, but that tends not to be the case as the SIMD register size
/// increases.
As a follow-up task, it would be interesting to investigate in which (bit width, SIMD size) combinations a better algorithm could spare some swizzles.
Yes, in fact the interactive visualizations in the blog post will help a lot.
Rationale for this change
What changes are included in this PR?
The constexpr code generation creates a kernel appropriate for a given input/output bit width and SIMD size. It builds on xsimd fallbacks that have been merged upstream and will need a new xsimd release. (See the sketch after the Q&A below for the general idea.)

Are these changes tested?
Yes
Are there any user-facing changes?
No
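For illustration, here is a minimal C++17 sketch of what "constexpr code generation" for such kernels can look like. This is a toy, not the PR's implementation: the names (Step, MakeSteps, UnpackGroup) are hypothetical, it assumes LSB-first packing and little-endian loads, and it emits scalar code where the real kernels use SIMD shuffle/shift vectors.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

struct Step {
  int byte_offset;  // byte in which the value starts
  int shift;        // bit offset within that byte
  int num_bytes;    // byte spread of this particular value
};

// Compile-time "code generation": the layout constants for one group of 8
// values at a given bit width are computed as a constexpr table.
template <int kBitWidth>
constexpr std::array<Step, 8> MakeSteps() {
  std::array<Step, 8> steps{};
  for (int i = 0; i < 8; ++i) {
    const int bit = i * kBitWidth;
    steps[i] = {bit / 8, bit % 8, (bit % 8 + kBitWidth + 7) / 8};
  }
  return steps;
}

// Unpack one group of 8 values (consumes kBitWidth bytes of input).
// Assumes little-endian loads; width limited so a value fits a lane.
template <int kBitWidth>
void UnpackGroup(const uint8_t* in, uint32_t* out) {
  static_assert(kBitWidth >= 1 && kBitWidth <= 24, "toy limits");
  constexpr std::array<Step, 8> kSteps = MakeSteps<kBitWidth>();
  constexpr uint32_t kMask = (uint32_t{1} << kBitWidth) - 1;
  for (int i = 0; i < 8; ++i) {
    uint32_t lane = 0;
    std::memcpy(&lane, in + kSteps[i].byte_offset, kSteps[i].num_bytes);
    out[i] = (lane >> kSteps[i].shift) & kMask;
  }
}
```

Usage would be, e.g., UnpackGroup<3>(in, out): one call consumes 3 bytes and produces 8 values, with all layout constants folded at compile time; a SIMD instantiation would fold the same constants into shuffle and shift vectors instead.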