
Conversation

@AntoinePrv
Contributor

@AntoinePrv AntoinePrv commented Oct 29, 2025

Rationale for this change

What changes are included in this PR?

  • Add a new method for building unpacking kernels.
    The constexpr code generation creates a kernel tailored to a given input/output bit width and SIMD size (see the sketch after this list).
  • I have included a number of xsimd fallbacks that have been merged upstream.
  • I have run extensive benchmarks and re-dispatched among different SIMD sizes on specific architectures where the new kernels were not performing well.
  • The biggest win here is SSE4.2, though AVX2 improves too.
  • This is not built/tested for AVX512, though there is no real limitation. Currently the arch detection across the various AVX512 variants is not consistent and sometimes errors. I would need to investigate with the upcoming xsimd release.
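
A minimal scalar sketch of the constexpr-generation idea (illustrative only: UnpackKernelScalar and UnpackDispatch are hypothetical names, and the real kernels are SIMD shuffle/shift kernels rather than a uint64_t load):

#include <cstdint>
#include <cstring>

// Hypothetical scalar stand-in: unpacks 8 little-endian packed values of kBitWidth
// bits into uint32_t outputs. Every shift and mask is a compile-time constant
// derived from the template parameter.
template <int kBitWidth>
void UnpackKernelScalar(const uint8_t* in, uint32_t* out) {
  static_assert(kBitWidth >= 1 && kBitWidth <= 8, "sketch limited to small widths");
  constexpr uint64_t kMask = (uint64_t{1} << kBitWidth) - 1;
  constexpr int kInputBytes = kBitWidth;  // 8 values * kBitWidth bits = kBitWidth bytes
  uint64_t bits = 0;
  std::memcpy(&bits, in, kInputBytes);  // assumes a little-endian host
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint32_t>((bits >> (i * kBitWidth)) & kMask);
  }
}

// The runtime bit width selects one of the compile-time generated kernels.
inline void UnpackDispatch(int bit_width, const uint8_t* in, uint32_t* out) {
  switch (bit_width) {
    case 1: UnpackKernelScalar<1>(in, out); break;
    case 2: UnpackKernelScalar<2>(in, out); break;
    case 3: UnpackKernelScalar<3>(in, out); break;
    // ... remaining widths elided ...
    default: break;
  }
}

The PR applies the same idea per (bit width, SIMD size) pair, with the swizzle and shift constants derived at compile time.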

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@AntoinePrv AntoinePrv force-pushed the new-bpacking branch 4 times, most recently from d2743d4 to 6e72467 on October 30, 2025 15:58
@AntoinePrv AntoinePrv force-pushed the new-bpacking branch 2 times, most recently from a7e4cd9 to 9efa59a on November 20, 2025 17:19
@AntoinePrv AntoinePrv force-pushed the new-bpacking branch 2 times, most recently from d01fdba to b28ea9b on November 27, 2025 09:57
@AntoinePrv AntoinePrv changed the title from "unpack with shuffle algorithm" to "GH-48277: [C++][Parquet] unpack with shuffle algorithm" Nov 27, 2025
@github-actions

⚠️ GitHub issue #48277 has been automatically assigned to the PR creator.

@AntoinePrv AntoinePrv marked this pull request as ready for review November 27, 2025 14:03
@AntoinePrv
Contributor Author

@pitrou apart from R-lint, this is looking pretty good.

@pitrou
Member

pitrou commented Nov 27, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit a4bfe8a. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit a4bfe8a.

There were 37 benchmark results indicating a performance regression:

The full Conbench report has more details.

@AntoinePrv
Contributor Author

AntoinePrv commented Nov 28, 2025

@pitrou I'm running this locally, and I made an error when fixing the ASAN over-read problem.
These latest benchmarks are not doing well.

@pitrou
Member

pitrou commented Nov 28, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit dd3ec0d. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit dd3ec0d.

There were 19 benchmark results indicating a performance regression:

The full Conbench report has more details.

@pitrou
Member

pitrou commented Dec 1, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit 408ef04. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou
Member

pitrou commented Dec 2, 2025

@ursabot please benchmark lang=C++

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 408ef04.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

@AntoinePrv AntoinePrv requested a review from pitrou February 10, 2026 15:28
@pitrou pitrou added the "CI: Extra: C++" label Feb 10, 2026
@AntoinePrv
Contributor Author

@pitrou ready again

Member

@pitrou pitrou left a comment

Looks very good to me, just a couple minor comments!

template void unpack<uint16_t>(const uint8_t*, uint16_t*, int, int, int);
template void unpack<uint32_t>(const uint8_t*, uint32_t*, int, int, int);
template void unpack<uint64_t>(const uint8_t*, uint64_t*, int, int, int);
template void unpack<bool>(const uint8_t*, bool*, const UnpackOptions&);
Member

I'm curious, why not put all the unpack-related APIs inside arrow::internal::bpacking as well? Does it cause too much code churn, or would it fail for other reasons?

Contributor Author

No reason, anything works really. My reasoning was that unpack is a "library-public" utility function, so it lives in arrow::internal, while arrow::internal::bpacking is "private" to the unpack function. Does that make sense?
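
For reference, a rough sketch of the layout being described (the split is illustrative; the declaration mirrors the instantiations quoted above):

#include <cstdint>

namespace arrow::internal {
namespace bpacking {
// "Private" machinery: per-architecture kernels, swizzle tables, dispatch helpers.
}  // namespace bpacking

// "Library-public" utility, implemented on top of the bpacking internals.
template <typename T>
void unpack(const uint8_t*, T*, int, int, int);
}  // namespace arrow::internal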

Member

Kind of, though we might want to revisit later anyway. Not necessary for this PR in any case!

///
/// Resulting in the following swizzled register.
///
/// |AAABBBCC|CDDDEEEF|AAABBBCC|CDDDEEEF||AAABBBCC|CDDDEEEF|CDDDEEEF|FFGGGHHH|
Member

I notice the two 32-bit halves are slightly different but I'm not sure why, since we only care about keeping AAA, CCC, BBB and DDD. Why the difference between CDDDEEEF (first) and FFGGGHHH (second), for example?

Contributor Author

@AntoinePrv AntoinePrv Feb 11, 2026

This is a limitation of the current algorithm. When we swizzle, we assume the worst-case byte spread (here 2) for all values. A value does not care what ends up in the other swizzled bytes; it only contributes its own. This is why we do sizeof(uint32_t) / max_spread = 4 / 2 = 2 shifts per swizzle and not 5 (from your next comment).
So below are the three steps, side by side, that determine what goes where:

Val we want.        | ~ Val AAA       | ~ Val CCC       || ~ Val BBB       | ~ Val DDD       |
Byte idx for val:   | Idx 0  | Idx 1  | Idx 0  | Idx 1  || Idx 0  | Idx 1  | Idx 1  | Idx 2  |
Result data:        |AAABBBCC|CDDDEEEF|AAABBBCC|CDDDEEEF||AAABBBCC|CDDDEEEF|CDDDEEEF|FFGGGHHH|

I'll add a comment to clarify that.

See the next comment for a discussion of improvements to that.
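
To make the byte-spread rule concrete, here is a toy compile-time helper (MakeSwizzleIndices is a hypothetical name, not the PR's code) that derives the per-slot byte indices under a fixed worst-case spread:

#include <array>
#include <cstdint>

// For kBitWidth-bit values with worst-case byte spread kSpread, slot i of the
// swizzled register receives the kSpread consecutive bytes starting at the byte
// that contains value i's first bit.
template <int kBitWidth, int kSpread, int kNumValues>
constexpr std::array<uint8_t, kNumValues * kSpread> MakeSwizzleIndices() {
  std::array<uint8_t, kNumValues * kSpread> indices{};
  for (int i = 0; i < kNumValues; ++i) {
    const int first_byte = (i * kBitWidth) / 8;
    for (int j = 0; j < kSpread; ++j) {
      indices[i * kSpread + j] = static_cast<uint8_t>(first_byte + j);
    }
  }
  return indices;
}

// 3-bit example: values A, B, C, D gather bytes {0,1}, {0,1}, {0,1}, {1,2},
// matching the "Byte idx" row above (the table interleaves the slots as A, C, B, D).
static constexpr auto kSwizzle3Bit =
    MakeSwizzleIndices</*kBitWidth=*/3, /*kSpread=*/2, /*kNumValues=*/4>();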

/// write to memory.
/// Now we can do the same with C and D without having to go to another swizzle (doing
/// multiple shifts per swizzle).
/// The next value to unpack is E, which starts on bit 12 (not byte aligned).
Member

B, C, D were not byte aligned either, so I don't understand what makes E special.

Contributor Author

E is the beginning of a new iteration. The first iteration started with the pointer on A (byte aligned) and extracted A, B, C, D. However, we cannot simply advance the pointer and repeat the kernel loop from here, because E is not byte aligned the way A was.

We need to add another loop inside the kernel that processes E, F, G, H with extra bit offsets. Then on I we are byte aligned again, so we can exit and let the outer loop call the kernel again with the pointer on I.
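
To make the loop structure concrete, here is a scalar toy for the 3-bit case (Unpack3 is a hypothetical helper, not the PR's SIMD code): eight values consume exactly 24 bits, so after the inner steps the pointer is byte aligned again and the outer loop can advance by whole bytes.

#include <cstdint>

inline void Unpack3(const uint8_t* in, uint32_t* out, int num_values) {
  // Outer loop: one kernel call handles 8 values = 24 bits = 3 whole bytes,
  // so `in` is byte aligned again when the next call starts (on value I).
  for (int done = 0; done + 8 <= num_values; done += 8, in += 3, out += 8) {
    // Inner steps: A..D start at bit offsets 0, 3, 6, 9; E..H repeat the same
    // pattern with an extra 12-bit offset, hence a second set of constants in
    // the real kernel.
    for (int i = 0; i < 8; ++i) {
      const int bit = 3 * i;
      const int shift = bit % 8;
      uint32_t gathered = in[bit / 8];
      if (shift + 3 > 8) {  // the value spills into the next byte
        gathered |= static_cast<uint32_t>(in[bit / 8 + 1]) << 8;
      }
      out[i] = (gathered >> shift) & 0x7u;
    }
  }
  // A scalar tail loop would handle the remaining num_values % 8 values.
}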

/// account that we have an extra offset of 12 bits (this will change all the constants
/// used for swizzle and shift).
///
/// Note that in this example, we could have swizzled more than two values in each slot.
Member

Hence my question above :)

/// Note that in this example, we could have swizzled more than two values in each slot.
/// In practice, there may be situations where a more complex algorithm could fit more
/// shifts per swizzles, but that tend to not be the case when the SIMD register size
/// increases.
Member

As a followup task, would be interesting to investigate in which (bit width, SIMD size) combinations a better algorithm could spare some swizzles.

Contributor Author

Yes, in fact the interactive visualizations in the blog post will help a lot.

@github-actions github-actions bot removed the "CI: Extra: C++" label Feb 11, 2026