GH-48277: [C++][Parquet] unpack with shuffle algorithm #47994
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
Force-pushed from d2743d4 to 6e72467
Force-pushed from a7e4cd9 to 9efa59a
Force-pushed from d01fdba to b28ea9b
Force-pushed from f546ed9 to 4f9fbe1
@pitrou apart from R-lint, this is looking pretty good.
@ursabot please benchmark lang=C++

Benchmark runs are scheduled for commit a4bfe8a. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit a4bfe8a. There were 37 benchmark results indicating a performance regression:
The full Conbench report has more details.
@pitrou I'm running this locally, and I made an error when fixing the ASAN over-reading problem.
Force-pushed from a4bfe8a to dd3ec0d
@ursabot please benchmark lang=C++

Benchmark runs are scheduled for commit dd3ec0d. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit dd3ec0d. There were 19 benchmark results indicating a performance regression:
The full Conbench report has more details.
@ursabot please benchmark lang=C++

Benchmark runs are scheduled for commit 408ef04. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@ursabot please benchmark lang=C++

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 408ef04. None of the specified runs were found on the Conbench server. The full Conbench report has more details.
Co-authored-by: Antoine Pitrou <[email protected]>
Force-pushed from 9ba2fce to 714aec6
@pitrou ready again
pitrou left a comment:
Looks very good to me, just a couple minor comments!
template void unpack<uint16_t>(const uint8_t*, uint16_t*, int, int, int);
template void unpack<uint32_t>(const uint8_t*, uint32_t*, int, int, int);
template void unpack<uint64_t>(const uint8_t*, uint64_t*, int, int, int);
template void unpack<bool>(const uint8_t*, bool*, const UnpackOptions&);
I'm curious, why not put all the unpack-related APIs inside arrow::internal::bpacking as well? Does it cause too much code churn, or would it fail for other reasons?
No reason, anything works really. My reasoning was that unpack is a "library-public" utility function, so it lives in arrow::internal, while arrow::internal::bpacking is "private" to the unpack function. Does that make sense?
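To illustrate the split being described, here is a minimal sketch (not the actual headers; only unpack and UnpackOptions appear in the diff above, the rest is illustrative):

```cpp
#include <cstdint>

namespace arrow::internal {

struct UnpackOptions;  // the options type seen in the signatures above

// "Library-public" utility, callable from the rest of Arrow internals.
template <typename T>
void unpack(const uint8_t* in, T* out, const UnpackOptions& options);

namespace bpacking {
// Machinery "private" to unpack(): per-bit-width kernels, SIMD dispatch,
// swizzle and shift constant tables, etc.
}  // namespace bpacking

}  // namespace arrow::internal
```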
Kind of, though we might want to revisit later anyway. Not necessary for this PR in any case!
///
/// Resulting in the following swizzled register.
///
/// |AAABBBCC|CDDDEEEF|AAABBBCC|CDDDEEEF||AAABBBCC|CDDDEEEF|CDDDEEEF|FFGGGHHH|
I notice the two 32-bit halves are slightly different but I'm not sure why, since we only care about keeping AAA, CCC, BBB and DDD. Why the difference between CDDDEEEF (first) and FFGGGHHH (second), for example?
This is a limitation of the current algorithm. When we swizzle, we assume the worst-case byte spread (here 2) for all values. A swizzled slot does not care what is in its other bytes; each value only needs its own bytes present. This is why we are doing sizeof(uint32_t) / max_spread = 4 / 2 = 2 shifts per swizzle and not 5 (from your next comment).
So here are all three steps, one next to the other, that determine what goes where:

Val we want:      |    ~   |Val AAA |    ~   |Val CCC ||    ~   |Val BBB |    ~   |Val DDD |
Byte idx for val: |  Idx 0 |  Idx 1 |  Idx 0 |  Idx 1 ||  Idx 0 |  Idx 1 |  Idx 1 |  Idx 2 |
Result data:      |AAABBBCC|CDDDEEEF|AAABBBCC|CDDDEEEF||AAABBBCC|CDDDEEEF|CDDDEEEF|FFGGGHHH|

I'll add a comment to clarify that.
See the next comment for discussion of improvements to this.
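To make the swizzle-then-shift mechanics concrete, here is a minimal AVX2 sketch for the bit width 3 case. It is not the PR's kernel: the function name is hypothetical, and it assumes LSB-first bit packing (as Parquet's bit-packed encoding uses) with little-endian loads, whereas the ASCII diagrams above draw the bits MSB-first.

```cpp
#include <immintrin.h>

#include <cstdint>
#include <cstring>

// Unpack 8 3-bit values (LSB-first, packed into 3 bytes) into 8 uint32_t.
// Requires AVX2. Note it loads 4 bytes from `in`, i.e. one byte past the
// group: exactly the kind of over-read that must be guarded at the tail.
void Unpack8x3Avx2(const uint8_t* in, uint32_t* out) {
  uint32_t word;
  std::memcpy(&word, in, sizeof(word));  // bytes b0 b1 b2 (+ 1 extra)
  // Broadcast so both 128-bit halves hold the same bytes; shuffle_epi8
  // works within each half, so the same small indices are valid in both.
  __m256i v = _mm256_set1_epi32(static_cast<int>(word));

  // Swizzle: value i starts at bit 3*i, so its worst-case 2-byte spread
  // begins at byte (3*i)/8. Move that byte pair to the low bytes of
  // 32-bit lane i; -1 zeroes the unused upper bytes of each lane.
  const __m256i shuf = _mm256_setr_epi8(
      0, 1, -1, -1, 0, 1, -1, -1, 0, 1, -1, -1, 1, 2, -1, -1,   // lanes 0-3
      1, 2, -1, -1, 1, 2, -1, -1, 2, 3, -1, -1, 2, 3, -1, -1);  // lanes 4-7
  v = _mm256_shuffle_epi8(v, shuf);

  // One variable shift per lane, by (3*i) % 8, then mask to 3 bits.
  const __m256i shifts = _mm256_setr_epi32(0, 3, 6, 1, 4, 7, 2, 5);
  v = _mm256_srlv_epi32(v, shifts);
  v = _mm256_and_si256(v, _mm256_set1_epi32(0x7));

  _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), v);
}
```

Only the shuffle and shift constant tables depend on the bit width, which is what makes this shape of kernel a natural target for compile-time generation.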
/// write to memory.
/// Now we can do the same with C and D without having to go to another swizzle (doing
/// multiple shifts per swizzle).
/// The next value to unpack is E, which starts on bit 12 (not byte aligned).
B, C, D were not byte aligned either, so I don't understand what makes E special.
This is the beginning of a new iteration. The first iteration started with the pointer on A (byte-aligned) and extracted A, B, C, D.
However, we cannot just advance the pointer and repeat the kernel loop from here, because E is unaligned relative to A.
We need another loop inside the kernel that processes E, F, G, H with bit offsets. Then at I we are byte-aligned again, so we can exit and let the outer loop call the kernel again with the pointer on I; see the sketch below.
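A scalar sketch of that loop nesting, under the same assumptions as earlier (bit width 3, LSB-first packing); unpack4_at is a hypothetical stand-in for a swizzle+shift step:

```cpp
#include <cstdint>

// Hypothetical stand-in for one swizzle+shift step: extract four 3-bit
// values starting at a compile-time bit offset (LSB-first packing).
template <int kBitOffset>
inline void unpack4_at(const uint8_t* in, uint32_t* out) {
  for (int j = 0; j < 4; ++j) {
    const int bit = kBitOffset + j * 3;
    // Assume the worst-case 2-byte spread; this can touch one byte past
    // the group's 3 bytes (the over-read mentioned earlier in this PR).
    const uint32_t pair =
        in[bit / 8] | (static_cast<uint32_t>(in[bit / 8 + 1]) << 8);
    out[j] = (pair >> (bit % 8)) & 0x7;
  }
}

void Unpack3Bit(const uint8_t* in, uint32_t* out, int num_values) {
  // 8 values * 3 bits = 24 bits = 3 bytes, so byte alignment recurs every
  // 8 values and the outer loop advances by whole groups.
  for (int i = 0; i + 8 <= num_values; i += 8, in += 3) {
    unpack4_at<0>(in, out + i);       // A, B, C, D: byte-aligned start
    unpack4_at<12>(in, out + i + 4);  // E, F, G, H: extra 12-bit offset
  }
}
```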
/// account that we have an extra offset of 12 bits (this will change all the constants
/// used for swizzle and shift).
///
/// Note that in this example, we could have swizzled more than two values in each slot.
Hence my question above :)
/// Note that in this example, we could have swizzled more than two values in each slot.
/// In practice, there may be situations where a more complex algorithm could fit more
/// shifts per swizzle, but that tends not to be the case as the SIMD register size
/// increases.
As a follow-up task, it would be interesting to investigate in which (bit width, SIMD size) combinations a better algorithm could spare some swizzles.
Yes, in fact the interactive visualizations in the blog post will help a lot.
Rationale for this change
What changes are included in this PR?
The constexpr code generation creates a kernel appropriate for a given input/output bit width and SIMD size. It builds on xsimd fallbacks that have been merged upstream and will need a new xsimd release. (See the sketch after the Q&A below for the general idea.)

Are these changes tested?
Yes
Are there any user-facing changes?
No
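For illustration, here is a minimal C++17 sketch of what "constexpr code generation" for such kernels can look like. This is a toy, not the PR's implementation: the names (Step, MakeSteps, UnpackGroup) are hypothetical, it assumes LSB-first packing and little-endian loads, and it emits scalar code where the real kernels use SIMD shuffle/shift vectors.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

struct Step {
  int byte_offset;  // byte in which the value starts
  int shift;        // bit offset within that byte
  int num_bytes;    // byte spread of this particular value
};

// Compile-time "code generation": the layout constants for one group of 8
// values at a given bit width are computed as a constexpr table.
template <int kBitWidth>
constexpr std::array<Step, 8> MakeSteps() {
  std::array<Step, 8> steps{};
  for (int i = 0; i < 8; ++i) {
    const int bit = i * kBitWidth;
    steps[i] = {bit / 8, bit % 8, (bit % 8 + kBitWidth + 7) / 8};
  }
  return steps;
}

// Unpack one group of 8 values (consumes kBitWidth bytes of input).
// Assumes little-endian loads; width limited so a value fits a lane.
template <int kBitWidth>
void UnpackGroup(const uint8_t* in, uint32_t* out) {
  static_assert(kBitWidth >= 1 && kBitWidth <= 24, "toy limits");
  constexpr std::array<Step, 8> kSteps = MakeSteps<kBitWidth>();
  constexpr uint32_t kMask = (uint32_t{1} << kBitWidth) - 1;
  for (int i = 0; i < 8; ++i) {
    uint32_t lane = 0;
    std::memcpy(&lane, in + kSteps[i].byte_offset, kSteps[i].num_bytes);
    out[i] = (lane >> kSteps[i].shift) & kMask;
  }
}
```

Usage would be, e.g., UnpackGroup<3>(in, out): one call consumes 3 bytes and produces 8 values, with all layout constants folded at compile time; a SIMD instantiation would fold the same constants into shuffle and shift vectors instead.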