GH-47628: [C++][Parquet] Implement basic parquet file rewriter #47775
base: main
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
@pitrou @adamreeve @mapleFU Do you have any suggestions about this draft? Is there an efficient way to merge two Parquet files' schemas?
mapleFU
left a comment
Hmm, I'm thinking that just reusing the current code is an acceptable approach, since this logic would be a bit hacky with the current implementation's interface...
wgtmac
left a comment
I haven't reviewed all the changes yet and will progressively post my comments.
wgtmac
left a comment
The general workflow of the rewriter looks good to me. However, I don't believe we should directly manipulate the thrift objects.
```cpp
RowGroupRewriter(std::shared_ptr<ArrowInputFile> source,
                 std::shared_ptr<ArrowOutputStream> sink,
                 const RewriterProperties* props,
                 std::shared_ptr<RowGroupReader> row_group_reader,
```
Perhaps introduce a RowGroupContext to hold all the row-group-level readers?
I think it doesn't bring much benefit and requires one more step to unwrap the wrapper class.
Pull request overview
Draft implementation of a C++ Parquet “file rewriter” that can rewrite/concatenate/join Parquet files by copying encoded bytes and re-emitting metadata (optionally including page indexes and bloom filters), avoiding full decode/re-encode.
Changes:
- Add `ParquetFileRewriter` + `RewriterProperties` and implement basic concat (horizontal) + join (vertical) rewriting.
- Add metadata copying helpers (`to_thrift`, new builder entrypoints, new `ToThrift` overloads) to support fast metadata reconstruction.
- Add Arrow-based tests for roundtrip rewriting scenarios (simple, concat, join, concat+join) and wire them into CMake.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| cpp/src/parquet/thrift_internal.h | Add ToThrift helpers for page-index-related structs. |
| cpp/src/parquet/properties.h | Make ReaderProperties::GetStream const; add RewriterProperties + default factory decl. |
| cpp/src/parquet/properties.cc | Update ReaderProperties::GetStream definition to const. |
| cpp/src/parquet/page_index.h | Extend PageIndexBuilder API to allow setting pre-built indexes. |
| cpp/src/parquet/page_index.cc | Implement new PageIndexBuilder setters and mixed builder/prebuilt serialization. |
| cpp/src/parquet/metadata.h | Add start_offset(), expose to_thrift(), extend RowGroupMetaDataBuilder. |
| cpp/src/parquet/metadata.cc | Implement new metadata accessors and new row-group column-chunk injection path. |
| cpp/src/parquet/file_rewriter.h | New public rewriter API (ParquetFileRewriter). |
| cpp/src/parquet/file_rewriter.cc | Core rewriting implementation (copy streams, concat/join logic, index/bloom handling). |
| cpp/src/parquet/bloom_filter_writer.h | Extend BloomFilterBuilder with InsertBloomFilter. |
| cpp/src/parquet/bloom_filter_writer.cc | Implement InsertBloomFilter. |
| cpp/src/parquet/arrow/test_util.h | Add helper to write a table into a Parquet buffer for tests. |
| cpp/src/parquet/arrow/arrow_rewriter_test.cc | Add Arrow-level rewriter roundtrip tests. |
| cpp/src/parquet/CMakeLists.txt | Add new source (file_rewriter.cc) and new test target. |
```cpp
if (page_index_reader_ != nullptr && page_index_builder != nullptr) {
  auto column_index = page_index_reader_->GetColumnIndex(column_ordinal_);
  auto offset_index = page_index_reader_->GetOffsetIndex(column_ordinal_);
  if (column_index != nullptr) {
    page_index_builder->SetColumnIndex(column_ordinal_, column_index);
  }
  if (offset_index != nullptr) {
    page_index_builder->SetOffsetIndex(column_ordinal_, offset_index, shift);
  }
}

if (bloom_filter_reader_ != nullptr && bloom_filter_builder != nullptr) {
  auto bloom_filter = bloom_filter_reader_->GetColumnBloomFilter(column_ordinal_);
  if (bloom_filter != nullptr) {
    bloom_filter_builder->InsertBloomFilter(column_ordinal_, std::move(bloom_filter));
  }
}
```
Copilot AI (Feb 9, 2026)
The new rewriter code path copies page indexes (SetColumnIndex/SetOffsetIndex) and bloom filters (InsertBloomFilter) when enabled, but the added Arrow-level roundtrip tests only validate the decoded table contents. Consider extending tests to assert that rewritten files actually contain the expected page index / bloom filter structures (e.g., via ParquetFileReader::GetPageIndexReader / GetBloomFilterReader and checking non-null indexes/locations), so regressions in metadata copying/offset shifting are caught.
```diff
 std::shared_ptr<ArrowInputStream> GetStream(std::shared_ptr<ArrowInputFile> source,
-                                            int64_t start, int64_t num_bytes);
+                                            int64_t start, int64_t num_bytes) const;
```
Copilot AI (Feb 9, 2026)
Changing ReaderProperties::GetStream to be const changes the mangled symbol and will break ABI for downstream code compiled against earlier versions of the library. If ABI compatibility matters here, consider keeping the old non-const overload (forwarding to the const implementation) instead of changing the existing signature in-place.
```cpp
// Non-const overload kept for ABI compatibility. It forwards to the const
// implementation introduced in a later version.
std::shared_ptr<ArrowInputStream> GetStream(std::shared_ptr<ArrowInputFile> source,
                                            int64_t start, int64_t num_bytes) {
  return static_cast<const ReaderProperties*>(this)->GetStream(source, start, num_bytes);
}
```
```cpp
const void* to_thrift() const;
```
Copilot AI (Feb 9, 2026)
ColumnChunkMetaData::to_thrift() exposes an untyped pointer to an internal Thrift struct as part of the public metadata API, which couples consumers to internal representation and is easy to misuse/UB (wrong cast / lifetime assumptions). Prefer an internal-only accessor, or return a typed reference/pointer to the concrete thrift type in an internal header, or provide a dedicated cloning/copy helper on the builder to avoid exposing raw thrift at all.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.
This is a draft PR for now. I followed Java's implementation, but I don't think it is a good enough design for C++, because we must copy lots of code from file_writer.cc or file_reader.cc, and it will be troublesome to maintain in the future. I would prefer to implement some classes inheriting from `XXXWriter` or `XXXReader`. I'll think about how to refactor the code; if anyone has any good suggestions, please comment.

I have now written two kinds of tests, covering the horizontal splicing and vertical splicing of Parquet files separately. Only horizontal splicing is implemented so far, because I haven't found an efficient way to merge two Parquet files' schemas.
Rationale for this change
Allow rewriting Parquet files at the encoded (binary) level instead of reading and decoding all values and then re-encoding and writing them back.
What changes are included in this PR?
- `ParquetFileRewriter` and `RewriterProperties`.
- `to_thrift` and `SetXXX` methods to help copy the metadata.
- `CopyStream` methods to call `memcpy` between `ArrowInputStream` and `ArrowOutputStream`.
- `RowGroupMetaDataBuilder::NextColumnChunk(std::unique_ptr<ColumnChunkMetaData> cc_metadata, int64_t shift)`, which allows adding column metadata without creating a `ColumnChunkMetaDataBuilder`.

Are these changes tested?
Yes
Are there any user-facing changes?
`ReaderProperties::GetStream` is changed to a const method. Only the signature has been changed; its original implementation already allows it to be declared const.