Skip to content

Conversation

@HuaHuaY
Copy link
Contributor

@HuaHuaY HuaHuaY commented Oct 10, 2025

This is a draft PR now. I follow Java's implementation but I think it is not a good enough design for C++. Because we must copy lots of code from file_writer.cc or file_reader.cc and it will be troublesome to maintain in the future. I prefer to implement some classes inheriting XXXWriter or XXXReader. I'll think about how to refactor the code. If anyone has any good suggestions, please comment.

Now I have written two kinds of tests. Test the horizontal splicing and vertical splicing of parquet files separately. But only horizontal splicing is implemented now because I don't find an efficient way to merge two parquet files' schema.

Rationale for this change

Allow to rewrite parquet files in binary data formats instead of reading, decoding all values and writing them.

What changes are included in this PR?

  1. Add class ParquetFileRewriter and RewriterProperties.
  2. Add some to_thrift and SetXXX methods to help me copy the metadata.
  3. Add CopyStream methods to call memcpy between ArrowInputStream and ArrowOutputStream.
  4. Add RowGroupMetaDataBuilder::NextColumnChunk(std::unique_ptr<ColumnChunkMetaData> cc_metadata, int64_t shift) which allows to add column metadata without creating ColumnChunkMetaDataBuilder.

Are these changes tested?

Yes

Are there any user-facing changes?

  1. Add some new classes and methods mentioned above.
  2. ReaderProperties::GetStream is changed to a const method. Only the signature has been changed. Its original implementation allows it to be declared as a const method.

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@HuaHuaY HuaHuaY changed the title [C++][Parquet] Implement basic parquet file rewriter [C++][Parquet] GH-47628: Implement basic parquet file rewriter Oct 10, 2025
@HuaHuaY
Copy link
Contributor Author

HuaHuaY commented Oct 13, 2025

@pitrou @adamreeve @mapleFU Do you have any suggestions about this draft? Is there any efficient way to merge two parquet files' schema?

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emm I'm thinking that just reuse the current code a ok way, since these logic in current impl would be a bit hacking with current interface...

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 13, 2025
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed all the changes yet and will progressively post my comments.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general workflow of the rewriter looks good to me. However, I don't believe we should directly manipulate the thrift objects.

RowGroupRewriter(std::shared_ptr<ArrowInputFile> source,
std::shared_ptr<ArrowOutputStream> sink,
const RewriterProperties* props,
std::shared_ptr<RowGroupReader> row_group_reader,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps introduce a RowGroupContext to hold all row group xxx readers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesn't bring much benefit and requires one more step to unwrap of the wrapper class.

@HuaHuaY HuaHuaY force-pushed the fix_issue_47664 branch 2 times, most recently from b70917f to 439103e Compare January 15, 2026 04:36
@HuaHuaY HuaHuaY marked this pull request as ready for review February 9, 2026 12:36
Copilot AI review requested due to automatic review settings February 9, 2026 12:36
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Draft implementation of a C++ Parquet “file rewriter” that can rewrite/concatenate/join Parquet files by copying encoded bytes and re-emitting metadata (optionally including page indexes and bloom filters), avoiding full decode/re-encode.

Changes:

  • Add ParquetFileRewriter + RewriterProperties and implement basic concat (horizontal) + join (vertical) rewriting.
  • Add metadata copying helpers (to_thrift, new builder entrypoints, new ToThrift overloads) to support fast metadata reconstruction.
  • Add Arrow-based tests for roundtrip rewriting scenarios (simple, concat, join, concat+join) and wire into CMake.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
cpp/src/parquet/thrift_internal.h Add ToThrift helpers for page-index-related structs.
cpp/src/parquet/properties.h Make ReaderProperties::GetStream const; add RewriterProperties + default factory decl.
cpp/src/parquet/properties.cc Update ReaderProperties::GetStream definition to const.
cpp/src/parquet/page_index.h Extend PageIndexBuilder API to allow setting pre-built indexes.
cpp/src/parquet/page_index.cc Implement new PageIndexBuilder setters and mixed builder/prebuilt serialization.
cpp/src/parquet/metadata.h Add start_offset(), expose to_thrift(), extend RowGroupMetaDataBuilder.
cpp/src/parquet/metadata.cc Implement new metadata accessors and new row-group column-chunk injection path.
cpp/src/parquet/file_rewriter.h New public rewriter API (ParquetFileRewriter).
cpp/src/parquet/file_rewriter.cc Core rewriting implementation (copy streams, concat/join logic, index/bloom handling).
cpp/src/parquet/bloom_filter_writer.h Extend BloomFilterBuilder with InsertBloomFilter.
cpp/src/parquet/bloom_filter_writer.cc Implement InsertBloomFilter.
cpp/src/parquet/arrow/test_util.h Add helper to write a table into a Parquet buffer for tests.
cpp/src/parquet/arrow/arrow_rewriter_test.cc Add Arrow-level rewriter roundtrip tests.
cpp/src/parquet/CMakeLists.txt Add new source (file_rewriter.cc) and new test target.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +118 to +141
if (page_index_reader_ != nullptr && page_index_builder != nullptr) {
auto column_index = page_index_reader_->GetColumnIndex(column_ordinal_);
auto offset_index = page_index_reader_->GetOffsetIndex(column_ordinal_);
if (column_index != nullptr) {
page_index_builder->SetColumnIndex(column_ordinal_, column_index);
}
if (offset_index != nullptr) {
page_index_builder->SetOffsetIndex(column_ordinal_, offset_index, shift);
}
}

if (bloom_filter_reader_ != nullptr && bloom_filter_builder != nullptr) {
auto bloom_filter = bloom_filter_reader_->GetColumnBloomFilter(column_ordinal_);
if (bloom_filter != nullptr) {
bloom_filter_builder->InsertBloomFilter(column_ordinal_,
std::move(bloom_filter));
}
}
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new rewriter code path copies page indexes (SetColumnIndex/SetOffsetIndex) and bloom filters (InsertBloomFilter) when enabled, but the added Arrow-level roundtrip tests only validate the decoded table contents. Consider extending tests to assert that rewritten files actually contain the expected page index / bloom filter structures (e.g., via ParquetFileReader::GetPageIndexReader / GetBloomFilterReader and checking non-null indexes/locations), so regressions in metadata copying/offset shifting are caught.

Copilot uses AI. Check for mistakes.
std::shared_ptr<ArrowInputStream> GetStream(std::shared_ptr<ArrowInputFile> source,
int64_t start, int64_t num_bytes);
int64_t start, int64_t num_bytes) const;

Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing ReaderProperties::GetStream to be const changes the mangled symbol and will break ABI for downstream code compiled against earlier versions of the library. If ABI compatibility matters here, consider keeping the old non-const overload (forwarding to the const implementation) instead of changing the existing signature in-place.

Suggested change
// Non-const overload kept for ABI compatibility. It forwards to the const
// implementation introduced in a later version.
std::shared_ptr<ArrowInputStream> GetStream(std::shared_ptr<ArrowInputFile> source,
int64_t start, int64_t num_bytes) {
return static_cast<const ReaderProperties*>(this)->GetStream(source, start, num_bytes);
}

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +165
const void* to_thrift() const;

Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColumnChunkMetaData::to_thrift() exposes an untyped pointer to an internal Thrift struct as part of the public metadata API, which couples consumers to internal representation and is easy to misuse/UB (wrong cast / lifetime assumptions). Prefer an internal-only accessor, or return a typed reference/pointer to the concrete thrift type in an internal header, or provide a dedicated cloning/copy helper on the builder to avoid exposing raw thrift at all.

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@HuaHuaY HuaHuaY changed the title [C++][Parquet] GH-47628: Implement basic parquet file rewriter GH-47628: [C++][Parquet] Implement basic parquet file rewriter Feb 10, 2026
@github-actions
Copy link

⚠️ GitHub issue #47628 has been automatically assigned in GitHub to PR creator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants