[lake/lance] Add NestedRow type support for Lance #2578

Merged
leekeiabstraction merged 7 commits into apache:main from XuQianJin-Stars:feature/issue-2404-row-lance
Mar 23, 2026

@XuQianJin-Stars
Contributor

Purpose

Linked issue: close #2404

This PR adds NestedRow (Struct) type support for Lance lake storage, extending the existing Array type support implementation.

Brief change log

  • LanceArrowUtils.java:

    • Extended toArrowField() to handle RowType by recursively creating child fields for nested struct types
    • Extended toArrowType() to map Fluss RowType to Arrow Struct.INSTANCE
  • ArrowDataConverter.java:

    • Added copyStructVectorData() method to recursively copy data from shaded StructVector to non-shaded StructVector
    • Updated copyVectorData() to detect and delegate to struct-specific copy logic
  • ShadedArrowBatchWriter.java:

    • Extended initFieldVector() to properly allocate and initialize StructVector and its child vectors
  • FlinkLanceTieringTestBase.java:

    • Added createLogTableWithNestedRowType() helper method for creating tables with nested Row columns
    • Added createLogTableWithArrayOfRowType() helper method for creating tables with Array-of-Row columns
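
The recursive schema mapping described in the change log can be illustrated with a minimal, dependency-free sketch. It is not the code from this PR and does not use the Fluss or Arrow classes: `SrcType`, `DstField`, and `toField` are hypothetical stand-ins, assumed here only to show how a row type can become a struct field whose children are built by the same function.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT the Fluss or Arrow classes.
final class StructMappingSketch {

    /** Simplified source-side type: either a scalar or a row with named fields. */
    static final class SrcType {
        final String tag;                // e.g. "INT", "STRING", or "ROW"
        final List<String> fieldNames;   // empty for scalars
        final List<SrcType> fieldTypes;  // empty for scalars

        private SrcType(String tag, List<String> fieldNames, List<SrcType> fieldTypes) {
            this.tag = tag;
            this.fieldNames = fieldNames;
            this.fieldTypes = fieldTypes;
        }

        static SrcType scalar(String tag) {
            return new SrcType(
                    tag, Collections.<String>emptyList(), Collections.<SrcType>emptyList());
        }

        static SrcType row(List<String> names, List<SrcType> types) {
            if (names.size() != types.size()) {
                throw new IllegalArgumentException("names and types must have the same size");
            }
            return new SrcType("ROW", names, types);
        }

        boolean isRow() {
            return "ROW".equals(tag);
        }
    }

    /** Simplified target-side field: a name, a type tag, and child fields. */
    static final class DstField {
        final String name;
        final String typeTag;
        final List<DstField> children;

        DstField(String name, String typeTag, List<DstField> children) {
            this.name = name;
            this.typeTag = typeTag;
            this.children = children;
        }
    }

    /** Maps a source type to a target field, recursing into row fields. */
    static DstField toField(String name, SrcType type) {
        if (!type.isRow()) {
            // Scalars carry their tag over and have no children.
            return new DstField(name, type.tag, Collections.<DstField>emptyList());
        }
        List<DstField> children = new ArrayList<>();
        for (int i = 0; i < type.fieldNames.size(); i++) {
            // Each row field is converted by the same function, so rows of rows work too.
            children.add(toField(type.fieldNames.get(i), type.fieldTypes.get(i)));
        }
        return new DstField(name, "STRUCT", children);
    }
}
```

Because the recursion bottoms out at scalars, a row nested inside another row needs no special case in this sketch: the inner row simply becomes a child field that is itself tagged `STRUCT`.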

Tests

Unit Tests (LanceArrowUtilsTest.java):

  • testToArrowSchemaWithNestedRowType: Verifies simple nested Row type conversion to Arrow Struct
  • testToArrowSchemaWithDeeplyNestedRowType: Verifies deeply nested Row type conversion
  • testToArrowSchemaWithArrayOfRowType: Verifies Array-of-Row type conversion
  • testToArrowSchemaWithRowContainingArray: Verifies Row containing Array field

Unit Tests (LanceTieringTest.java):

  • testTieringWriteTableWithNestedRowType: Verifies writing and reading tables with nested Row type

Integration Tests (LanceTieringITCase.java):

  • testTieringWithNestedRowType: End-to-end test for tiering with nested Row type
  • testTieringWithArrayOfRowType: End-to-end test for tiering with Array-of-Row type

API and Format

No API changes. This change extends the internal Lance lake storage format to support Struct types, which is backward compatible.

Documentation

No documentation changes needed. This is an internal enhancement to support additional data types in Lance lake storage.

@XuQianJin-Stars XuQianJin-Stars force-pushed the feature/issue-2404-row-lance branch from e6bae16 to ffc697d Compare March 11, 2026 04:05
@XuQianJin-Stars
Contributor Author

Hi @wuchong @leekeiabstraction @luoyuxia, I have updated the PR. Please help review when you have some time.

Contributor

@leekeiabstraction leekeiabstraction left a comment

TY for the PR, left small comments

}
}

if (shadedVector

I wonder if we should be more defensive in this class. Currently, if only one of the two vectors is a struct vector, this branch is skipped and the non-struct copy logic is used, so the failure will only appear much later. (I appreciate that this is the convention of the class, e.g. for ListVector, so consider this a nit.)
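
One way to read the suggestion above is as a fail-fast check before choosing the copy path. The sketch below is only an illustration under assumed names: `Kind` and `chooseCopyPath` are hypothetical and do not appear in `ArrowDataConverter`, and the real code works on Arrow vectors rather than an enum.

```java
// Hypothetical sketch; Kind and chooseCopyPath are NOT part of ArrowDataConverter.
final class StructCopyGuardSketch {

    /** Coarse classification of a vector, assumed for this illustration. */
    enum Kind { SCALAR, LIST, STRUCT }

    /**
     * Picks a copy path, failing immediately when exactly one side is a struct
     * instead of silently falling through to the generic copy logic.
     */
    static String chooseCopyPath(Kind source, Kind target) {
        boolean sourceIsStruct = source == Kind.STRUCT;
        boolean targetIsStruct = target == Kind.STRUCT;
        if (sourceIsStruct != targetIsStruct) {
            throw new IllegalStateException(
                    "Vector kind mismatch: source=" + source + ", target=" + target);
        }
        return sourceIsStruct ? "struct-copy" : "generic-copy";
    }
}
```

With a guard like this, a mismatch is reported at the point where the copy path is chosen, rather than surfacing later as a harder-to-trace failure.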

}

@Test
void testTieringWithNestedRowType() throws Exception {

Nit: should we test row of row as well?

| :--- | :--- | :--- | :--- |
| `remote.data.dir` | `none` | String | The directory used for storing the kv snapshot data files and remote log for log tiered storage in a Fluss supported filesystem. |
| `remote.data.dir` | `none` | String | The directory used for storing the kv snapshot data files and remote log for log tiered storage in a Fluss supported filesystem. When upgrading to `remote.data.dirs`, please ensure this value is placed as the first entry in the new configuration.For new clusters, it is recommended to use `remote.data.dirs` instead. If `remote.data.dirs` is configured, this value will be ignored. |
| `remote.data.dirs` | `[]` | ArrayList | A comma-separated list of directories in Fluss supported filesystems for storing the kv snapshot data files and remote log files of tables/partitions. If configured, when a new table or a new partition is created, one of the directories from this list will be selected according to the strategy specified by `remote.data.dirs.strategy` (`ROUND_ROBIN` by default). If not configured, the system uses `remote.data.dir` as the sole remote data directory for all data. |

Are these part of the lance row support change?

| `table.auto-partition.key` | `none` | String | This configuration defines the time-based partition key to be used for auto-partitioning when a table is partitioned with multiple keys. Auto-partitioning utilizes a time-based partition key to handle partitions automatically, including creating new ones and removing outdated ones, by comparing the time value of the partition with the current system time. In the case of a table using multiple partition keys (such as a composite partitioning strategy), this feature determines which key should serve as the primary time dimension for making auto-partitioning decisions.And If the table has only one partition key, this config is not necessary. Otherwise, it must be specified. |
| `table.auto-partition.time-unit` | `DAY` | AutoPartitionTimeUnit | The time granularity for auto created partitions. The default value is `DAY`. Valid values are `HOUR`, `DAY`, `MONTH`, `QUARTER`, `YEAR`. If the value is `HOUR`, the partition format for auto created is yyyyMMddHH. If the value is `DAY`, the partition format for auto created is yyyyMMdd. If the value is `MONTH`, the partition format for auto created is yyyyMM. If the value is `QUARTER`, the partition format for auto created is yyyyQ. If the value is `YEAR`, the partition format for auto created is yyyy. |
| `table.auto-partition.time-zone` | `Europe/Paris` | String | The time zone for auto partitions, which is by default the same as the system time zone. |
| `table.auto-partition.time-zone` | `Asia/Shanghai` | String | The time zone for auto partitions, which is by default the same as the system time zone. |

Ditto

@XuQianJin-Stars
Contributor Author

@leekeiabstraction Hi, I have updated the PR. Please help review when you have some time.

@leekeiabstraction
Contributor

LGTM overall. I'm away atm; I will manually test and review on Sunday when I get back, before approving/merging.

Contributor

@leekeiabstraction leekeiabstraction left a comment

TY, verified with Quickstart instructions (with added row type) from PR: #2716

@xx789633 Would be good if you can have a look as well.

I left a few more small comments, otherwise LGTM, happy to approve once these are resolved and merge if @xx789633 has no further comments.

Comment on lines +508 to +509
int expectBucket,
boolean isPartitioned)

Remove unused args.

List<LogRecord> expectRecords,
int expectBucket,
boolean isPartitioned)
throws Exception {

None of the method calls throw exceptions; remove the throws clause?

}

private Tuple2<List<LogRecord>, List<LogRecord>> genNestedRowLogRecords(
@Nullable String partition, int bucket, int numRecords) {

partition is unused, remove?

@leekeiabstraction leekeiabstraction merged commit 649bb41 into apache:main Mar 23, 2026
6 checks passed
@leekeiabstraction
Contributor

Hello, looks like mdx file change crept back in again. I have removed the unrelated change.

@XuQianJin-Stars
Contributor Author

> Hello, looks like mdx file change crept back in again. I have removed the unrelated change.

Thanks

Development

Successfully merging this pull request may close these issues.

[lake/lance] NestedRow type support for Lance

2 participants