[FEATURE] Group initial load tasks by file size in iceberg-source #6725

@lawofcycles

Is your feature request related to a problem? Please describe.

The iceberg-source creates one task per data file during initial load. For tables with many small files, the coordination overhead per task (DynamoDB acquire/complete operations) can dominate the actual file processing time.

Describe the solution you'd like

Group multiple data files into a single initial load task based on total file size, consistent with the approach planned for SHUFFLE_WRITE tasks.
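One way the grouping could work is a greedy pass that packs files into a task until a target size is reached. The sketch below is illustrative only: `InitialLoadTaskGrouper`, `DataFileInfo`, `groupBySize`, and the `targetBytes` parameter are hypothetical names, not part of the existing iceberg-source code, and the actual grouping policy (ordering, partition boundaries, threshold) is still open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the iceberg-source plugin.
final class InitialLoadTaskGrouper {

    /** Minimal stand-in for a data file: only its path and size in bytes. */
    static final class DataFileInfo {
        final String path;
        final long sizeBytes;

        DataFileInfo(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    /**
     * Packs files, in input order, into groups whose total size stays at or
     * below targetBytes. A file larger than the target gets a group of its own.
     */
    static List<List<DataFileInfo>> groupBySize(List<DataFileInfo> files, long targetBytes) {
        if (targetBytes <= 0) {
            throw new IllegalArgumentException("targetBytes must be positive");
        }
        List<List<DataFileInfo>> groups = new ArrayList<>();
        List<DataFileInfo> current = new ArrayList<>();
        long currentBytes = 0;

        for (DataFileInfo file : files) {
            // Close the current group if adding this file would exceed the target.
            if (!current.isEmpty() && currentBytes + file.sizeBytes > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(file);
            currentBytes += file.sizeBytes;
        }
        // Flush the last, possibly partial, group.
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

Each returned group would become one initial load task, so a table with many small files needs one acquire/complete round trip per group rather than per file. Large files still end up in single-file tasks, which keeps the behavior for big files the same as today.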

Additional context

Related: #6682 (source-layer shuffle), #6724

Labels: enhancement (New feature or request)