DeltaScanNext exposing with_file_selection as public API #4342
Dear delta-rs team,

I understand we're migrating towards DeltaScanNext to better integrate with the kernel-based scan. My scenario, however, involves parallel, partitioned reads in which multiple threads or processes read different portions of the table concurrently. For a non-partitioned Delta table, the smallest unit of partitioning I can think of is a single Parquet file, identified by an active AddFile action in a particular snapshot.

With DeltaTableProvider's with_files API, I can narrow a scan to a subset of the table's files for a single reader. However, this API does not seem to take Delta semantics into account, so the subsequent plan execution and actual read ignore features such as deletion vectors. Having skimmed the code, and especially the test case "test_scan_with_file_selection_applies_deletion_vectors", it looks promising that DeltaScanNext, being Delta-native, is designed to address this.

So my ask is: when are we planning to release with_file_selection as a public API, together with the wiring needed to ensure that a file-selection-based scan correctly applies the Delta read features we currently support (i.e. column mapping, deletion vectors)?

Thanks, and looking forward to your reply.
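To make the scenario concrete, here is a minimal sketch of how the active data file paths of one snapshot might be divided among concurrent readers. The function name `assign_files_to_readers` and the round-robin policy are assumptions for illustration only and are not part of delta-rs; each resulting group would then be handed to whatever file-scoped scan API is available.

```rust
/// Hypothetical helper (not a delta-rs API): splits the active data file
/// paths of a single snapshot into `num_readers` disjoint groups, assigning
/// files round-robin so that group sizes differ by at most one.
pub fn assign_files_to_readers(files: &[String], num_readers: usize) -> Vec<Vec<String>> {
    assert!(num_readers > 0, "need at least one reader");

    // One (initially empty) group of file paths per reader.
    let mut groups: Vec<Vec<String>> = vec![Vec::new(); num_readers];

    // File i goes to reader i mod num_readers.
    for (i, path) in files.iter().enumerate() {
        groups[i % num_readers].push(path.clone());
    }
    groups
}
```

Every file lands in exactly one group, so the readers together cover the snapshot without overlap. Whether each reader then sees correct results still depends on the scan applying deletion vectors and column mapping for its files, which is the point of the question.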
Replies: 2 comments
@ethan-tyler this seems to be related to the follow-up after this change: 25b4968. Can I expect the API to be official soon?
Thanks for starting this discussion @Cappu7ino! You're reading it right.
with_file_selection is internal migration scaffolding and not a public API yet. The reason is that the DeltaTableProvider -> DeltaScanNext migration is still in progress. We are tracking this in epic #4239.

If there's a strong external use case, I'd be open to a public builder API for snapshot-scoped file selection, but I don't want to stabilize an intermediate surface too early. We just finished the DF 53 / Arrow upgrade, and I will be able to focus on completing the removal/migration.