Skip to content

Commit f7688c4

Browse files
authored
fix: use single writer for all partition streams (delta-io#3870)
# Description Collecting the streams concurrently and sending them over a sync channel provides the benefit of us keeping a single writer instance where we write through. This prevents the edge case of potentially writing many small files, when datafusion decided to lots of partitioned streams that are relatively small. Since we now maintain one writer we just keep writing from all partition streams into that. I guess this stacks well with your PR @abhiaagarwal? Wdyt? - closes delta-io#3871 We now only create one file with the same code of the above issue instead of 3 small files: ```json {"protocol":{"minReaderVersion":1,"minWriterVersion":2}} {"metaData":{"id":"0d1846c0-71e6-4c6d-aba1-5b9f8d752937","name":null,"description":null,"format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"foo\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"createdTime":1760889081860,"configuration":{}}} {"add":{"path":"part-00000-352305f0-eff1-44d3-993e-6bd31cdd118c-c000.snappy.parquet","partitionValues":{},"size":510,"modificationTime":1760889081877,"dataChange":true,"stats":"{\"numRecords\":9,\"minValues\":{\"foo\":1},\"maxValues\":{\"foo\":3},\"nullCount\":{\"foo\":0}}","tags":null,"baseRowId":null,"defaultRowCommitVersion":null,"clusteringProvider":null}} {"commitInfo":{"timestamp":1760889081878,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists"},"engineInfo":"delta-rs:py-1.2.0","clientVersion":"delta-rs.py-1.2.0","operationMetrics":{"execution_time_ms":19,"num_added_files":1,"num_added_rows":9,"num_partitions":0,"num_removed_files":0}}} ``` --------- Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com> Signed-off-by: Ion Koutsouris
1 parent d0ae849 commit f7688c4

2 files changed

Lines changed: 241 additions & 168 deletions

File tree

0 commit comments

Comments
 (0)