Simplify file struct abstractions#1120
Conversation
| /// Represents one partition of a Parquet data set and this currently means one Parquet file. | ||
| /// | ||
| /// In the future it would be good to support subsets of files based on ranges of row groups | ||
| /// so that we can better parallelize reads of large files across available cores (see | ||
| /// [ARROW-10995](https://issues.apache.org/jira/browse/ARROW-10995)). | ||
| /// | ||
| /// We may also want to support reading Parquet files that are partitioned based on a key and | ||
| /// in this case we would want this partition struct to represent multiple files for a given | ||
| /// partition key (see [ARROW-11019](https://issues.apache.org/jira/browse/ARROW-11019)). | ||
| #[derive(Debug, Clone)] | ||
| pub struct ParquetPartition { |
There was a problem hiding this comment.
These comments were mostly outdated and the other features mentioned are now planned in the ListingTable provider
| .iter() | ||
| .map(|fp| fp.file_partition.files.as_slice()) | ||
| .collect() | ||
| pub fn file_groups(&self) -> &[Vec<PartitionedFile>] { |
There was a problem hiding this comment.
We replace partitions with file_groups to try decrease the overuse of the term "partition" which represents different (yet similar 😅) things in different contexts:
- on the listing table side, a partition refer to a "hive partition", that is to say a set of files grouped into a folder because they share a common attribute
- on the execution plan side, a partition is a unit of parallelism. Files are grouped together to provide a good workload for one thread/executor.
|
cc @yjshen |
| object_store: Arc<dyn ObjectStore>, | ||
| /// Parquet partitions to read | ||
| partitions: Vec<ParquetPartition>, | ||
| /// List of parquet files, grouped by output partition |
There was a problem hiding this comment.
"output partition" is vague here.
file_group, i.e. Vec<PartitionedFile>, is the unit of parallelism and will be processed by one single executor/thread.
There was a problem hiding this comment.
are you suggesting us coming up with a name that's semantically closer to concurrency?
There was a problem hiding this comment.
I was referring to the ExecutionPlan.output_partitioning(). Let me change this for something slightly more explicit 🙂 (I'll try to update this later today)
There was a problem hiding this comment.
Yes. I think we can rephrase this line to avoid the ambiguity
|
Thank you all for your reviews 😃, eager to see this merged so that I can create the PR for #1122 |
|
Thanks @rdettai ! |
Rationale for this change
Currently we have many abstractions that sound very similar:
PartitionedFile,FilePartition,ParquetPartition. This is an attempt to simplify the code by removingFilePartitionandParquetPartition.What changes are included in this PR?
FilePartitionandParquetPartitionFileGroupsDisplaywrapperFileGroupsDisplaywas not applied to the CSV, Avro and Json exec plans as those will be handled in a separate PR (for Multiple files per partition for CSV Json and Avro exec plans #1122)Are there any user-facing changes?
FilePartitionwas publicly accessible but not really part of the public API