Describe the bug
When a hive-partitioned listing table has files in the root directory (not inside any `partition_col=value/` path), queries that reference partition columns fail with:

```
Arrow error: Schema error: Unable to get field named "year_month".
Valid fields: ["col1", "col2", ...]
```

This happens because `try_into_partitioned_file` in `datafusion/catalog-listing/src/helpers.rs` includes root-level files with empty `partition_values` (via `parsed.into_iter().flatten()`). When the query engine later tries to resolve partition column values for these files, it fails.
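The mechanism can be reproduced in isolation: when the partition parser returns `None` for a path outside the partition layout, `Option::into_iter().flatten()` silently yields an empty value list instead of propagating the failure. A minimal sketch, where `parse_partitions_for_path` is a simplified stand-in for DataFusion's helper, not its actual implementation:

```rust
/// Simplified stand-in for DataFusion's partition parsing: returns one value
/// per expected `col=value/` segment, or `None` for files outside the
/// partition structure.
fn parse_partitions_for_path(path: &str, partition_cols: &[&str]) -> Option<Vec<String>> {
    let mut values = Vec::new();
    let mut segments = path.split('/');
    for col in partition_cols {
        let (name, value) = segments.next()?.split_once('=')?;
        if name != *col {
            return None;
        }
        values.push(value.to_string());
    }
    Some(values)
}

fn main() {
    let cols = ["year_month"];

    // A properly partitioned file parses to one value per partition column.
    let parsed = parse_partitions_for_path("year_month=2024-01/data.parquet", &cols);
    assert_eq!(parsed, Some(vec!["2024-01".to_string()]));

    // A root-level file yields None...
    let parsed = parse_partitions_for_path("data.parquet", &cols);
    assert_eq!(parsed, None);

    // ...but `.into_iter().flatten()` turns that None into an *empty*
    // partition_values list, which is what gets handed to the query engine.
    let partition_values: Vec<String> = parsed.into_iter().flatten().collect();
    assert!(partition_values.is_empty());
    println!("root file: {} values for {} partition columns", partition_values.len(), cols.len());
}
```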
To Reproduce

1. Create a hive-partitioned external table pointing to a directory that contains both:
   - a root-level file: `s3://bucket/table/data.parquet`
   - a partitioned file: `s3://bucket/table/year_month=2024-01/data.parquet`
2. Query with a partition column reference:
   ```sql
   SELECT year_month, COUNT(*) FROM table GROUP BY year_month
   ```
3. Observe the error:
   ```
   Unable to get field named "year_month"
   ```
This is a common scenario when a table transitions from non-partitioned to hive-partitioned storage — the original root file may still exist alongside the new partition directories.
Expected behavior
Files outside the partition structure should be skipped (with a debug log), since hive partition values are never null and there is no valid value to assign.
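The expected behavior above can be sketched as follows. This is an illustrative outline, assuming a hypothetical listing step; the types and function names are stand-ins, not DataFusion's actual API:

```rust
/// Illustrative stand-in for DataFusion's partition parsing: `None` means
/// the path does not follow the `col=value/` partition layout.
fn parse_partitions_for_path(path: &str, partition_cols: &[&str]) -> Option<Vec<String>> {
    let mut values = Vec::new();
    let mut segments = path.split('/');
    for col in partition_cols {
        let (name, value) = segments.next()?.split_once('=')?;
        if name != *col {
            return None;
        }
        values.push(value.to_string());
    }
    Some(values)
}

/// Proposed behavior: files whose paths don't parse are skipped (with a
/// debug log) rather than admitted with empty partition values.
fn partitioned_files(paths: &[&str], cols: &[&str]) -> Vec<(String, Vec<String>)> {
    paths
        .iter()
        .filter_map(|&p| match parse_partitions_for_path(p, cols) {
            Some(values) => Some((p.to_string(), values)),
            None => {
                // Stand-in for a `log::debug!` call: the file is outside the
                // partition structure, so there is no valid value to assign.
                eprintln!("skipping non-partitioned file: {p}");
                None
            }
        })
        .collect()
}

fn main() {
    let files = partitioned_files(
        &["data.parquet", "year_month=2024-01/data.parquet"],
        &["year_month"],
    );
    // Only the partitioned file survives; the root-level file is skipped.
    assert_eq!(files.len(), 1);
    assert_eq!(files[0].1, vec!["2024-01".to_string()]);
    println!("kept: {}", files[0].0);
}
```

Skipping (rather than erroring) matches the transition scenario described above: a leftover root file should not break queries against the new partition layout.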
Additional context

- `parse_partitions_for_path` already returns `None` for non-partition files, but the caller (`try_into_partitioned_file`) converts `None` to empty `partition_values` via `.flatten()`.
- This also causes `Cannot merge statistics with different number of columns` if the root file has a different schema than the partitioned files.
- The root file may also cause incorrect `COUNT(*)` results (double-counting data).