Hive-partitioned listing table crashes when root directory contains non-partitioned files #21755

@zhuqi-lucas

Describe the bug

When a hive-partitioned listing table has files in the root directory (not inside any partition_col=value/ path), queries that reference partition columns fail with:

Arrow error: Schema error: Unable to get field named "year_month". 
Valid fields: ["col1", "col2", ...]

This happens because try_into_partitioned_file in datafusion/catalog-listing/src/helpers.rs includes root-level files with empty partition_values (via parsed.into_iter().flatten()). When the query engine later tries to resolve partition column values for these files, it fails.
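The collapse from "no partition match" to "empty partition values" can be reproduced with the standard library alone. The sketch below is illustrative only: `parse_partitions` and `values_with_flatten` are hypothetical stand-ins, not the actual DataFusion functions, and the real signatures in helpers.rs may differ.

```rust
/// Hypothetical stand-in for the partition parser (not the DataFusion API).
/// Returns `None` when the path does not follow the `col=value/` layout
/// directly under the table root.
fn parse_partitions(table_root: &str, file_path: &str, cols: &[&str]) -> Option<Vec<String>> {
    let rest = file_path.strip_prefix(table_root)?;
    let mut segments: Vec<&str> = rest.trim_start_matches('/').split('/').collect();
    segments.pop(); // drop the file name, keep only directory segments
    if segments.len() < cols.len() {
        return None; // not enough directory levels to hold every partition column
    }
    let mut values = Vec::new();
    for (col, seg) in cols.iter().zip(segments.iter()) {
        let (name, value) = seg.split_once('=')?;
        if name != *col {
            return None; // directory name does not match the expected column
        }
        values.push(value.to_string());
    }
    Some(values)
}

/// Mirrors the `.into_iter().flatten()` pattern described above:
/// `None` and `Some(vec![])` both end up as an empty list of values,
/// so the "not a partitioned file" signal is lost.
fn values_with_flatten(parsed: Option<Vec<String>>) -> Vec<String> {
    parsed.into_iter().flatten().collect()
}
```

Under these assumptions, a root-level path such as s3://bucket/table/data.parquet parses to `None`, but after flattening it is indistinguishable from a file that legitimately has zero partition columns.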

To Reproduce

  1. Create a hive-partitioned external table pointing to a directory that contains both:

    • Root-level files: s3://bucket/table/data.parquet
    • Partitioned files: s3://bucket/table/year_month=2024-01/data.parquet
  2. Query with a partition column reference:

SELECT year_month, COUNT(*) FROM table GROUP BY year_month
  3. Error: Unable to get field named "year_month"

This is a common scenario when a table transitions from non-partitioned to hive-partitioned storage — the original root file may still exist alongside the new partition directories.

Expected behavior

Files outside the partition structure should be skipped (with a debug log), since hive partition values are never null and there is no valid value to assign.
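A minimal sketch of the proposed skipping behavior, assuming the parse step has already produced an `Option` per file. `ListedFile` and `keep_partitioned` are hypothetical names used only for illustration, and `eprintln!` stands in for whatever debug logging the real code would use.

```rust
/// Hypothetical, simplified stand-in for a listed file with its partition values.
#[derive(Debug, PartialEq)]
struct ListedFile {
    path: String,
    partition_values: Vec<String>,
}

/// Keeps only files whose path matched the partition layout.
/// A `None` parse result means the file sits outside the partition
/// structure, so it is skipped instead of being given empty values.
fn keep_partitioned(candidates: Vec<(String, Option<Vec<String>>)>) -> Vec<ListedFile> {
    candidates
        .into_iter()
        .filter_map(|(path, parsed)| match parsed {
            Some(partition_values) => Some(ListedFile { path, partition_values }),
            None => {
                // Stand-in for a debug-level log line.
                eprintln!("skipping file outside partition structure: {path}");
                None
            }
        })
        .collect()
}
```

With this shape, every file that survives listing carries one value per partition column, so later stages never see a file with missing partition values.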

Additional context

  • parse_partitions_for_path already returns None for non-partition files, but the caller (try_into_partitioned_file) converts None to empty partition_values via .flatten()
  • This also causes a "Cannot merge statistics with different number of columns" error if the root file has a different schema than the partitioned files
  • The root file may also cause incorrect COUNT(*) results (double-counting data)