Describe the bug
When running queries on parquet files that carry field-level metadata, without stripping that
metadata, DataFusion fails with the internal error shown below.
To Reproduce
Repro
-- First, ensure that parquet metadata is not skipped (it is skipped by default)
> set datafusion.execution.parquet.skip_metadata = false;
SELECT
  'foo' AS name,
  COUNT(
    CASE
      WHEN prev_value = false AND value = TRUE THEN 1
      ELSE NULL
    END
  ) AS count_true_rises
FROM (
  SELECT
    value,
    LAG(value) OVER (ORDER BY time) AS prev_value
  FROM
    'repro.parquet'
);
Results in
Internal error: Physical input schema should be the same as the one converted from logical input schema. Differences: .
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
I made the parquet file available here:
parquet-with-metadata.zip
Here is the code I used to generate the parquet file (I am not sure how else to create parquet files with field metadata):
Details
use std::collections::HashMap;
use std::fs::File;
use std::sync::Arc;

use arrow::array::{BooleanArray, RecordBatch, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, SchemaRef, TimeUnit};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a parquet file whose `value` field carries metadata
    let mut metadata = HashMap::new();
    metadata.insert(String::from("year"), String::from("2015"));
    let schema: SchemaRef = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("value", DataType::Boolean, false).with_metadata(metadata),
    ]));
    let time = TimestampNanosecondArray::from(vec![
        1_420_070_400_000_000_000i64,
        1_420_070_401_000_000_000i64,
    ]);
    let value = BooleanArray::from(vec![true, false]);
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(time), Arc::new(value)])?;
    println!("Writing parquet file with metadata repro.parquet...");
    let writer = File::create("repro.parquet")?;
    let mut arrow_writer = ArrowWriter::try_new(writer, schema.clone(), None)?;
    arrow_writer.write(&batch)?;
    arrow_writer.close()?;
    Ok(())
}
Note this is all the more confusing because the error lists no differences
... converted from logical input schema. Differences: . <-- no differences are listed!!!
The difference is the metadata on the value field.
Expected behavior
I expect the query to pass without error
Additional context
_No response_