Skip to content

Commit 69cd666

Browse files
fix: skip empty metadata in intersect_metadata_for_union to prevent s… (#21127)
## Which issue does this PR close? - Closes #19049. ## Rationale for this change We're building a SQL engine on top of DataFusion and hit this while running benchmarks. A `UNION ALL` query against Parquet files that carry field metadata (like `PARQUET:field_id` or InfluxDB's `iox::column::type`). When one branch of the union has a NULL literal, `intersect_metadata_for_union` intersects the metadata from the data source with the empty metadata from the NULL — and since intersecting anything with an empty set gives empty, all metadata gets wiped out. Later, when `optimize_projections` prunes columns and `recompute_schema` rebuilds the Union schema, the logical schema has `{}` while the physical schema still has the original metadata from Parquet. This causes: ``` Internal error: Physical input schema should be the same as the one converted from logical input schema. Differences: - field metadata at index 0 [usage_idle]: (physical) {"iox::column::type": "..."} vs (logical) {} ``` As @erratic-pattern and @alamb discussed in the issue, empty metadata from NULL literals isn't saying "this field has no metadata" — it's saying "I don't know." It shouldn't erase metadata from branches that actually have it. I fixed this in `intersect_metadata_for_union` directly rather than patching `optimize_projections` or `recompute_schema`, since that's where the bad intersection happens and it covers all code paths that derive Union schemas. ## What changes are included in this PR? One change to `intersect_metadata_for_union` in `datafusion/expr/src/expr.rs`: branches with empty metadata are skipped during intersection instead of participating. Non-empty branches still intersect normally (conflicting values still get dropped). If every branch is empty, the result is empty — same as before. ## Are these changes tested? Added 7 unit tests for `intersect_metadata_for_union`: - Same metadata across branches — preserved - Conflicting non-empty values — dropped (existing behavior, unchanged) - One branch has metadata, other is empty — metadata preserved (the fix) - Empty branch comes first — still works - All branches empty — empty result - Mix of empty and conflicting non-empty — intersects only the non-empty ones - No inputs — empty result The full end-to-end reproduction needs Parquet files with field metadata as described in the issue. The unit tests cover the intersection logic directly. ## Are there any user-facing changes? No API changes. `UNION ALL` queries combining metadata-carrying sources with NULL literals will stop failing with schema mismatch errors.
1 parent c4562dc commit 69cd666

2 files changed

Lines changed: 96 additions & 7 deletions

File tree

datafusion/expr/src/expr.rs

Lines changed: 79 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -512,17 +512,26 @@ pub type SchemaFieldMetadata = std::collections::HashMap<String, String>;
512512
pub fn intersect_metadata_for_union<'a>(
513513
metadatas: impl IntoIterator<Item = &'a SchemaFieldMetadata>,
514514
) -> SchemaFieldMetadata {
515-
let mut metadatas = metadatas.into_iter();
516-
let Some(mut intersected) = metadatas.next().cloned() else {
517-
return Default::default();
518-
};
515+
let mut intersected: Option<SchemaFieldMetadata> = None;
519516

520517
for metadata in metadatas {
521-
// Only keep keys that exist in both with the same value
522-
intersected.retain(|k, v| metadata.get(k) == Some(v));
518+
// Skip empty metadata (e.g. from NULL literals or computed expressions)
519+
// to avoid dropping metadata from branches that have it.
520+
if metadata.is_empty() {
521+
continue;
522+
}
523+
match &mut intersected {
524+
None => {
525+
intersected = Some(metadata.clone());
526+
}
527+
Some(current) => {
528+
// Only keep keys that exist in both with the same value
529+
current.retain(|k, v| metadata.get(k) == Some(v));
530+
}
531+
}
523532
}
524533

525-
intersected
534+
intersected.unwrap_or_default()
526535
}
527536

528537
/// UNNEST expression.
@@ -4127,4 +4136,67 @@ mod test {
41274136
}
41284137
}
41294138
}
4139+
4140+
mod intersect_metadata_tests {
4141+
use super::super::intersect_metadata_for_union;
4142+
use std::collections::HashMap;
4143+
4144+
#[test]
4145+
fn all_branches_same_metadata() {
4146+
let m1 = HashMap::from([("key".into(), "val".into())]);
4147+
let m2 = HashMap::from([("key".into(), "val".into())]);
4148+
let result = intersect_metadata_for_union([&m1, &m2]);
4149+
assert_eq!(result, HashMap::from([("key".into(), "val".into())]));
4150+
}
4151+
4152+
#[test]
4153+
fn conflicting_metadata_dropped() {
4154+
let m1 = HashMap::from([("key".into(), "a".into())]);
4155+
let m2 = HashMap::from([("key".into(), "b".into())]);
4156+
let result = intersect_metadata_for_union([&m1, &m2]);
4157+
assert!(result.is_empty());
4158+
}
4159+
4160+
#[test]
4161+
fn empty_metadata_branch_skipped() {
4162+
let m1 = HashMap::from([("key".into(), "val".into())]);
4163+
let m2 = HashMap::new(); // e.g. NULL literal
4164+
let result = intersect_metadata_for_union([&m1, &m2]);
4165+
assert_eq!(result, HashMap::from([("key".into(), "val".into())]));
4166+
}
4167+
4168+
#[test]
4169+
fn empty_metadata_first_branch_skipped() {
4170+
let m1 = HashMap::new();
4171+
let m2 = HashMap::from([("key".into(), "val".into())]);
4172+
let result = intersect_metadata_for_union([&m1, &m2]);
4173+
assert_eq!(result, HashMap::from([("key".into(), "val".into())]));
4174+
}
4175+
4176+
#[test]
4177+
fn all_branches_empty_metadata() {
4178+
let m1: HashMap<String, String> = HashMap::new();
4179+
let m2: HashMap<String, String> = HashMap::new();
4180+
let result = intersect_metadata_for_union([&m1, &m2]);
4181+
assert!(result.is_empty());
4182+
}
4183+
4184+
#[test]
4185+
fn mixed_empty_and_conflicting() {
4186+
let m1 = HashMap::from([("key".into(), "a".into())]);
4187+
let m2 = HashMap::new();
4188+
let m3 = HashMap::from([("key".into(), "b".into())]);
4189+
let result = intersect_metadata_for_union([&m1, &m2, &m3]);
4190+
// m2 is skipped; m1 and m3 conflict → dropped
4191+
assert!(result.is_empty());
4192+
}
4193+
4194+
#[test]
4195+
fn no_inputs() {
4196+
let result = intersect_metadata_for_union(std::iter::empty::<
4197+
&HashMap<String, String>,
4198+
>());
4199+
assert!(result.is_empty());
4200+
}
4201+
}
41304202
}

datafusion/sqllogictest/test_files/metadata.slt

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -139,6 +139,23 @@ ORDER BY id, name, l_name;
139139
NULL bar NULL
140140
NULL NULL l_bar
141141

142+
# Regression test: UNION ALL + optimize_projections pruning unused columns causes
143+
# metadata loss when one branch has NULL literals (empty metadata) and the other
144+
# has field metadata. The optimizer prunes unused columns, triggering recompute_schema
145+
# which rebuilds the Union schema via intersect_metadata_for_union. Previously, this
146+
# intersection would drop metadata when any branch had empty metadata (from NULL literals).
147+
# See https://github.com/apache/datafusion/issues/19049
148+
query T
149+
SELECT name FROM (
150+
SELECT id, name FROM "table_with_metadata"
151+
UNION ALL
152+
SELECT NULL::int as id, NULL::string as name
153+
) GROUP BY name ORDER BY name;
154+
----
155+
bar
156+
baz
157+
NULL
158+
142159
# Regression test: missing field metadata from left side of the union when right side is chosen
143160
query T
144161
select name from (

0 commit comments

Comments
 (0)