[SPARK-53482][SQL] MERGE INTO support for when source has less nested field than target by szehon-ho · Pull Request #52347 · apache/spark

szehon-ho · 2025-09-15T21:39:39Z

What changes were proposed in this pull request?

Support MERGE INTO where the source has fewer fields than the target. This is already partially supported as part of #51698, but only for top-level fields. This PR extends the support to nested fields (structs, including those nested within other structs, arrays, and maps).

This patch modifies the MERGE INTO assignment resolution to reuse the existing logic in TableOutputResolver, which resolves missing values in structs to null or the default.

UPDATE can also benefit from this, but that can be done in a subsequent PR.
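To illustrate the intended behavior, here is a minimal sketch in plain Scala, with no Spark dependency. The schema types (`DType`, `StructT`, `ArrayT`, `MapT`) and the `NestedFill.coerce` helper are hypothetical names made up for this example; they are not the TableOutputResolver API. The sketch only assumes that a source value is walked against the target type and that any target field absent from the source becomes null, including inside arrays and maps.

```scala
// Toy schema model. These types are hypothetical and are not Spark classes.
sealed trait DType
case object IntT extends DType
case object StrT extends DType
final case class StructT(fields: List[(String, DType)]) extends DType
final case class ArrayT(element: DType) extends DType
final case class MapT(value: DType) extends DType // keys are plain strings for brevity

object NestedFill {
  // Values are modeled as: struct = Map[String, Any], array = List[Any],
  // map = Map[String, Any], anything else = leaf value.
  def coerce(value: Any, target: DType): Any = (value, target) match {
    // A null source value stays null; it is not expanded into a struct of nulls.
    case (null, _) => null
    // Struct: emit every target field, using null where the source has no such field.
    case (m: Map[_, _], StructT(fields)) =>
      val src = m.asInstanceOf[Map[String, Any]]
      fields.map { case (name, t) => name -> coerce(src.getOrElse(name, null), t) }.toMap
    // Array: coerce each element against the target element type.
    case (xs: List[_], ArrayT(elem)) => xs.map(coerce(_, elem))
    // Map: coerce each value against the target value type.
    case (m: Map[_, _], MapT(v)) =>
      m.asInstanceOf[Map[String, Any]].map { case (k, x) => k -> coerce(x, v) }
    // Leaf values pass through unchanged.
    case (v, _) => v
  }
}
```

For example, coercing `Map("a" -> 1, "b" -> Map("x" -> 10))` against a target struct whose field `b` has children `x` and `y` yields the same row with `b.y` set to null.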

Why are the changes needed?

When the source has fewer fields than the target in MERGE INTO, the command should behave more gracefully by inserting null values where a source field does not exist.

Does this PR introduce any user-facing change?

No, except that this scenario used to fail and now passes.

The behavior is gated by a new flag, "spark.sql.merge.source.nested.type.coercion.enabled", which is enabled by default.

How was this patch tested?

Add unit test to MergeIntoTableSuiteBase

Was this patch authored or co-authored using generative AI tooling?

No

cloud-fan · 2025-09-16T08:11:13Z

Does it work for missing nested fields?

szehon-ho · 2025-10-28T21:35:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala


 object TableOutputResolver extends SQLConfHelper with Logging {

+  object DefaultValueFillMode extends Enumeration {


Unfortunately, we need to distinguish between filling top-level defaults and filling nested defaults within structs.

Normal V2 writes do not expect nested type coercion of the source. For example, writing into a target DataFrame from a source DataFrame that has a struct column with fewer fields does not work today.

This path goes through Analyzer.ResolveOutputRelation, which calls resolveOutputColumns() => reorderColumnsByName() => resolveStruct/Array/MapType().

Row-level operations, MERGE INTO in particular, want recursive support for coercing source struct columns that have fewer fields than the target column.

This path goes through resolveUpdate => resolveStruct/Array/MapType().

Hence, we need a three-way enum here to distinguish the three cases (none, first-level, recurse).
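As a rough illustration of the three cases, here is a self-contained sketch in plain Scala. Only the three value names (FILL, RECURSE, NONE) come from the diff; the `FillModeSketch` object, the `Field` schema model, the `resolve` function, and in particular the `childMode` rule (FILL applies to one level only, RECURSE propagates) are assumptions made for this example and may not match what the patch does.

```scala
object FillModeSketch {
  object FillMode extends Enumeration {
    val FILL, RECURSE, NONE = Value
  }
  import FillMode._

  // Assumed rule: FILL covers only the level it is applied to,
  // while RECURSE is carried down into nested structs.
  def childMode(mode: FillMode.Value): FillMode.Value =
    if (mode == RECURSE) RECURSE else NONE

  // Toy schema: a field is a leaf (children = None) or a struct (children = Some(...)).
  final case class Field(name: String, children: Option[List[Field]] = None)

  // Returns Right(row) with missing fields set to null where the mode allows it,
  // or Left(path) naming the first missing field that may not be filled.
  def resolve(
      src: Map[String, Any],
      target: List[Field],
      mode: FillMode.Value,
      path: String = ""): Either[String, Map[String, Any]] = {
    target.foldLeft[Either[String, Map[String, Any]]](Right(Map.empty)) { (acc, f) =>
      acc.flatMap { out =>
        (src.get(f.name), f.children) match {
          // Missing field and filling is not allowed at this level: report it.
          case (None, _) if mode == NONE => Left(path + f.name)
          // Missing field and filling is allowed: use null.
          case (None, _) => Right(out + (f.name -> null))
          // Nested struct: recurse with the mode derived for the child level.
          case (Some(m: Map[_, _]), Some(kids)) =>
            resolve(m.asInstanceOf[Map[String, Any]], kids, childMode(mode), path + f.name + ".")
              .map(v => out + (f.name -> v))
          // Present leaf (or null struct): copy as is.
          case (Some(v), _) => Right(out + (f.name -> v))
        }
      }
    }
  }
}
```

Under these assumptions, a source row that lacks both a top-level column and a nested field resolves fully only with RECURSE; with FILL the top-level column is filled but the missing nested field is reported, and with NONE any missing field is reported.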

szehon-ho · 2025-10-28T23:39:03Z

@cloud-fan all the tests are fixed, can you take another look? Thanks!

...rc/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveRowLevelCommandAssignments.scala

cloud-fan · 2025-10-29T11:17:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala

+  object DefaultValueFillMode extends Enumeration {
+    val FILL, RECURSE, NONE = Value
+
+    def getChildMode(mode: DefaultValueFillMode.Value): DefaultValueFillMode.Value = {


what does this mean?

Ah, we need a flag for whether to recurse. See: #52347 (comment)

It's a bit hard to do because the method reorderColumnsByName doesn't recurse directly into itself, but only indirectly via resolveStruct/Map/ArrayType.

I removed this method in the latest patch; hopefully it's cleaner.

cloud-fan · 2025-10-30T05:07:09Z

thanks, merging to master!

…s for UPDATE SET * when source struct has less nested fields than target struct

### What changes were proposed in this pull request?

Introduce a new flag spark.sql.merge.nested.type.assign.by.field that allows the UPDATE SET * action in MERGE INTO to act as shorthand for assigning every nested struct field to its existing source counterpart (i.e., UPDATE SET a.b.c = source.a.b.c). As a consequence, existing struct fields in the target table that have no source equivalent are preserved when the corresponding source struct has fewer fields than the target. Additional code is added to prevent null expansion in this case (i.e., a null source struct expanding to a struct of nulls).

### Why are the changes needed?

Following #52347, we now allow MERGE INTO to have a source table struct with fewer nested fields than the target table struct. In this scenario, a user issuing UPDATE SET * may have one of two interpretations in mind.

The user may interpret UPDATE SET * as shorthand for assigning every top-level column, i.e., UPDATE SET struct = source.struct. The target struct is then set to the source struct object as is, with missing fields as NULL. This is the current behavior.

The user may also mean that UPDATE SET * is shorthand for assigning every nested struct field (i.e., UPDATE SET struct.a.b = source.struct.a.b), in which case the target struct fields missing in the source are retained. This is similar to UPDATE SET * not overriding existing target columns that are missing in the source. This flag is added for that case.

### Does this PR introduce _any_ user-facing change?

No. The support for source structs with fewer fields than target structs in MERGE INTO is not yet released (#52347), and in any case there is a flag to toggle this functionality.

### How was this patch tested?

Unit tests, especially around cases where the source struct is null.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #53149 from szehon-ho/merge_schema_evolution_update_nested.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
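The two readings of UPDATE SET * described in this commit message can be contrasted with a small plain-Scala sketch. Everything here is hypothetical: `UpdateSetStarSketch`, `assignWhole`, and `assignByField` are illustrative names, rows are modeled as nested `Map[String, Any]`, and the target row's own keys stand in for the target schema. The null-struct handling (a null source struct is assigned as null rather than expanded) is an assumption based on the description above, not a statement about the actual implementation.

```scala
object UpdateSetStarSketch {
  type Struct = Map[String, Any]

  private def asStruct(m: Map[_, _]): Struct = m.asInstanceOf[Struct]

  // Reading 1: UPDATE SET struct = source.struct.
  // The source struct replaces the target struct; target fields missing in the source become null.
  def assignWhole(target: Struct, source: Struct): Struct =
    target.map { case (name, old) =>
      val value: Any = (old, source.get(name)) match {
        case (t: Map[_, _], Some(s: Map[_, _])) => assignWhole(asStruct(t), asStruct(s))
        case (_, Some(v)) => v
        case (_, None) => null
      }
      name -> value
    }

  // Reading 2: UPDATE SET struct.a = source.struct.a for every field the source has.
  // Target fields missing in the source keep their existing value.
  def assignByField(target: Struct, source: Struct): Struct =
    target.map { case (name, old) =>
      val value: Any = (old, source.get(name)) match {
        case (t: Map[_, _], Some(s: Map[_, _])) => assignByField(asStruct(t), asStruct(s))
        // Includes a null source struct: it is assigned as null, not expanded field by field.
        case (_, Some(v)) => v
        case (_, None) => old
      }
      name -> value
    }
}
```

With a target row `Map("id" -> 1, "s" -> Map("a" -> 1, "b" -> 2))` and a source row `Map("id" -> 1, "s" -> Map("a" -> 10))`, the first reading produces `s = Map("a" -> 10, "b" -> null)` while the second keeps `b = 2`.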

…s for UPDATE SET * when source struct has less nested fields than target struct

(cherry picked from commit 966e053)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

… field than target

Closes apache#52347 from szehon-ho/nested_merge_round_3.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

aokolnychyi · 2025-11-27T06:12:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .createWithDefault(true)

+  val MERGE_INTO_SOURCE_NESTED_TYPE_COERCION_ENABLED =
+    buildConf("spark.sql.merge.source.nested.type.coercion.enabled")


This doesn't seem to align with the naming pattern we use elsewhere.

+1 for @aokolnychyi 's comment. Let me make a PR for this.

dongjoon-hyun · 2025-12-11T00:35:26Z

I made a follow-up to rename the config name.

[SPARK-53482][SQL][FOLLOWUP] Rename spark.sql.merge(.nested.type.coercion.enabled -> NestedTypeCoercion.enabled) #53434

github-actions bot added the SQL label Sep 15, 2025

szehon-ho mentioned this pull request Sep 15, 2025

[SPARK-53482][SQL] MERGE INTO support nested case where source has less fields than target #52225

Closed

szehon-ho mentioned this pull request Sep 16, 2025

[SPARK-53546][TESTS][FOLLOW-UP] Fix nested array schema evolution and style for InMemoryBaseTable #52359

Closed

szehon-ho added 3 commits October 23, 2025 14:37

[SPARK-53482][SQL] MERGE INTO support for when source has less nested…

165ae51

… field than target

Add test

500b011

rebase and fix tests

4ed2f04

szehon-ho force-pushed the nested_merge_round_3 branch from 24136bf to 4ed2f04 Compare October 27, 2025 19:03

szehon-ho added 2 commits October 27, 2025 14:50

Fix more tests

62cce3e

Fix more tests

9a943f3

szehon-ho commented Oct 28, 2025

cloud-fan reviewed Oct 29, 2025

...rc/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveRowLevelCommandAssignments.scala

cloud-fan reviewed Oct 29, 2025

...rc/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveRowLevelCommandAssignments.scala (Outdated)

cloud-fan reviewed Oct 29, 2025

Review comments and add more tests

bebdac5

szehon-ho force-pushed the nested_merge_round_3 branch from 20d0057 to bebdac5 Compare October 29, 2025 23:33

Lint

01c536e

cloud-fan approved these changes Oct 30, 2025

cloud-fan closed this in 58e5768 Oct 30, 2025

manuzhang mentioned this pull request Nov 19, 2025

Spark: Add support for Spark 4.1 apache/iceberg#14155

Merged

szehon-ho mentioned this pull request Nov 21, 2025

[SPARK-54289][SQL] Allow MERGE INTO to preserve existing struct fields for UPDATE SET * when source struct has less nested fields than target struct #53149

Closed

aokolnychyi reviewed Nov 27, 2025

dongjoon-hyun mentioned this pull request Dec 11, 2025

[SPARK-53482][SQL][FOLLOWUP] Rename spark.sql.merge(.nested.type.coercion.enabled -> NestedTypeCoercion.enabled) #53434

Closed

szehon-ho wants to merge 7 commits into apache:master from szehon-ho:nested_merge_round_3

4 participants

