[SPARK-53482][SQL] MERGE INTO support for when source has less nested field than target #52347

Closed
szehon-ho wants to merge 7 commits into apache:master from szehon-ho:nested_merge_round_3

@szehon-ho
Member

@szehon-ho szehon-ho commented Sep 15, 2025

What changes were proposed in this pull request?

Support MERGE INTO where the source has fewer fields than the target. This is already partially supported as part of #51698, but only for top-level fields. This PR extends the support to nested fields (structs, including those within other structs, arrays, and maps).

This patch modifies the MERGE INTO assignment to re-use existing logic in TableOutputResolver to resolve empty values in structs to null or default.

UPDATE can also benefit from this, but we can do it in a subsequent PR.

Why are the changes needed?

For cases where the source has fewer fields than the target in MERGE INTO, it should behave more gracefully (inserting null values where the source field does not exist).

Does this PR introduce any user-facing change?

No, only that this scenario used to fail and will now pass.

This gates on a new flag: "spark.sql.merge.source.nested.type.coercion.enabled", enabled by default.

How was this patch tested?

Add unit test to MergeIntoTableSuiteBase

Was this patch authored or co-authored using generative AI tooling?

No

@cloud-fan
Contributor

Does it work for missing nested fields?

@szehon-ho szehon-ho force-pushed the nested_merge_round_3 branch from 24136bf to 4ed2f04 on October 27, 2025 19:03

object TableOutputResolver extends SQLConfHelper with Logging {

object DefaultValueFillMode extends Enumeration {
Member Author

Unfortunately, we need to distinguish between filling top-level defaults and filling nested defaults within structs.

  1. Normal V2 writes do not expect nested type coercion of the source. For example, writing into a target dataframe from a source dataframe whose struct column has fewer fields does not work today.

     This goes through Analyzer.ResolveOutputRelation, which calls resolveOutputColumns() => reorderColumnsByName() => resolveStruct/Array/MapType()

  2. RowLevelOperations, in particular MERGE INTO, want recursive support to coerce source struct columns that have fewer fields than the target column.

     This goes through resolveUpdate => resolveStruct/Array/MapType()

Hence we need a three-way enum here to distinguish the three cases (none, first-level, recurse).
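
The three cases in the comment above can be illustrated with a small standalone sketch. It does not use Spark's classes: `DType`, `StructT`, `ArrayT`, `FillMode`, and `NestedFill.filledPaths` are hypothetical names invented for this illustration, and the mapping of FILL to "first-level only" is a reading of the comment, not the actual TableOutputResolver code. The function reports which target fields would be filled with null or a default, or which missing field the mode rejects.

```scala
// Toy schema model; these are NOT Spark's DataType classes.
sealed trait DType
case object IntT extends DType
case object StrT extends DType
final case class StructT(fields: List[(String, DType)]) extends DType
final case class ArrayT(element: DType) extends DType

object FillMode extends Enumeration {
  // NONE: never fill; FILL: fill missing top-level columns only;
  // RECURSE: also fill fields missing inside nested structs.
  val NONE, FILL, RECURSE = Value
}

object NestedFill {
  /** Returns the dotted paths of target fields that would be filled,
    * or Left with the first missing field the mode does not allow. */
  def filledPaths(
      source: DType,
      target: DType,
      mode: FillMode.Value,
      path: List[String] = Nil): Either[String, List[String]] =
    (source, target) match {
      case (StructT(src), StructT(tgt)) =>
        val srcByName = src.toMap
        val nested = path.nonEmpty
        val canFill =
          mode == FillMode.RECURSE || (mode == FillMode.FILL && !nested)
        tgt.foldLeft[Either[String, List[String]]](Right(Nil)) {
          case (Left(err), _) => Left(err)
          case (Right(acc), (name, tgtType)) =>
            val here = path :+ name
            srcByName.get(name) match {
              // Field exists in the source: descend into it.
              case Some(srcType) =>
                filledPaths(srcType, tgtType, mode, here).map(acc ++ _)
              // Field is missing and the mode permits filling at this depth.
              case None if canFill => Right(acc :+ here.mkString("."))
              case None => Left(s"missing field ${here.mkString(".")}")
            }
        }
      case (ArrayT(s), ArrayT(t)) =>
        filledPaths(s, t, mode, path :+ "element")
      case (s, t) if s == t => Right(Nil)
      case (s, t) =>
        Left(s"type mismatch at ${path.mkString(".")}: $s vs $t")
    }
}

object FillDemo {
  // The target has a nested field s.b and a top-level column extra
  // that the source lacks.
  val target: StructT = StructT(List(
    "id" -> IntT,
    "s" -> StructT(List("a" -> IntT, "b" -> StrT)),
    "extra" -> StrT))
  val source: StructT = StructT(List(
    "id" -> IntT,
    "s" -> StructT(List("a" -> IntT))))
}
```

With these sample schemas, RECURSE fills both `s.b` and `extra`, while FILL and NONE both stop at the nested gap `s.b`. That difference is why a plain on/off flag would not be enough to separate the V2 write path from the MERGE INTO path.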

@szehon-ho
Member Author

@cloud-fan all the tests are fixed, can you take another look? Thanks!

object DefaultValueFillMode extends Enumeration {
val FILL, RECURSE, NONE = Value

def getChildMode(mode: DefaultValueFillMode.Value): DefaultValueFillMode.Value = {
Contributor

what does this mean?

Member Author

@szehon-ho szehon-ho Oct 29, 2025

Ah, we need a flag for whether to recurse. See: #52347 (comment)

It's a bit hard to do, as the method reorderColumnsByName doesn't recurse directly into itself, but indirectly via resolveStruct/Map/ArrayType.

I removed this method in the latest patch; hopefully it's cleaner.

@szehon-ho szehon-ho force-pushed the nested_merge_round_3 branch from 20d0057 to bebdac5 on October 29, 2025 23:33
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 58e5768 Oct 30, 2025
dongjoon-hyun pushed a commit that referenced this pull request Nov 22, 2025
…s for UPDATE SET * when source struct has less nested fields than target struct

### What changes were proposed in this pull request?
Introduce a new flag spark.sql.merge.nested.type.assign.by.field that allows the UPDATE SET * action in MERGE INTO to be shorthand for assigning every nested struct field to its existing source counterpart (i.e., UPDATE SET a.b.c = source.a.b.c).  This implies that existing struct fields in the target table that have no source equivalent are preserved when the corresponding source struct has fewer fields than the target.

Additional code is added to prevent null expansion in this case (i.e., a null source struct expanding to a struct of nulls).

### Why are the changes needed?
Following #52347, we now allow MERGE INTO to have a source table struct with fewer nested fields than the target table struct.  In this scenario, a user issuing UPDATE SET * may intend one of two interpretations.

The user may interpret UPDATE SET * as shorthand for assigning every top-level column, i.e., UPDATE SET struct = source.struct; the target struct is then set to the source struct object as is, with missing fields as NULL.  This is the current behavior.

The user may also mean that UPDATE SET * is shorthand for assigning every nested struct field (i.e., UPDATE SET struct.a.b = source.struct.a.b), in which case the target struct fields missing in the source are retained.  This is similar to UPDATE SET * not overriding existing target columns missing in the source, for example.  This flag is added for that case.

### Does this PR introduce _any_ user-facing change?
No, the support for source structs with fewer fields than target structs in MERGE INTO is not yet released (#52347), and in any case there is a flag to toggle this functionality.

### How was this patch tested?
Unit tests, especially around cases where the source struct is null.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #53149 from szehon-ho/merge_schema_evolution_update_nested.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
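
The two readings of UPDATE SET * described in the commit message above can be contrasted with a small standalone model. It is only a sketch: `V`, `NullV`, `Leaf`, `StructV`, and the `UpdateSetStar` functions are hypothetical names, not Spark APIs. It assumes each target struct value spells out every target field, and it assumes that "preventing null expansion" means a null source struct yields a null result, which is one reading of the commit message.

```scala
// Toy row values; these are NOT Spark's InternalRow or Row types.
sealed trait V
case object NullV extends V
final case class Leaf(value: Any) extends V
final case class StructV(fields: Map[String, V]) extends V

object UpdateSetStar {
  /** Whole-struct reading (UPDATE SET struct = source.struct):
    * target fields absent from the source become NULL. */
  def assignWhole(target: V, source: V): V = (target, source) match {
    case (StructV(t), StructV(s)) =>
      StructV(t.map { case (k, tv) =>
        k -> s.get(k).map(assignWhole(tv, _)).getOrElse(NullV)
      })
    // Leaves and null sources replace the target value as is.
    case (_, sv) => sv
  }

  /** Field-by-field reading (UPDATE SET struct.a.b = source.struct.a.b):
    * target fields absent from the source keep their current value. */
  def assignByField(target: V, source: V): V = (target, source) match {
    case (StructV(t), StructV(s)) =>
      StructV(t.map { case (k, tv) =>
        k -> s.get(k).map(assignByField(tv, _)).getOrElse(tv)
      })
    // Assumption: a null source struct stays null and is not
    // expanded into a struct of nulls.
    case (_, sv) => sv
  }
}

object UpdateDemo {
  // The target struct has field b, which the source struct lacks.
  val target: V = StructV(Map("a" -> Leaf(1), "b" -> Leaf("keep")))
  val source: V = StructV(Map("a" -> Leaf(2)))
}
```

For the sample values, the whole-struct reading yields a = 2 with b set to NULL, while the field-by-field reading yields a = 2 and leaves b as "keep". Under the stated assumption, a null source struct gives a null result in both readings.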
dongjoon-hyun pushed a commit that referenced this pull request Nov 22, 2025
…s for UPDATE SET * when source struct has less nested fields than target struct

(cherry picked from commit 966e053)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 25, 2025
…s for UPDATE SET * when source struct has less nested fields than target struct

huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
… field than target

Closes apache#52347 from szehon-ho/nested_merge_round_3.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
…s for UPDATE SET * when source struct has less nested fields than target struct

.createWithDefault(true)

val MERGE_INTO_SOURCE_NESTED_TYPE_COERCION_ENABLED =
buildConf("spark.sql.merge.source.nested.type.coercion.enabled")
Contributor

@aokolnychyi aokolnychyi Nov 27, 2025

This doesn't seem to align with the naming pattern we use elsewhere.

Member

+1 for @aokolnychyi 's comment. Let me make a PR for this.

@dongjoon-hyun
Member
