[SPARK-53482][SQL] MERGE INTO support for when source has less nested field than target #52347
szehon-ho wants to merge 7 commits into apache:master
Does it work for missing nested fields?
24136bf to 4ed2f04
object TableOutputResolver extends SQLConfHelper with Logging {
object DefaultValueFillMode extends Enumeration {
Unfortunately, we need to distinguish between filling top-level defaults and filling nested defaults within structs.
- The normal V2Writes path does not expect nested type coercion of the source. For example, writing into a target dataframe from a source dataframe that has a struct column with fewer fields does not work today.
  This goes through Analyzer.ResolveOutputRelation, which calls resolveOutputColumns() => reorderColumnsByName() => resolveStruct/Array/MapType()
- RowLevelOperations, in particular MERGE INTO, want recursive support to coerce source struct columns that have fewer fields than the target column.
  This goes through resolveUpdate => resolveStruct/Array/MapType()
Hence, we need a three-way enum here to distinguish the three cases (none, first-level, recurse).
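To make the three cases concrete, here is a minimal, self-contained sketch of how such a mode could propagate while descending into nested types. It reuses the enum values from this PR (FILL, RECURSE, NONE), but the mapping of FILL to "first level only" and the helpers childMode, fillsMissing, and fillsAtDepth are assumptions for illustration, not the actual TableOutputResolver code.

```scala
// Hypothetical sketch: names mirror the discussion, but this is not Spark's code.
object FillModeSketch {
  object DefaultValueFillMode extends Enumeration {
    val FILL, RECURSE, NONE = Value
  }
  import DefaultValueFillMode._

  // Mode to use when descending from a column into its nested fields
  // (assumed semantics: FILL applies to the first level only).
  def childMode(mode: DefaultValueFillMode.Value): DefaultValueFillMode.Value =
    mode match {
      case RECURSE => RECURSE // MERGE INTO path: keep filling at every depth
      case FILL    => NONE    // plain V2 write path: top-level columns only
      case NONE    => NONE    // never fill
    }

  // Whether a field missing at the current level may be filled with null/default.
  def fillsMissing(mode: DefaultValueFillMode.Value): Boolean = mode != NONE

  // Whether a field missing at `depth` (0 = top-level column) gets filled
  // when resolution starts in `mode` at the top level.
  def fillsAtDepth(mode: DefaultValueFillMode.Value, depth: Int): Boolean =
    if (depth == 0) fillsMissing(mode)
    else fillsAtDepth(childMode(mode), depth - 1)
}
```

Under these assumptions, fillsAtDepth(FILL, 1) is false while fillsAtDepth(RECURSE, 1) is true, which is the distinction between the V2 write path and the MERGE INTO path described above.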
@cloud-fan all the tests are fixed, can you take another look? Thanks!
...rc/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveRowLevelCommandAssignments.scala
object DefaultValueFillMode extends Enumeration {
  val FILL, RECURSE, NONE = Value

def getChildMode(mode: DefaultValueFillMode.Value): DefaultValueFillMode.Value = {
Ah, we need a flag for whether to recurse. See: #52347 (comment)
It's a bit hard to do because the method reorderColumnsByName doesn't recurse directly into itself, but indirectly via resolveStruct/Map/ArrayType.
I removed this method in the latest patch; hopefully it's cleaner.
20d0057 to bebdac5
thanks, merging to master!
…s for UPDATE SET * when source struct has less nested fields than target struct

### What changes were proposed in this pull request?
Introduce a new flag, spark.sql.merge.nested.type.assign.by.field, that allows the UPDATE SET * action in MERGE INTO to act as shorthand for assigning every nested struct field to its existing source counterpart (i.e., UPDATE SET a.b.c = source.a.b.c). As a consequence, existing struct fields in the target table that have no source equivalent are preserved when the corresponding source struct has fewer fields than the target. Additional code is added to prevent null expansion in this case (i.e., a null source struct expanding to a struct of nulls).

### Why are the changes needed?
Following #52347, we now allow MERGE INTO to have a source table struct with fewer nested fields than the target table struct. In this scenario, a user issuing UPDATE SET * may intend one of two interpretations.
- The user may interpret UPDATE SET * as shorthand for assigning every top-level column, i.e., UPDATE SET struct = source.struct. The target struct is then set to the source struct object as is, with missing fields as NULL. This is the current behavior.
- The user may also mean that UPDATE SET * is shorthand for assigning every nested struct field (i.e., UPDATE SET struct.a.b = source.struct.a.b), in which case the target struct fields missing in the source are retained. This is similar to UPDATE SET * not overriding existing target columns that are missing in the source. The flag is added for this case.

### Does this PR introduce _any_ user-facing change?
No. The support for source structs with fewer fields than target structs in MERGE INTO is not yet released (#52347), and in any case there is a flag to toggle this functionality.

### How was this patch tested?
Unit tests, especially around cases where the source struct is null.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #53149 from szehon-ho/merge_schema_evolution_update_nested.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
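The two interpretations of UPDATE SET * described in this commit message can be illustrated with a small model that represents structs as Scala maps. This is only a sketch under stated assumptions: assignWholeStruct and assignByField are hypothetical names rather than Spark APIs, SQL NULL is modeled as Scala null, and the special handling of a null source struct is not modeled.

```scala
// Hypothetical model: a struct is a Map from field name to value; null stands for SQL NULL.
object UpdateStarSketch {
  type Struct = Map[String, Any]

  // Interpretation 1 (UPDATE SET s = source.s): the whole struct is replaced,
  // so target fields that the source lacks become null.
  def assignWholeStruct(target: Struct, source: Struct): Struct =
    target.map { case (name, _) => name -> source.getOrElse(name, null) }

  // Interpretation 2 (UPDATE SET s.f = source.s.f for every field the source has):
  // target fields that the source lacks keep their existing values.
  def assignByField(target: Struct, source: Struct): Struct =
    target.map { case (name, old) =>
      (old, source.get(name)) match {
        // Both sides are structs: descend and assign field by field.
        case (o: Map[_, _], Some(n: Map[_, _])) =>
          name -> assignByField(o.asInstanceOf[Struct], n.asInstanceOf[Struct])
        // Source provides a value: take it.
        case (_, Some(n)) => name -> n
        // Source has no such field: retain the target value.
        case (_, None) => name -> old
      }
    }
}
```

For a target struct Map("a" -> 1, "b" -> 2) and a source struct Map("a" -> 10), the first function sets b to null while the second keeps b as 2.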
(cherry picked from commit 966e053)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… field than target

### What changes were proposed in this pull request?
Support MERGE INTO where the source has fewer fields than the target. This is already partially supported as part of apache#51698, but only for top-level fields. This patch extends the support to nested fields (structs, including those within other structs, arrays, and maps).

This patch modifies the MERGE INTO assignment to reuse existing logic in TableOutputResolver to resolve empty values in structs to null or default. UPDATE can also benefit from this, but we can do it in a subsequent PR.

### Why are the changes needed?
For cases where the source has fewer fields than the target in MERGE INTO, it should behave more gracefully (inserting null values where a source field does not exist).

### Does this PR introduce _any_ user-facing change?
No, only that this scenario used to fail and will now pass. This is gated by a new flag, "spark.sql.merge.source.nested.type.coercion.enabled", which is enabled by default.

### How was this patch tested?
Add unit test to MergeIntoTableSuiteBase

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#52347 from szehon-ho/nested_merge_round_3.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
.createWithDefault(true)

val MERGE_INTO_SOURCE_NESTED_TYPE_COERCION_ENABLED =
  buildConf("spark.sql.merge.source.nested.type.coercion.enabled")
This doesn't seem to align with the naming pattern we use elsewhere.
+1 for @aokolnychyi's comment. Let me make a PR for this.
I made a follow-up to rename the config.
What changes were proposed in this pull request?
Support MERGE INTO where the source has fewer fields than the target. This is already partially supported as part of #51698, but only for top-level fields. This PR extends the support to nested fields (structs, including those within other structs, arrays, and maps).
This patch modifies the MERGE INTO assignment to reuse existing logic in TableOutputResolver to resolve empty values in structs to null or default.
UPDATE can also benefit from this, but we can do it in a subsequent PR.
Why are the changes needed?
For cases where the source has fewer fields than the target in MERGE INTO, it should behave more gracefully (inserting null values where a source field does not exist).
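As an illustration of the intended behavior, the following sketch fills fields missing from a source value with null, recursing through structs, arrays, and map values. The DType model and the coerce function are hypothetical stand-ins, not Spark's DataType or TableOutputResolver, and column default values are simplified to null.

```scala
// Hypothetical model of nested null-filling; not Spark's actual types or resolver.
object NestedFillSketch {
  sealed trait DType
  case object IntT extends DType
  case object StrT extends DType
  final case class StructT(fields: List[(String, DType)]) extends DType
  final case class ArrayT(element: DType) extends DType
  final case class MapT(value: DType) extends DType

  // Values: structs and maps are Map[String, Any], arrays are List[Any],
  // and null stands for SQL NULL.
  def coerce(value: Any, target: DType): Any = (value, target) match {
    case (null, _) => null // a null source value stays null
    case (m: Map[_, _], StructT(fields)) =>
      val src = m.asInstanceOf[Map[String, Any]]
      // Walk the target fields so that fields absent from the source become null.
      fields.map { case (name, t) => name -> coerce(src.getOrElse(name, null), t) }.toMap
    case (xs: List[_], ArrayT(elem)) =>
      xs.map(coerce(_, elem)) // coerce every array element
    case (m: Map[_, _], MapT(valueT)) =>
      m.asInstanceOf[Map[String, Any]].map { case (k, v) => k -> coerce(v, valueT) }
    case (v, _) => v // primitives pass through unchanged
  }
}
```

With a target schema whose addr struct has fields city and zip, a source row carrying only addr.city comes out with addr.zip set to null, and the same applies to structs nested inside arrays and map values.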
Does this PR introduce any user-facing change?
No, only that this scenario used to fail and will now pass.
This is gated by a new flag, "spark.sql.merge.source.nested.type.coercion.enabled", which is enabled by default.
How was this patch tested?
Add unit test to MergeIntoTableSuiteBase
Was this patch authored or co-authored using generative AI tooling?
No