[SPARK-47627][SQL] Add SQL MERGE syntax to enable schema evolution #45748
Closed
xupefei wants to merge 7 commits into apache:master from
n-young-db
approved these changes
Mar 28, 2024
Contributor
n-young-db
left a comment
Stamping, but you probably need a Spark committer to review this.
srielau
approved these changes
Apr 15, 2024
Member
@xupefei could you provide more details in the PR description? For example, what is the difference with/without
Contributor
Author
Hi @gengliangwang, I improved the PR description as you advised. Please have a look!
gengliangwang
approved these changes
Apr 17, 2024
Member
Thanks, merging to master
Member
cloud-fan
pushed a commit
that referenced
this pull request
Aug 22, 2025
…Data Source
### What changes were proposed in this pull request?
Add support for schema evolution for data sources that support MERGE INTO, which are currently V2 DataSources. This means that if the SOURCE table of a merge has a different schema than the TARGET table, the TARGET table can be updated automatically to take the new or different fields into account.
The basic idea is to add
- TableCapability.MERGE_SCHEMA_EVOLUTION, to indicate that a DSV2 table wants Spark to handle schema evolution for MERGE
- the ResolveMergeIntoSchemaEvolution rule, which generates DSV2 TableChanges and calls Catalog.alterTable
For any new field at the top level or in a nested struct, Spark will add the field to the end.
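The append-at-the-end behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: schemas are modelled as plain Python dicts (field name to type name, or to a nested dict for a struct), and the function name `evolve` is hypothetical. It is not Spark's `ResolveMergeIntoSchemaEvolution` rule and does not call any Spark API; the returned paths merely stand in for the idea of a list of add-column table changes.

```python
import copy


def evolve(target, source, path=()):
    """Append fields that exist only in `source` to the end of `target` (in place).

    Both arguments are toy schemas: dicts mapping a field name to either a
    primitive type name (str) or a nested dict representing a struct.
    Returns the paths of the added fields, in the order they were added.
    """
    added = []
    for name, src_type in source.items():
        field_path = path + (name,)
        if name not in target:
            # New field: dicts keep insertion order, so it lands at the end.
            target[name] = copy.deepcopy(src_type)
            added.append(field_path)
        elif isinstance(target[name], dict) and isinstance(src_type, dict):
            # Both sides are structs: recurse to find new nested fields.
            added.extend(evolve(target[name], src_type, field_path))
        # Differing primitive types are left untouched (no type widening here).
    return added


target = {"id": "int", "a": {"c": "int"}}
source = {"id": "int", "a": {"b": "int"}, "extra": "string"}
changes = evolve(target, source)
print(changes)            # [('a', 'b'), ('extra',)]
print(list(target["a"]))  # ['c', 'b']
```

Existing fields keep their position; only fields missing from the target are appended, both at the top level and inside nested structs.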
TODOs:
1. This currently does not support the case where SOURCE is missing a nested field that TARGET has, and there is an UPDATE or INSERT star.
Example:
```
MERGE INTO TARGET t USING SOURCE s
-- s = struct('a', struct('b', int))
-- t = struct('a', struct('c', int))
```
This will only work if the user explicitly specifies a value for the new nested field t.b in INSERT and UPDATE, i.e.
```
INSERT (s) VALUES (nested_struct('a', nested_struct('b', 1, 'c', 2)))
UPDATE SET a.b = 2
```
and not if they use INSERT * or UPDATE SET *.
2. Type widening is not allowed for the moment, as we still need to decide which widenings to allow.
Both items can be handled in a follow-up PR.
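To make the first TODO concrete, the following toy check lists nested TARGET fields that SOURCE does not provide, which is the situation in which `INSERT *` or `UPDATE SET *` is described as unsupported. It is a sketch only: schemas are modelled as nested Python dicts, the function name `missing_nested_fields` is hypothetical, and nothing here reflects how Spark itself detects the case.

```python
def missing_nested_fields(target, source, path=()):
    """Return dotted paths of nested target fields that are absent from source.

    Toy schemas: dicts mapping a field name to a primitive type name (str)
    or to a nested dict representing a struct. Top-level fields missing from
    the source are not reported; only fields inside structs are.
    """
    missing = []
    for name, tgt_type in target.items():
        field_path = path + (name,)
        if name not in source:
            if path:  # only report fields nested inside a struct
                missing.append(".".join(field_path))
        elif isinstance(tgt_type, dict) and isinstance(source[name], dict):
            missing.extend(missing_nested_fields(tgt_type, source[name], field_path))
    return missing


# The example from the TODO: s has a.b, t has a.c.
t = {"a": {"c": "int"}}
s = {"a": {"b": "int"}}
print(missing_nested_fields(t, s))  # ['a.c']
```

A non-empty result corresponds to the case where, per the description above, the user has to assign the nested fields explicitly instead of using a star action.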
### Why are the changes needed?
#45748 added the syntax 'WITH SCHEMA EVOLUTION' to 'MERGE INTO'. However, this requires an external Spark extension to resolve the merge, and doesn't do anything in Spark's native MERGE implementation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added many tests to MergeIntoTableSuiteBase
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #51698 from szehon-ho/merge_schema_evolution.
Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan
pushed a commit
that referenced
this pull request
Jan 13, 2026
### What changes were proposed in this pull request?
Similar to the [MERGE WITH SCHEMA EVOLUTION PR](#45748), **this PR introduces a syntax `WITH SCHEMA EVOLUTION` to the SQL `INSERT` command.** Since this syntax is not fully implemented for any table formats yet, **users will receive an exception if they try to use it.**
When `WITH SCHEMA EVOLUTION` is specified, schema evolution-related features must be turned on for this single statement and only in this statement.
**In this PR, Spark is only responsible for recognizing the existence or absence of the syntax `WITH SCHEMA EVOLUTION`**, and the recognition info is passed down from the `Analyzer`. When `WITH SCHEMA EVOLUTION` is detected, Spark sets the `mergeSchema` write option to `true` in the respective V2 Insert Command nodes.
Data sources must respect the syntax and give appropriate reactions: Turn on features that are categorised as "schema evolution" when the `WITH SCHEMA EVOLUTION` Syntax exists.
### Why are the changes needed?
This intuitive SQL Syntax allows the user to specify Automatic Schema Evolution for a specific `INSERT` operation.
Some users would like Schema Evolution for DML commands like `MERGE`, `INSERT`,... where the schema between the table and query relations can mismatch.
### Does this PR introduce _any_ user-facing change?
Yes, Introducing the SQL Syntax `WITH SCHEMA EVOLUTION` to SQL `INSERT`.
### How was this patch tested?
Added UTs.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #53732 from longvu-db/insert-schema-evolution.
Lead-authored-by: Thang Long VU <long.vu@databricks.com>
Co-authored-by: Thang Long Vu <107926660+longvu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
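The flag-to-option step described in this commit message (a detected `WITH SCHEMA EVOLUTION` clause resulting in `mergeSchema` being set to `true` on the insert command) can be sketched as follows. The class `InsertCommand` and the function `apply_schema_evolution` are hypothetical stand-ins invented for illustration; they are not Spark classes, and only the option name `mergeSchema` comes from the text above.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class InsertCommand:
    """Hypothetical stand-in for a V2 insert command node (not a Spark class)."""
    table: str
    with_schema_evolution: bool = False
    write_options: Dict[str, str] = field(default_factory=dict)


def apply_schema_evolution(cmd: InsertCommand) -> InsertCommand:
    """Return a copy carrying mergeSchema=true when the clause was present."""
    if not cmd.with_schema_evolution:
        return cmd  # clause absent: options are left exactly as they were
    options = dict(cmd.write_options)  # copy so the original node is unchanged
    options["mergeSchema"] = "true"
    return replace(cmd, write_options=options)


cmd = InsertCommand("tgt", with_schema_evolution=True, write_options={"x": "1"})
print(apply_schema_evolution(cmd).write_options)  # {'x': '1', 'mergeSchema': 'true'}
```

Because the option is attached to a single command node, it applies to that statement only, which matches the requirement that schema evolution be turned on "for this single statement and only in this statement".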
Why are the changes needed?
This PR introduces a syntax `WITH SCHEMA EVOLUTION` to the SQL MERGE command, which allows the user to specify automatic schema evolution for a specific operation.
`MERGE WITH SCHEMA EVOLUTION INTO tgt USING src ON ... WHEN ...`
When `WITH SCHEMA EVOLUTION` is specified, schema evolution-related features must be turned on for this single statement and only in this statement.
Spark is only responsible for recognizing the existence or absence of the syntax `WITH SCHEMA EVOLUTION`, and the result is passed down to the MERGE command. Data sources must respect the syntax and give appropriate reactions: turn on features that are categorised as "schema evolution" when the syntax does exist. For example, when the underlying table is Delta Lake, the feature "mergeSchema" will be turned on (see https://github.com/delta-io/delta/blob/c41977db3529a3139d6306abe5ded161f070982a/spark/src/main/scala/org/apache/spark/sql/delta/DeltaAnalysis.scala#L538).
Does this PR introduce any user-facing change?
Yes, see the previous section.
How was this patch tested?
New tests.
Was this patch authored or co-authored using generative AI tooling?
No.
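As an illustration of what "recognizing the existence or absence of the syntax" means for the `MERGE WITH SCHEMA EVOLUTION INTO ...` form shown in the description, here is a toy recognizer. It is a regular-expression sketch with a hypothetical function name, not Spark's SQL parser, and it only inspects the head of the statement.

```python
import re

# Matches MERGE, an optional WITH SCHEMA EVOLUTION clause, then INTO.
_MERGE_HEAD = re.compile(
    r"^\s*MERGE\s+(?P<evolve>WITH\s+SCHEMA\s+EVOLUTION\s+)?INTO\b",
    re.IGNORECASE,
)


def merge_requests_schema_evolution(sql: str) -> bool:
    """Return True if the MERGE statement carries WITH SCHEMA EVOLUTION."""
    match = _MERGE_HEAD.match(sql)
    if match is None:
        raise ValueError("not a MERGE INTO statement")
    return match.group("evolve") is not None


print(merge_requests_schema_evolution(
    "MERGE WITH SCHEMA EVOLUTION INTO tgt USING src ON tgt.id = src.id "
    "WHEN MATCHED THEN UPDATE SET *"
))  # True
print(merge_requests_schema_evolution(
    "MERGE INTO tgt USING src ON tgt.id = src.id WHEN MATCHED THEN UPDATE SET *"
))  # False
```

The boolean result is the only thing produced here; in the design described above, acting on it (for example, enabling "mergeSchema" in Delta Lake) is left to the data source.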