[SPARK-47627][SQL] Add SQL MERGE syntax to enable schema evolution#45748

Closed
xupefei wants to merge 7 commits into apache:master from xupefei:merge-schema-evolution

Conversation

@xupefei
Contributor

@xupefei xupefei commented Mar 28, 2024

Why are the changes needed?

This PR introduces a syntax WITH SCHEMA EVOLUTION to the SQL MERGE command, which allows the user to specify automatic schema evolution for a specific operation.

MERGE WITH SCHEMA EVOLUTION
INTO tgt
USING src
ON ...
WHEN ...

When WITH SCHEMA EVOLUTION is specified, schema evolution-related features must be turned on for this single statement and only in this statement.

Spark is only responsible for recognizing the presence or absence of the WITH SCHEMA EVOLUTION syntax; the result is passed down to the MERGE command. Data sources must respect the syntax and respond accordingly by turning on the features categorised as "schema evolution" when the syntax is present. For example, when the underlying table is Delta Lake, the "mergeSchema" feature will be turned on (see https://github.com/delta-io/delta/blob/c41977db3529a3139d6306abe5ded161f070982a/spark/src/main/scala/org/apache/spark/sql/delta/DeltaAnalysis.scala#L538).
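
As a rough illustration of the division of labour described above, the sketch below recognizes the optional clause at the head of a MERGE statement and records it as a boolean for downstream code to act on. It is a toy model only: Spark's real parser is grammar-based, and `MergeCommand`, `parse_merge_head`, and the regular expression are hypothetical names invented for this example.

```python
import re
from dataclasses import dataclass


# Hypothetical stand-in for a parsed MERGE command; not a Spark class.
@dataclass(frozen=True)
class MergeCommand:
    target: str
    with_schema_evolution: bool


# Matches "MERGE [WITH SCHEMA EVOLUTION] INTO <target>", case-insensitively.
_MERGE_HEAD = re.compile(
    r"^\s*MERGE\s+(?P<evo>WITH\s+SCHEMA\s+EVOLUTION\s+)?INTO\s+(?P<target>[\w.]+)",
    re.IGNORECASE,
)


def parse_merge_head(sql: str) -> MergeCommand:
    """Record only whether the clause is present; what to do about it is
    left to the data source that receives the flag."""
    m = _MERGE_HEAD.match(sql)
    if m is None:
        raise ValueError("not a MERGE statement")
    return MergeCommand(
        target=m.group("target"),
        with_schema_evolution=m.group("evo") is not None,
    )


evolving = parse_merge_head("MERGE WITH SCHEMA EVOLUTION\nINTO tgt\nUSING src\nON ...")
plain = parse_merge_head("MERGE INTO tgt USING src ON ...")
```

Here `evolving.with_schema_evolution` is `True` and `plain.with_schema_evolution` is `False`; the flag applies to that one statement and nothing else.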

Does this PR introduce any user-facing change?

Yes, see the previous section.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Contributor

@n-young-db n-young-db left a comment

Stamping, but you probably need a Spark committer to review this.

@xupefei xupefei changed the title [WIP][SPARK-47627] Add SQL MERGE syntax to enable schema evolution [SPARK-47627] Add SQL MERGE syntax to enable schema evolution Mar 28, 2024
@HyukjinKwon HyukjinKwon changed the title [SPARK-47627] Add SQL MERGE syntax to enable schema evolution [SPARK-47627][SQL] Add SQL MERGE syntax to enable schema evolution Mar 29, 2024
@gengliangwang
Member

@xupefei could you provide more details in the PR description? For example, what is the difference with/without WITH SCHEMA EVOLUTION?

@xupefei
Contributor Author

xupefei commented Apr 16, 2024

> @xupefei could you provide more details in the PR description? For example, what is the difference with/without WITH SCHEMA EVOLUTION

Hi @gengliangwang, I improved the PR description as you advised. Please have a look!

@gengliangwang
Member

Thanks, merging to master

@dongjoon-hyun
Member

cc @huaxingao , @RussellSpitzer

@xupefei xupefei deleted the merge-schema-evolution branch August 30, 2024 14:49
cloud-fan pushed a commit that referenced this pull request Aug 22, 2025
…Data Source

### What changes were proposed in this pull request?
Add support for schema evolution for data sources that support MERGE INTO, which currently means V2 data sources. If the SOURCE table of a MERGE has a different schema than the TARGET table, the TARGET table can be updated automatically to account for the new or different fields.

The basic idea is to add

- TableCapability.MERGE_SCHEMA_EVOLUTION, to indicate that a DSV2 table wants Spark to handle schema evolution for MERGE
- the ResolveMergeIntoSchemaEvolution rule, which generates DSV2 TableChanges and calls Catalog.alterTable

For any new field in the top level or in a nested struct, Spark will add the field to the end.
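
The append-at-the-end behaviour can be sketched with plain dictionaries, assuming a schema is modelled as an ordered mapping from field name to either a type name or a nested mapping (a struct). This is not Spark's implementation: `evolve_schema` is a hypothetical helper, and the returned list of dotted paths only approximates the kind of "add column" changes that would be handed to the catalog. Differences in the types of existing fields are ignored here.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical schema model: field name -> type name, or a nested struct.
Schema = Dict[str, Union[str, "Schema"]]


def evolve_schema(
    target: Schema, source: Schema, prefix: Tuple[str, ...] = ()
) -> Tuple[Schema, List[str]]:
    """Return (evolved target schema, dotted paths of the added fields).

    Fields that exist only in the source are appended after the existing
    target fields, at the top level and inside nested structs.
    """
    evolved: Schema = dict(target)  # copy; preserves the target's field order
    added: List[str] = []
    for name, src_type in source.items():
        path = prefix + (name,)
        if name not in target:
            evolved[name] = src_type  # new key lands at the end
            added.append(".".join(path))
        elif isinstance(src_type, dict) and isinstance(target[name], dict):
            evolved[name], nested = evolve_schema(target[name], src_type, path)
            added.extend(nested)
    return evolved, added


target = {"id": "int", "info": {"c": "int"}}
source = {"id": "int", "info": {"b": "int"}, "extra": "string"}
evolved, added = evolve_schema(target, source)
```

With these inputs, `evolved` keeps `id` and `info` in place, gains `b` after `c` inside `info`, and gains `extra` as the last top-level field; `target` itself is left unmodified.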

TODOS:

1. This currently does not support the case where SOURCE is missing a nested field that exists in TARGET and the statement uses UPDATE SET * or INSERT *.

Example:
```
MERGE INTO TARGET t USING SOURCE s
-- s = struct('a', struct('b', int))
-- t = struct('a', struct('c', int))
```
will only work if the user explicitly specifies a value for the new nested field t.b in INSERT and UPDATE, i.e.
```
INSERT (s) VALUES (nested_struct('a', nested_struct('b', 1, 'c', 2)))
UPDATE SET a.b = 2
```
 and not if they use INSERT * or UPDATE SET *.

2. Type widening is not allowed for the moment, as we need to decide which widenings to allow.

This can be addressed in a follow-up PR.
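
To make the first limitation concrete, the sketch below lists the target fields that the source does not provide, using the `s` and `t` shapes from the example above. It is an illustration under assumptions, not how Spark performs this check: schemas are modelled as nested dictionaries and `fields_missing_from_source` is a hypothetical helper.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical schema model: field name -> type name, or a nested struct.
Schema = Dict[str, Union[str, "Schema"]]


def fields_missing_from_source(
    target: Schema, source: Schema, prefix: Tuple[str, ...] = ()
) -> List[str]:
    """Dotted paths of target fields that have no counterpart in the source."""
    missing: List[str] = []
    for name, tgt_type in target.items():
        path = prefix + (name,)
        if name not in source:
            missing.append(".".join(path))
        elif isinstance(tgt_type, dict) and isinstance(source[name], dict):
            missing.extend(fields_missing_from_source(tgt_type, source[name], path))
    return missing


source = {"a": {"b": "int"}}  # s = struct('a', struct('b', int))
target = {"a": {"c": "int"}}  # t = struct('a', struct('c', int))

missing = fields_missing_from_source(target, source)
# Nested paths contain a dot; these are the ones a star clause cannot fill in.
missing_nested = [p for p in missing if "." in p]
```

For this pair, `missing_nested` contains `a.c`: the source has no value for it, so a star clause has nothing to assign and the user has to spell out the nested values explicitly.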

### Why are the changes needed?
#45748 added the syntax 'WITH SCHEMA EVOLUTION' to 'MERGE INTO'. However, this requires an external Spark extension to resolve the MERGE, and it doesn't do anything in Spark's native MERGE implementation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added many tests to MergeIntoTableSuiteBase

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #51698 from szehon-ho/merge_schema_evolution.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jan 13, 2026
### What changes were proposed in this pull request?

Similar to the [MERGE WITH SCHEMA EVOLUTION PR](#45748), **this PR introduces a syntax `WITH SCHEMA EVOLUTION` to the SQL `INSERT` command.** Since this syntax is not fully implemented for any table formats yet, **users will receive an exception if they try to use it.**

When `WITH SCHEMA EVOLUTION` is specified, schema evolution-related features must be turned on for this single statement and only in this statement.

**In this PR, Spark is only responsible for recognizing the existence or absence of the syntax WITH SCHEMA EVOLUTION**, and the recognition info is passed down from the `Analyzer`. When `WITH SCHEMA EVOLUTION` is detected, Spark sets the `mergeSchema` write option to `true` in the respective V2 Insert Command nodes.

Data sources must respect the syntax and respond accordingly: turn on the features categorised as "schema evolution" when the `WITH SCHEMA EVOLUTION` syntax is present.
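
The hand-off described above amounts to adding one write option when the clause is present. The following is a minimal sketch of that idea, assuming write options are a plain string-to-string mapping; `options_for_insert` is a hypothetical helper, not a Spark API, and only the option name `mergeSchema` comes from the description.

```python
from typing import Dict, Mapping


def options_for_insert(
    options: Mapping[str, str], with_schema_evolution: bool
) -> Dict[str, str]:
    """Return a copy of the write options, adding mergeSchema=true only
    when the statement carried WITH SCHEMA EVOLUTION."""
    result = dict(options)
    if with_schema_evolution:
        result["mergeSchema"] = "true"
    return result


base = {"compression": "zstd"}  # arbitrary example option
evolving = options_for_insert(base, with_schema_evolution=True)
plain = options_for_insert(base, with_schema_evolution=False)
```

Because the function returns a copy, the option affects only the statement that requested it; other writes built from `base` are left as they were.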

### Why are the changes needed?

This intuitive SQL syntax allows the user to request automatic schema evolution for a specific `INSERT` operation.

Some users would like schema evolution for DML commands such as `MERGE` and `INSERT`, where the schemas of the table and the query relation can differ.

### Does this PR introduce _any_ user-facing change?

Yes, this PR introduces the `WITH SCHEMA EVOLUTION` syntax for SQL `INSERT`.

### How was this patch tested?

Added unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53732 from longvu-db/insert-schema-evolution.

Lead-authored-by: Thang Long VU <long.vu@databricks.com>
Co-authored-by: Thang Long Vu <107926660+longvu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

5 participants