[SPARK-55690] Schema evolution in DSv2 AppendData, OverwriteByExpression, OverwritePartitionsDynamic#54488
johanl-db wants to merge 16 commits into apache:master
Conversation
szehon-ho
left a comment
Thanks, I think this is a great PR! Test coverage can be improved for various cases, but functionally it's a good change. For example:
- INSERT OVERWRITE with PartitionOverwriteMode.DYNAMIC + schema evolution
- Case-insensitive column name matching
- Static partition overwrite with schema evolution
- Table without AUTOMATIC_SCHEMA_EVOLUTION capability should no-op
etc
szehon-ho
left a comment
This looks good to me!
Suggestion: add tests like:
- type evolution
- 2-level structs
- non-partitioned table
- constraints

Also, do we run the same tests via the DataFrame API? (I think we only test with SQL?)
szehon-ho
left a comment
lgtm! cc @aokolnychyi @cloud-fan
I meant to reply earlier to this:

> For the DataFrame API: Spark doesn't actually provide a way to enable schema evolution via the DataFrame API, so I've left that out for now. Adding it would require more discussion: Delta (and Iceberg) do it via a writer option

No worries, I took another look and the test coverage looks good in the latest PR.
Yeah, makes sense, I think it can be a follow-up.
Sorry, maybe we can ignore that; I was thinking of the other case, which was more applicable when the source has fewer columns (checking whether filling in NULLs violates the constraints), but you are right, it doesn't apply here.
Yes, I also realized after I typed it; I forgot that there is no mergeSchema option for normal inserts. It'd be nice at some point, but definitely a follow-up.
@szehon-ho @johanl-db I'd say we should get #54704 in first to reduce the scope of this PR and simplify review.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
```diff
@@ -3642,7 +3648,8 @@ class Analyzer(
   override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
     _.containsPattern(COMMAND), ruleId) {
   case v2Write: V2WriteCommand
```
@johanl-db @szehon-ho, can you folks explain the relation between skipSchemaEvolution via ACCEPT_ANY_SCHEMA and automatic schema evolution via AUTOMATIC_SCHEMA_EVOLUTION? Are these two mutually exclusive? Or can they co-exist? MERGE vs INSERT?
MERGE with `ACCEPT_ANY_SCHEMA` on a normal DSv2 data source breaks today, as it relies on an external rule to resolve the merge.
I think INSERT already works with `ACCEPT_ANY_SCHEMA`, and this would be another mode. Probably should be mutually exclusive?
Discussed with @aokolnychyi this morning: `AUTOMATIC_SCHEMA_EVOLUTION` and `ACCEPT_ANY_SCHEMA` are not exclusive:
- `AUTOMATIC_SCHEMA_EVOLUTION` allows the rule `ResolveSchemaEvolution` to trigger.
- `ACCEPT_ANY_SCHEMA` skips some resolution steps in Spark, under the assumption that the connector will handle them. In particular:
  - For INSERT: skips schema alignment in `ResolveOutputRelation`
  - For MERGE: skips clause resolution in the Analyzer, and skips schema alignment in `ResolveRowLevelCommandAssignments`

At least, that's how Spark applies these capabilities today, even though the name `ACCEPT_ANY_SCHEMA` suggests more.
The connector can choose to set either depending on the resolution flow that suits it.
For example, Delta today always handles schema evolution itself (doesn't set `AUTOMATIC_SCHEMA_EVOLUTION`) and does resolution / schema alignment (sets `ACCEPT_ANY_SCHEMA`).
As Delta moves to DSv2, my plan is to have two phases:
1. Delta sets both `AUTOMATIC_SCHEMA_EVOLUTION` and `ACCEPT_ANY_SCHEMA`: Spark handles schema evolution, but Delta takes over to do the resolution of MERGE clauses initially, and then do schema alignment for both INSERT and MERGE.
2. Delta sets only `AUTOMATIC_SCHEMA_EVOLUTION`: once we've reconciled all behavior differences between how Delta and Spark do schema alignment today, we hand over schema alignment to Spark. This will require substantial effort, and careful breaking changes (if at all possible) in Delta.
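To make the split concrete, here is a minimal self-contained Scala sketch of the two capabilities answering two independent questions. This is not Spark's actual API: `Table`, `Capability`, and the two decision functions are hypothetical stand-ins; only the capability names come from the discussion above.

```scala
// Simplified model: the two capabilities are orthogonal switches, not modes.
sealed trait Capability
case object AutomaticSchemaEvolution extends Capability // models AUTOMATIC_SCHEMA_EVOLUTION
case object AcceptAnySchema extends Capability          // models ACCEPT_ANY_SCHEMA

final case class Table(name: String, capabilities: Set[Capability])

// Question 1: may Spark evolve the table schema (add new columns) for this write?
def schemaEvolutionAllowed(t: Table): Boolean =
  t.capabilities.contains(AutomaticSchemaEvolution)

// Question 2: should Spark skip schema alignment, trusting the connector to handle it?
def skipSchemaAlignment(t: Table): Boolean =
  t.capabilities.contains(AcceptAnySchema)

// Phase 1 of the Delta plan: both capabilities set.
val phase1 = Table("delta_phase1", Set(AutomaticSchemaEvolution, AcceptAnySchema))
// Phase 2: only AUTOMATIC_SCHEMA_EVOLUTION; Spark does schema alignment itself.
val phase2 = Table("delta_phase2", Set(AutomaticSchemaEvolution))

assert(schemaEvolutionAllowed(phase1) && skipSchemaAlignment(phase1))
assert(schemaEvolutionAllowed(phase2) && !skipSchemaAlignment(phase2))
```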
```scala
object ResolveSchemaEvolution extends Rule[LogicalPlan] with Logging {

  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
    _.containsPattern(COMMAND), ruleId) {
```
I do agree about `_.containsPattern`, but what about `ruleId`? Does it mean we will now mark this rule as ineffective after the first pass? Will this potentially break MERGE, which relies on some, but not full, resolution before the schema evolution kicks in?
@szehon-ho, can you check?
Removed `ruleId`, this was very likely wrong.
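To illustrate the concern, here is a simplified, self-contained model of fixed-point rule execution with effectiveness tracking. It is not Spark's actual `RuleExecutor`; `Plan`, `evolveRule`, and the driver loop are hypothetical stand-ins showing how disabling a rule after one no-op pass can starve a rule whose prerequisites are resolved later.

```scala
// A toy plan: schema evolution can only fire once inputs are resolved,
// like the rule above waiting on partial MERGE resolution.
final case class Plan(resolvedInputs: Boolean, evolved: Boolean)

def evolveRule(p: Plan): Plan =
  if (p.resolvedInputs && !p.evolved) p.copy(evolved = true) else p

// With a ruleId, the driver can mark a rule "ineffective" after a pass
// where it changed nothing, and then skip it -- even though a later pass
// (after other rules made progress) would have changed the plan.
def runWithIneffectiveTracking(start: Plan): Plan = {
  var plan = start
  var evolveDisabled = false
  for (pass <- 1 to 3) {
    if (!evolveDisabled) {
      val next = evolveRule(plan)
      if (next == plan) evolveDisabled = true // marked ineffective too early
      plan = next
    }
    // Some other rule resolves the inputs, but only after the first attempt.
    plan = plan.copy(resolvedInputs = true)
  }
  plan
}

val result = runWithIneffectiveTracking(Plan(resolvedInputs = false, evolved = false))
// evolveRule was disabled on pass 1, before its prerequisite held:
assert(!result.evolved)
```

Dropping the rule ID keeps the rule eligible on every pass, at the cost of re-running it, which is the safe choice here.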
This is getting very close, primarily minor suggestions. Worried about the rule ID.
aokolnychyi
left a comment
Looks good to me except some minor points.
Thanks, @johanl-db! Merged to master.
### What changes were proposed in this pull request?
Follow-up cleanup after [merge / V2 write schema evolution refactors](#54488):
- Remove unused `DataTypeUtils.extractAllFieldPaths` and `extractLeafFieldPaths` (no call sites).
- Update `Analyzer` comments that still referred to the removed rule name `ResolveMergeIntoTableSchemaEvolution`; the rule is now `ResolveSchemaEvolution`.
- Clarify the `MergeIntoTable.schemaEvolutionReady` comment: `isSchemaEvolutionCandidate` lives on the companion object and is private.

### Why are the changes needed?
Avoid misleading references to deleted APIs and dead utility code.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests; change is comment + unused API removal only.

Closes #54930 from szehon-ho/SPARK-55690-fix-dead-refs.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…6319)

## Description
Support for schema evolution in INSERT landed in Spark master: apache/spark#54488

A few changes are required to maintain compatibility in Delta:
- V2Write commands (AppendData, OverwriteByExpression, ...) now have an extra parameter `withSchemaEvolution` and introduce a few methods to implement for schema evolution in V2Write commands (e.g. `writePrivileges`).
- Spark doesn't set Delta's writer option "mergeSchema" anymore, which was used as a workaround until `withSchemaEvolution` was introduced on write plan nodes. Instead, Spark now sets `withSchemaEvolution` directly, and a pre-resolution rule `PropagateSchemaEvolutionWriteOption` is added in Delta to set the writer option `mergeSchema` when `withSchemaEvolution=true`, if it's not already explicitly set by the user.

## How was this patch tested?
Existing tests

## Does this PR introduce _any_ user-facing changes?
No
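The option-propagation step described above can be sketched as follows. This is a simplified, self-contained model, not Delta's actual `PropagateSchemaEvolutionWriteOption` rule: `WriteNode` is a hypothetical stand-in for the real V2 write plan nodes.

```scala
// Toy write node carrying the plan-level flag and connector writer options.
final case class WriteNode(withSchemaEvolution: Boolean, options: Map[String, String])

// Set mergeSchema=true when the plan requests schema evolution,
// unless the user already set mergeSchema explicitly.
def propagateSchemaEvolutionOption(w: WriteNode): WriteNode =
  if (w.withSchemaEvolution && !w.options.contains("mergeSchema"))
    w.copy(options = w.options + ("mergeSchema" -> "true"))
  else w

// WITH SCHEMA EVOLUTION requested, no explicit user option: option is injected.
val injected = propagateSchemaEvolutionOption(WriteNode(withSchemaEvolution = true, Map.empty))
assert(injected.options("mergeSchema") == "true")

// An explicit user setting wins over the plan flag.
val userSet = propagateSchemaEvolutionOption(
  WriteNode(withSchemaEvolution = true, Map("mergeSchema" -> "false")))
assert(userSet.options("mergeSchema") == "false")
```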
What changes were proposed in this pull request?
Adds support for schema evolution during INSERT operations (AppendData, OverwriteByExpression, OverwritePartitionsDynamic)
When the table reports capability `AUTOMATIC_SCHEMA_EVOLUTION`, a new analyzer rule `ResolveInsertSchemaEvolution` collects new columns and nested fields present in the source query but not in the table schema, and adds them to the target table by calling `catalog.alterTable()`. Identifying new columns/fields respects the resolution semantics of INSERT operations: matching fields by-name vs. by-position.
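As an illustration of the by-name collection step, here is a self-contained Scala sketch. It is a simplified model, not the actual rule: `Field`/`Struct` are hypothetical stand-ins for Spark's `StructType`, and case-insensitive matching is assumed for the example.

```scala
// Minimal schema model: a type is either atomic or a struct of named fields.
sealed trait DType
case object Atomic extends DType
final case class Struct(fields: List[Field]) extends DType
final case class Field(name: String, dataType: DType)

// By-name matching (case-insensitive here): return the paths of source
// fields absent from the target schema, recursing into nested structs.
def newFieldPaths(source: Struct, target: Struct,
                  prefix: List[String] = Nil): List[List[String]] =
  source.fields.flatMap { sf =>
    target.fields.find(_.name.equalsIgnoreCase(sf.name)) match {
      case None => List(prefix :+ sf.name) // brand-new column / nested field
      case Some(tf) => (sf.dataType, tf.dataType) match {
        case (s: Struct, t: Struct) => newFieldPaths(s, t, prefix :+ sf.name)
        case _ => Nil // existing leaf field: nothing to add
      }
    }
  }

val target = Struct(List(
  Field("id", Atomic),
  Field("info", Struct(List(Field("a", Atomic))))))
val source = Struct(List(
  Field("id", Atomic),
  Field("info", Struct(List(Field("a", Atomic), Field("b", Atomic)))),
  Field("extra", Atomic)))

// The rule would then add these fields via catalog.alterTable():
assert(newFieldPaths(source, target) == List(List("info", "b"), List("extra")))
```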
This builds on previous work from @szehon-ho, in particular #51698.
The first two commits move this previous code around to reuse it; the core of the implementation is in the third commit.
Why are the changes needed?
The `WITH SCHEMA EVOLUTION` syntax for SQL inserts was added recently: #53732. This PR actually implements schema evolution behind that syntax.

Does this PR introduce any user-facing change?

Yes, when the `WITH SCHEMA EVOLUTION` clause is specified in SQL INSERT operations, new columns and nested fields in the source data will be added to the target table, assuming the data source supports schema evolution (capability `AUTOMATIC_SCHEMA_EVOLUTION`).
Added basic testing in `DataSourceV2SQLSuite`.

Integrated with Delta and ran the extensive Delta test harness for schema evolution against this implementation.
See delta-io/delta#6140. There are a number of expected failures for tests that would need to be updated on the Delta side (different error class returned, negative tests checking something specifically doesn't work when a fix is disabled, ...).