[SPARK-55689] Skip unsupported column changes during schema evolution #54658
johanl-db wants to merge 9 commits into apache:master
Conversation
gengliangwang
left a comment
Overall Summary
This PR introduces a SupportsTypeEvolution DSv2 trait so connectors can control which type changes are allowed during schema evolution, and extends INSERT operations with schema evolution support (previously only MERGE INTO). The schema change logic is refactored from MergeIntoTable/ResolveMergeIntoSchemaEvolution into a shared ResolveSchemaEvolution utility. The design direction is sound.
General comments:
- The PR description still says "TODO" under user-facing changes. This should be filled in before merging.
- The `changesForSchemaEvolution` lazy val on `V2WriteCommand` computes `schemaChanges` even when `schemaEvolutionEnabled` is `false` (since it's a separate lazy val). While `needSchemaEvolution` properly guards the combination, consider making `changesForSchemaEvolution` short-circuit when schema evolution is disabled, to avoid calling into `ResolveSchemaEvolution.schemaChanges` unnecessarily.
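The suggestion above can be sketched in isolation. This is a minimal illustrative model, not the actual `V2WriteCommand` code: the class, field, and counter names are hypothetical stand-ins, and the counter only exists to show that the expensive computation is skipped when the flag is off.

```scala
// Hypothetical sketch of the suggested guard: the expensive computation
// (standing in for ResolveSchemaEvolution.schemaChanges) short-circuits
// when schema evolution is disabled, instead of relying on every caller
// to check needSchemaEvolution first.
object LazyGuardSketch {
  var computeCount = 0 // tracks how often the expensive path runs

  class WriteCommand(val schemaEvolutionEnabled: Boolean) {
    // Stand-in for the costly schema-change computation.
    private def schemaChanges: Seq[String] = {
      computeCount += 1
      Seq("update_column_type")
    }

    // Short-circuits: schemaChanges is never evaluated when disabled.
    lazy val changesForSchemaEvolution: Seq[String] =
      if (schemaEvolutionEnabled) schemaChanges else Nil
  }
}
```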
```scala
    isByName: Boolean): Array[TableChange] = {
  computeSchemaChanges(
    originalTarget,
```
A question for a follow-up PR: does it make sense to return Seq instead of Array here?
cloud-fan
left a comment
Summary
The fundamental problem. When DML (INSERT or MERGE) encounters a type mismatch between source data and target table, Spark has two mutually exclusive strategies per column: evolve the target schema to match the source, or cast the source data to match the target. On master, enabling schema evolution commits to evolve-everything — all type changes are passed to catalog.alterTable, and if the connector can't apply one, the query fails. There is no fallback to casting. This means schema evolution is strictly worse than no schema evolution for type combinations where casting would work (e.g., inserting INT into a STRING column).
The solution. This PR introduces SupportsSchemaEvolution, a DSv2 mix-in interface where connectors declare which column changes they can physically apply. computeSupportedSchemaChanges filters candidate changes through supportsColumnChange before sending them to alterTable. Unsupported changes are silently dropped from the evolution set — the remaining type mismatches then resolve through standard TableOutputResolver cast insertion, just as if schema evolution were disabled for those columns. This lets connectors evolve what they can and cast the rest, instead of all-or-nothing.
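The filtering described above can be sketched with simplified stand-in types. The names below mirror the PR (`SupportsSchemaEvolution`, `supportsColumnChange`, `computeSupportedSchemaChanges`) but the signatures and types are illustrative, not the actual DSv2 API:

```scala
// Illustrative model of the filtering step: candidate table changes pass
// through a connector-supplied predicate; changes the connector cannot
// apply are dropped from the evolution set, and those columns fall back
// to ordinary cast insertion instead of failing the query.
object FilterSketch {
  sealed trait TableChange
  case class UpdateColumnType(column: String, newType: String) extends TableChange

  // Stand-in for the SupportsSchemaEvolution mix-in interface.
  trait SupportsSchemaEvolution {
    def supportsColumnChange(change: TableChange): Boolean
  }

  // Stand-in for computeSupportedSchemaChanges: keep only what the
  // connector declares it can physically apply.
  def computeSupportedSchemaChanges(
      table: SupportsSchemaEvolution,
      candidates: Seq[TableChange]): Seq[TableChange] =
    candidates.filter(table.supportsColumnChange)
}
```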
Scope. The interface applies uniformly to all DML with schema evolution — both INSERT (V2WriteCommand.pendingSchemaChanges) and MERGE (MergeIntoTable.pendingSchemaChanges) call computeSupportedSchemaChanges.
Backward compatibility. Tables that have AUTOMATIC_SCHEMA_EVOLUTION but don't implement SupportsSchemaEvolution get the old unfiltered behavior (all changes attempted).
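The compatibility rule reduces to a type test: only tables that opt into the new interface get filtered. A hedged sketch with hypothetical types (not the real catalog API):

```scala
// Illustrative model of the backward-compatibility rule: a table that
// does not mix in SupportsSchemaEvolution keeps the legacy behavior,
// i.e. every candidate change is attempted unfiltered.
object CompatSketch {
  trait Table
  trait SupportsSchemaEvolution extends Table {
    def supportsColumnChange(change: String): Boolean
  }

  def changesToApply(table: Table, candidates: Seq[String]): Seq[String] =
    table match {
      case s: SupportsSchemaEvolution => candidates.filter(s.supportsColumnChange)
      case _ => candidates // legacy: attempt everything
    }
}
```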
all minor, LGTM
…ntoTests.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…alog/InMemoryBaseTable.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…ysis/ResolveSchemaEvolution.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…ysis/ResolveSchemaEvolution.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…log/SupportsSchemaEvolution.java Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
What changes were proposed in this pull request?
The initial implementation of schema evolution in MERGE/INSERT is too aggressive when trying to automatically apply schema evolution: any type mismatch between the source data and target table triggers an attempt to change the target data type, even though the table may not support it.
This change adds a new DSv2 trait `SupportsSchemaEvolution` that lets connectors indicate whether a given column change should be applied or not.

Why are the changes needed?
When schema evolution is enabled, the following write currently fails if the connector doesn't support changing the type of `value` from STRING to INT:

On the other hand, the write succeeds without schema evolution: a cast from INT to STRING is added, which is valid.
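The concrete statements were elided from the description; a hypothetical reproduction of the scenario might look like the following (table, column, and connector names are illustrative, not taken from the patch):

```sql
-- Target table with a STRING column, backed by a connector that cannot
-- change column types (illustrative names).
CREATE TABLE target (id INT, value STRING) USING some_connector;

-- The source produces an INT for `value`. With schema evolution enabled,
-- master attempts to alter `value` to INT and fails when the connector
-- rejects the change; with this PR the unsupported change is skipped and
-- an implicit CAST to STRING is inserted instead.
INSERT INTO target SELECT 1 AS id, 42 AS value;
```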
Does this PR introduce any user-facing change?
Yes, the following query now succeeds instead of trying - and failing - to change the data type of `value` to INT:

How was this patch tested?
Added tests for type evolution in INSERT and MERGE INTO