[SPARK-55689] Skip unsupported column changes during schema evolution#54658

Closed
johanl-db wants to merge 9 commits into apache:master from johanl-db:dsv2-type-evolution

@johanl-db johanl-db commented Mar 6, 2026

What changes were proposed in this pull request?

The initial implementation of schema evolution in MERGE/INSERT is too aggressive when trying to automatically apply schema evolution: any type mismatch between the source data and target table triggers an attempt to change the target data type, even though the table may not support it.

This change adds a new DSv2 trait SupportsSchemaEvolution that lets connectors indicate whether a given column change should be applied or not.

Why are the changes needed?

When schema evolution is enabled, the following write currently fails if the connector doesn't support changing the type of value from STRING to INT:

CREATE TABLE t (key INT, value STRING);
INSERT WITH SCHEMA EVOLUTION INTO t VALUES (1, 1)

On the other hand, the write succeeds without schema evolution: a cast from INT to STRING is added, which is valid.

Does this PR introduce any user-facing change?

Yes. The following query now succeeds instead of trying, and failing, to change the data type of value to INT:

CREATE TABLE t (key INT, value INT);
INSERT WITH SCHEMA EVOLUTION INTO t VALUES (1, "1")

How was this patch tested?

Added tests for type evolution in INSERT and MERGE INTO.

@johanl-db johanl-db force-pushed the dsv2-type-evolution branch from b5c6f7d to 285a4a8 Compare March 6, 2026 15:45

@gengliangwang gengliangwang left a comment

Overall Summary

This PR introduces a SupportsTypeEvolution DSv2 trait so connectors can control which type changes are allowed during schema evolution, and extends INSERT operations with schema evolution support (previously only MERGE INTO). The schema change logic is refactored from MergeIntoTable/ResolveMergeIntoSchemaEvolution into a shared ResolveSchemaEvolution utility. The design direction is sound.

General comments:

  • The PR description still says "TODO" under user-facing changes. This should be filled in before merging.
  • The changesForSchemaEvolution lazy val on V2WriteCommand computes schemaChanges even when schemaEvolutionEnabled is false (since it's a separate lazy val). While needSchemaEvolution properly guards the combination, consider making changesForSchemaEvolution short-circuit when schema evolution is disabled, to avoid calling into ResolveSchemaEvolution.schemaChanges unnecessarily.
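
The short-circuit suggested in the last bullet can be sketched roughly as follows. This is a simplified Java model, not Spark's Scala code: `WriteCommandSketch` and the `computeChanges` supplier are hypothetical stand-ins for `V2WriteCommand` and the call into `ResolveSchemaEvolution.schemaChanges`, and changes are modeled as plain strings.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a write command; not Spark's V2WriteCommand.
final class WriteCommandSketch {
    private final boolean schemaEvolutionEnabled;
    // Assumption: stands in for the (potentially costly) schema diff computation.
    private final Supplier<List<String>> computeChanges;
    // Memoized result, null until first access (mimics a Scala lazy val).
    private List<String> cached;

    WriteCommandSketch(boolean schemaEvolutionEnabled, Supplier<List<String>> computeChanges) {
        this.schemaEvolutionEnabled = schemaEvolutionEnabled;
        this.computeChanges = computeChanges;
    }

    // Short-circuits: the supplier is never invoked when evolution is disabled.
    List<String> changesForSchemaEvolution() {
        if (cached == null) {
            cached = schemaEvolutionEnabled
                ? computeChanges.get()
                : Collections.<String>emptyList();
        }
        return cached;
    }

    boolean needSchemaEvolution() {
        return schemaEvolutionEnabled && !changesForSchemaEvolution().isEmpty();
    }
}
```

With this shape, a caller that reads `changesForSchemaEvolution()` directly on a command with evolution disabled gets an empty list without triggering the schema diff.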

@dongjoon-hyun dongjoon-hyun marked this pull request as draft March 7, 2026 18:32
@johanl-db johanl-db force-pushed the dsv2-type-evolution branch from 285a4a8 to a3bb1ed Compare March 17, 2026 17:21
@johanl-db johanl-db changed the title [WIP] Type evolution in DSv2 Skip unsupported column changes during schema evolution Mar 17, 2026
@johanl-db johanl-db changed the title Skip unsupported column changes during schema evolution [SPARK-55689] Skip unsupported column changes during schema evolution Mar 17, 2026
@johanl-db johanl-db force-pushed the dsv2-type-evolution branch from a3bb1ed to 695f9a8 Compare March 17, 2026 18:16
@johanl-db johanl-db force-pushed the dsv2-type-evolution branch from 695f9a8 to c560be2 Compare March 17, 2026 18:18
@johanl-db johanl-db marked this pull request as ready for review March 17, 2026 18:19

@szehon-ho szehon-ho left a comment

lgtm!

isByName: Boolean): Array[TableChange] =
computeSchemaChanges(
originalTarget,
isByName: Boolean): Array[TableChange] = {

A question for a follow-up PR: does it make sense to return Seq instead of Array here?

@cloud-fan cloud-fan left a comment

Summary

The fundamental problem. When DML (INSERT or MERGE) encounters a type mismatch between source data and target table, Spark has two mutually exclusive strategies per column: evolve the target schema to match the source, or cast the source data to match the target. On master, enabling schema evolution commits to evolve-everything — all type changes are passed to catalog.alterTable, and if the connector can't apply one, the query fails. There is no fallback to casting. This means schema evolution is strictly worse than no schema evolution for type combinations where casting would work (e.g., inserting INT into a STRING column).

The solution. This PR introduces SupportsSchemaEvolution, a DSv2 mix-in interface where connectors declare which column changes they can physically apply. computeSupportedSchemaChanges filters candidate changes through supportsColumnChange before sending them to alterTable. Unsupported changes are silently dropped from the evolution set — the remaining type mismatches then resolve through standard TableOutputResolver cast insertion, just as if schema evolution were disabled for those columns. This lets connectors evolve what they can and cast the rest, instead of all-or-nothing.
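
As a rough illustration of the filtering step described above, here is a minimal Java sketch. It is a simplified model under stated assumptions: `TypeChange` and `ColumnChangeSupport` are hypothetical stand-ins (the real `SupportsSchemaEvolution` interface operates on Spark's own types and its exact signature may differ), and only the method names `supportsColumnChange` and `computeSupportedSchemaChanges` come from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the filtering step; none of these types are Spark's real classes.
final class EvolutionFilterSketch {

    // Hypothetical stand-in for one candidate type change on a target column.
    static final class TypeChange {
        final String column;
        final String fromType;
        final String toType;

        TypeChange(String column, String fromType, String toType) {
            this.column = column;
            this.fromType = fromType;
            this.toType = toType;
        }
    }

    // Hypothetical connector hook; the real interface signature may differ.
    interface ColumnChangeSupport {
        boolean supportsColumnChange(TypeChange change);
    }

    // Keeps only the changes the connector says it can apply.
    // Dropped changes are left to be resolved by casting the source data.
    static List<TypeChange> computeSupportedSchemaChanges(
            List<TypeChange> candidates, ColumnChangeSupport table) {
        List<TypeChange> supported = new ArrayList<>();
        for (TypeChange change : candidates) {
            if (table.supportsColumnChange(change)) {
                supported.add(change);
            }
        }
        return supported;
    }

    // Example: an assumed connector that can only widen INT to BIGINT.
    static List<TypeChange> example() {
        ColumnChangeSupport wideningOnly =
            c -> c.fromType.equals("INT") && c.toType.equals("BIGINT");
        List<TypeChange> candidates = Arrays.asList(
            new TypeChange("key", "INT", "BIGINT"),
            new TypeChange("value", "STRING", "INT"));
        return computeSupportedSchemaChanges(candidates, wideningOnly);
    }
}
```

In `example()`, only the `key` change survives the filter. The `value` change from STRING to INT is dropped, so in the model that column keeps its STRING type and the incoming INT would be cast instead.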

Scope. The interface applies uniformly to all DML with schema evolution — both INSERT (V2WriteCommand.pendingSchemaChanges) and MERGE (MergeIntoTable.pendingSchemaChanges) call computeSupportedSchemaChanges.

Backward compatibility. Tables that have AUTOMATIC_SCHEMA_EVOLUTION but don't implement SupportsSchemaEvolution get the old unfiltered behavior (all changes attempted).
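
The fallback for tables that do not implement the new interface could look roughly like the sketch below. Everything here is illustrative: `Table` and the nested `SupportsSchemaEvolution` are simplified, hypothetical stand-ins that only mirror the names used in this PR, `changesToApply` is an invented helper name, and changes are modeled as plain strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Simplified model of the backward-compatible dispatch; not Spark's real classes.
final class BackwardCompatSketch {

    // Hypothetical stand-in for a DSv2 table.
    interface Table {}

    // Hypothetical stand-in for the opt-in mix-in; the real signature may differ.
    interface SupportsSchemaEvolution extends Table {
        boolean supportsColumnChange(String change);
    }

    static List<String> changesToApply(Table table, List<String> candidates) {
        if (table instanceof SupportsSchemaEvolution) {
            // Connector opted in: keep only the changes it declares it can apply.
            SupportsSchemaEvolution evolving = (SupportsSchemaEvolution) table;
            return candidates.stream()
                .filter(evolving::supportsColumnChange)
                .collect(Collectors.toList());
        }
        // Connector did not opt in: old behavior, every change is attempted.
        return candidates;
    }

    static List<String> sampleCandidates() {
        return Arrays.asList("widen key INT -> BIGINT", "change value STRING -> INT");
    }
}
```

The design choice this models is that filtering is opt-in: existing connectors see no behavior change until they implement the mix-in.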

@cloud-fan

all minor, LGTM

aokolnychyi and others added 5 commits April 8, 2026 15:19
…ntoTests.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…alog/InMemoryBaseTable.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…ysis/ResolveSchemaEvolution.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…ysis/ResolveSchemaEvolution.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…log/SupportsSchemaEvolution.java

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>