
[SPARK-55690] Schema evolution in DSv2 AppendData, OverwriteByExpression, OverwritePartitionsDynamic#54488

Closed
johanl-db wants to merge 16 commits into apache:master from johanl-db:dsv2-schema-evolution-insert

Conversation

@johanl-db (Contributor) commented Feb 25, 2026

### What changes were proposed in this pull request?

Adds support for schema evolution during INSERT operations (AppendData, OverwriteByExpression, OverwritePartitionsDynamic).

When the table reports the capability AUTOMATIC_SCHEMA_EVOLUTION, a new analyzer rule ResolveInsertSchemaEvolution collects new columns and nested fields that are present in the source query but not in the table schema, and adds them to the target table by calling catalog.alterTable().

Identifying new columns/fields respects the resolution semantics of INSERT operations: matching fields by name vs. by position.
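The core of the rule can be thought of as a recursive diff between the source and target schemas. The sketch below is a simplified, hypothetical illustration of that idea (the `Field` type and `missingFields` helper are invented for illustration; Spark's actual implementation operates on its internal schema types):

```scala
// Simplified model of a schema: a named field with optional nested children.
// This is NOT Spark's StructType, just a stand-in for illustration.
case class Field(name: String, children: Seq[Field] = Nil)

// Collect the paths of fields present in `source` but absent from `target`,
// matching by name (case-insensitively here) and recursing into nested fields.
def missingFields(
    source: Seq[Field],
    target: Seq[Field],
    path: Seq[String] = Nil): Seq[Seq[String]] = {
  source.flatMap { s =>
    target.find(_.name.equalsIgnoreCase(s.name)) match {
      case Some(t) => missingFields(s.children, t.children, path :+ s.name)
      case None    => Seq(path :+ s.name) // new column or nested field
    }
  }
}

val table = Seq(Field("id"), Field("info", Seq(Field("a"))))
val query = Seq(Field("id"), Field("info", Seq(Field("a"), Field("b"))), Field("value"))
println(missingFields(query, table)) // List(List(info, b), List(value))
```

Each returned path would then be turned into an add-column change applied to the target table via catalog.alterTable().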

This builds on previous work from @szehon-ho, in particular #51698.
The first two commits move this previous code around to reuse it; the core of the implementation is in the third commit.

### Why are the changes needed?

The WITH SCHEMA EVOLUTION syntax for SQL inserts was added recently in #53732. This PR actually implements schema evolution behind that syntax.

### Does this PR introduce any user-facing change?

Yes. When the WITH SCHEMA EVOLUTION clause is specified in a SQL INSERT operation, new columns and nested fields in the source data are added to the target table, assuming the data source supports schema evolution (capability AUTOMATIC_SCHEMA_EVOLUTION):

```sql
CREATE TABLE target (id INT);
INSERT INTO target VALUES (1);
INSERT WITH SCHEMA EVOLUTION INTO target SELECT 2 AS id, "two" AS value;

SELECT * FROM target;
```

| id | value |
|----|-------|
| 1  | null  |
| 2  | "two" |

### How was this patch tested?

Added basic testing in DataSourceV2SQLSuite.
Integrated with Delta and ran the extensive Delta test harness for schema evolution against this implementation; see delta-io/delta#6140. There are a number of expected failures for tests that would need to be updated on the Delta side (different error class returned, negative tests checking that something specifically doesn't work when a fix is disabled, ...).

@szehon-ho (Member) left a comment


Thanks, I think this is a great PR! The test coverage can be improved for various cases, but functionally it's a good change, e.g.:

  • INSERT OVERWRITE with PartitionOverwriteMode.DYNAMIC + schema evolution
  • Case-insensitive column name matching
  • Static partition overwrite with schema evolution
  • Table without AUTOMATIC_SCHEMA_EVOLUTION capability should no-op

etc

@johanl-db johanl-db changed the title [WIP][SPARK-55690] Schema evolution in DSv2 AppendData, OverwriteByExpression, OverwritePartitionsDynamic [SPARK-55690] Schema evolution in DSv2 AppendData, OverwriteByExpression, OverwritePartitionsDynamic Feb 26, 2026
@johanl-db johanl-db requested a review from szehon-ho February 26, 2026 14:09
@szehon-ho (Member) left a comment

This looks good to me!

suggestion: add tests like:

  • type evolution
  • 2 level structs
  • non-partitioned table
  • constraints

Also, do we run the same tests with the DataFrame API? (I think we only test with SQL?)


@johanl-db (Contributor, Author) commented Mar 3, 2026

> add tests like:
>
> • type evolution
> • 2 level structs
> • non-partitioned table
> • constraints
>
> Also, do we run the same tests with the DataFrame API? (I think we only test with SQL?)

I meant to reply earlier to this:

  • 2 level structs: I've added tests for struct evolution nested inside another struct, a map key/value and an array.
  • type evolution: I'm planning a follow-up to properly support type evolution - it'll need to be more granular than today, where we blindly attempt to apply any type change even if the table doesn't support it. I've added a basic test here, but proper coverage will be added in that follow-up.
  • non-partitioned tables: for REPLACE WHERE, the only reason I'm using partitioned tables is that the InMemoryCatalog doesn't support partial overwrite on non-partitioned tables. It didn't seem necessary to implement this; schema evolution doesn't apply to partition columns (you can't add a new partition column, and partition columns can't be nested types, so struct evolution can't apply to them).
  • Constraints: not sure exactly what you had in mind there

For the DataFrame API: Spark doesn't actually provide a way to enable schema evolution via the DataFrame API, so I've left that out for now. Adding it would require more discussion: Delta (and Iceberg) do it via a writer option mergeSchema, but Spark doesn't really say anything about that.

@szehon-ho (Member) commented Mar 3, 2026

> I meant to reply earlier to this:

No worries, I took another look and the test coverage looks good in the latest PR.

> non-partitioned tables: for REPLACE WHERE, the only reason I'm using partitioned tables is that the InMemoryCatalog doesn't support partial overwrite on non-partitioned tables. It didn't seem necessary to implement this; schema evolution doesn't apply to partition columns (you can't add a new partition column, and partition columns can't be nested types, so struct evolution can't apply to them).

Yeah, makes sense; I think it can be a follow-up.

> Constraints: not sure exactly what you had in mind there

Sorry, maybe we can ignore that. I was thinking of the other case; it was more applicable when the source has fewer columns (checking whether inserting NULLs violates the constraints), but you are right that it doesn't apply here.

> For the DataFrame API: Spark doesn't actually provide a way to enable schema evolution via the DataFrame API, so I've left that out for now. Adding it would require more discussion: Delta (and Iceberg) do it via a writer option mergeSchema, but Spark doesn't really say anything about that.

Yes, I also realized after I typed it that there is no mergeSchema option for normal inserts. It'd be nice at some point, but definitely a follow-up.

@aokolnychyi (Contributor) commented

@szehon-ho @johanl-db I'd say we should get #54704 in first to reduce the scope of this PR and simplify review.

@johanl-db johanl-db requested a review from aokolnychyi March 10, 2026 11:55
```
@@ -3642,7 +3648,8 @@ class Analyzer(
override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
  _.containsPattern(COMMAND), ruleId) {
  case v2Write: V2WriteCommand
```
@aokolnychyi (Contributor) commented on this diff:

@johanl-db @szehon-ho, can you folks explain the relation between skipSchemaEvolution via ACCEPT_ANY_SCHEMA and automatic schema evolution via AUTOMATIC_SCHEMA_EVOLUTION? Are these two mutually exclusive? Or can they co-exist? MERGE vs INSERT?

@szehon-ho (Member) replied:

MERGE with ACCEPT_ANY_SCHEMA on a normal DSv2 data source breaks today, as it relies on an external rule to resolve the merge.

I think INSERT already works with ACCEPT_ANY_SCHEMA, and this would be another mode. Probably they should be mutually exclusive?

@johanl-db (Contributor, Author) replied:

Discussed with @aokolnychyi this morning: AUTOMATIC_SCHEMA_EVOLUTION and ACCEPT_ANY_SCHEMA are not exclusive:

  • AUTOMATIC_SCHEMA_EVOLUTION allows the rule ResolveSchemaEvolution to trigger
  • ACCEPT_ANY_SCHEMA skips some resolution steps in Spark, under the assumption that the connector will handle them. In particular:
    • For INSERT, skips schema alignment in ResolveOutputRelation
    • For MERGE: skips clause resolution in Analyzer, and skips schema alignment in ResolveRowLevelCommandAssignments

At least, that's how Spark applies these capabilities today, even though the name ACCEPT_ANY_SCHEMA suggests more.

The connector can choose to set either, depending on the resolution flow that suits it.
For example, Delta today always handles schema evolution itself (it doesn't set AUTOMATIC_SCHEMA_EVOLUTION) and does resolution / schema alignment on its own (it sets ACCEPT_ANY_SCHEMA).

As Delta moves to DSv2, my plan is to have two phases:

  1. Delta sets both AUTOMATIC_SCHEMA_EVOLUTION and ACCEPT_ANY_SCHEMA: Spark handles schema evolution, but Delta takes over to do the resolution of MERGE clauses initially, and then do schema alignment for both INSERT and MERGE
  2. Delta sets only AUTOMATIC_SCHEMA_EVOLUTION: once we've reconciled all behavior differences between how Delta and Spark do schema alignment today, we hand over schema alignment to Spark. This will require substantial effort, and careful breaking changes (if at all possible) in Delta.
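To make the co-existence concrete, here is a toy sketch of the two independent decisions the capabilities drive, matching the two phases above. All names here (`Capability`, `ResolutionPlan`, `planFor`) are invented stand-ins, not Spark's TableCapability API:

```scala
// Hypothetical stand-ins for the two DSv2 table capabilities discussed above.
sealed trait Capability
case object AutomaticSchemaEvolution extends Capability // AUTOMATIC_SCHEMA_EVOLUTION
case object AcceptAnySchema extends Capability          // ACCEPT_ANY_SCHEMA

case class ResolutionPlan(runSchemaEvolution: Boolean, alignSchemaInSpark: Boolean)

// The two capabilities answer independent questions, so they can co-exist.
def planFor(caps: Set[Capability]): ResolutionPlan = ResolutionPlan(
  runSchemaEvolution = caps.contains(AutomaticSchemaEvolution), // ResolveSchemaEvolution triggers
  alignSchemaInSpark = !caps.contains(AcceptAnySchema))         // else the connector aligns

// Phase 1: Spark evolves the schema, the connector does alignment itself.
println(planFor(Set(AutomaticSchemaEvolution, AcceptAnySchema)))
// Phase 2: Spark does both.
println(planFor(Set(AutomaticSchemaEvolution)))
```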

```scala
object ResolveSchemaEvolution extends Rule[LogicalPlan] with Logging {

  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
    _.containsPattern(COMMAND), ruleId) {
```
@aokolnychyi (Contributor) commented on this code:

I do agree about _.containsPattern, but what about ruleId? Does it mean we will now mark this rule as ineffective after the first pass? Will this potentially break MERGE, which relies on some but not full resolution before the schema evolution kicks in?

@szehon-ho, can you check?

@johanl-db (Contributor, Author) replied:

Removed ruleId, this is very likely wrong.

@aokolnychyi (Contributor) commented

This is getting very close; primarily minor suggestions. Worried about the rule ID.
Thanks for pushing this through, @johanl-db!

@johanl-db johanl-db requested a review from aokolnychyi March 13, 2026 17:33
@aokolnychyi (Contributor) left a comment

Looks good to me except some minor points.

@aokolnychyi (Contributor) commented

Thanks, @johanl-db! Merged to master.

dongjoon-hyun pushed a commit that referenced this pull request Mar 23, 2026
### What changes were proposed in this pull request?

Follow-up cleanup after [merge / V2 write schema evolution refactors](#54488):

- Remove unused `DataTypeUtils.extractAllFieldPaths` and `extractLeafFieldPaths` (no call sites).
- Update `Analyzer` comments that still referred to the removed rule name `ResolveMergeIntoTableSchemaEvolution`; the rule is now `ResolveSchemaEvolution`.
- Clarify the `MergeIntoTable.schemaEvolutionReady` comment: `isSchemaEvolutionCandidate` lives on the companion object and is private.

### Why are the changes needed?

Avoid misleading references to deleted APIs and dead utility code.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests; change is comment + unused API removal only.

Closes #54930 from szehon-ho/SPARK-55690-fix-dead-refs.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
murali-db pushed a commit to delta-io/delta that referenced this pull request Apr 1, 2026
…6319)

## Description
Support for schema evolution in INSERT landed in Spark master:
apache/spark#54488

A few changes are required to maintain compatibility in Delta:
- V2Write commands (AppendData, OverwriteByExpression, ...) now have an extra parameter `withSchemaEvolution` and introduce a few methods (e.g. `writePrivileges`) that need to be implemented to support schema evolution for V2Write commands.
- Spark no longer sets Delta's writer option "mergeSchema", which was used as a workaround until `withSchemaEvolution` was introduced on write plan nodes. Instead, Spark now sets `withSchemaEvolution` directly, and a pre-resolution rule `PropagateSchemaEvolutionWriteOption` is added in Delta to set the writer option `mergeSchema` when `withSchemaEvolution=true`, if it's not already explicitly set by the user.
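The propagation rule described above boils down to a defaulting step. A minimal sketch of that logic, under the assumption stated in the bullet (the function name is illustrative; Delta's actual rule rewrites a logical plan rather than a plain options map):

```scala
// If the write requests schema evolution and the user did not set "mergeSchema"
// explicitly, default it to true; otherwise leave the user's options untouched.
// Simplified illustration, not Delta's actual rule implementation.
def propagateSchemaEvolutionOption(
    withSchemaEvolution: Boolean,
    userOptions: Map[String, String]): Map[String, String] = {
  if (withSchemaEvolution && !userOptions.contains("mergeSchema")) {
    userOptions + ("mergeSchema" -> "true")
  } else {
    userOptions
  }
}

println(propagateSchemaEvolutionOption(withSchemaEvolution = true, Map.empty))
println(propagateSchemaEvolutionOption(withSchemaEvolution = true, Map("mergeSchema" -> "false")))
println(propagateSchemaEvolutionOption(withSchemaEvolution = false, Map.empty))
```

Note that an explicit user setting of `mergeSchema` (even `"false"`) wins over the propagated default.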

## How was this patch tested?
Existing tests

## Does this PR introduce _any_ user-facing changes?
No
