[SPARK-56388][CONNECT] Add XML support to Spark Connect Parse protocol #55262
Open
hvanhovell wants to merge 1 commit into apache:master
Add PARSE_FORMAT_XML to the Connect Parse proto enum and wire it through DataFrameReader.xml(Dataset[String]) on the client and SparkConnectPlanner on the server. Includes E2E tests and plan generation test coverage. Co-authored-by: Herman van Hövell
HyukjinKwon
approved these changes
Apr 9, 2026
What changes were proposed in this pull request?
Added PARSE_FORMAT_XML to the Parse.ParseFormat enum in the Spark Connect proto (relations.proto) and wired it through the full stack:
- Client (DataFrameReader.scala): xml(Dataset[String]) now sends PARSE_FORMAT_XML instead of PARSE_FORMAT_UNSPECIFIED
- Server (SparkConnectPlanner.scala): handles PARSE_FORMAT_XML by dispatching to dataFrameReader.xml(ds)
Why are the changes needed?
DataFrameReader.xml(Dataset[String]) was sending PARSE_FORMAT_UNSPECIFIED to the server, causing it to fail. This adds proper XML support to the Connect Parse protocol.
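The failure mode and the fix described above can be illustrated with a small, self-contained model. This is only a sketch, not the code in SparkConnectPlanner.scala: the Parse case class, the PlannerSketch object, the transformParse signature, and the returned strings are all hypothetical stand-ins, and only the two enum value names come from this PR. It has no Spark dependency, so it can be pasted into a Scala REPL.

```scala
// Simplified stand-in for the Parse.ParseFormat enum in relations.proto.
// Only the two value names below appear in this PR; other formats are omitted.
sealed trait ParseFormat
object ParseFormat {
  case object PARSE_FORMAT_UNSPECIFIED extends ParseFormat
  case object PARSE_FORMAT_XML extends ParseFormat
}

// Hypothetical model of a Parse relation: a format tag plus the raw input strings.
final case class Parse(format: ParseFormat, input: Seq[String])

// Hypothetical model of the server-side dispatch (not the real planner code).
object PlannerSketch {
  import ParseFormat._

  // A known format is routed to its reader; anything else is rejected.
  // The real server would call dataFrameReader.xml(ds) in the XML branch;
  // here the branch only returns a description of what it would do.
  def transformParse(rel: Parse): Either[String, String] = rel.format match {
    case PARSE_FORMAT_XML =>
      Right(s"xml reader <- ${rel.input.size} record(s)")
    case other =>
      Left(s"Unsupported parse format: $other")
  }
}

// Before the change: the client tagged XML input as UNSPECIFIED, so dispatch failed.
val before = PlannerSketch.transformParse(
  Parse(ParseFormat.PARSE_FORMAT_UNSPECIFIED, Seq("<row><a>1</a></row>")))

// After the change: the client tags the input as XML and the server has a branch for it.
val after = PlannerSketch.transformParse(
  Parse(ParseFormat.PARSE_FORMAT_XML, Seq("<row><a>1</a></row>")))
```

In this model `before` is a Left and `after` is a Right, which mirrors why both halves of the change are needed: the client must send the new enum value, and the server must have a branch that handles it.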
Does this PR introduce any user-facing change?
Yes.
spark.read.xml(dataset) now works correctly in Spark Connect (previously it threw an error).
How was this patch tested?
- ClientE2ETestSuite tests: xml from Dataset[String] inferSchema, xml from Dataset[String] with schema, xml from Dataset[String] with invalid schema
- PlanGenerationTestSuite test: xml from dataset
- xml_from_dataset.json / xml_from_dataset.proto.bin
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.6