[SPARK-56255][PYTHON][CONNECT] Make spark.read.csv accept DataFrame input by Yicong-Huang · Pull Request #55274 · apache/spark

Yicong-Huang · 2026-04-09T07:07:49Z

What changes were proposed in this pull request?

This PR adds support for passing a DataFrame containing CSV strings directly to spark.read.csv(), following the same pattern established by #55097 (SPARK-56253) for spark.read.json().

Changes:

Classic PySpark (readwriter.py): Updated csv() to accept DataFrame input, delegating to PythonSQLUtils.csvFromDataFrame().
Spark Connect (connect/readwriter.py): Updated csv() to accept DataFrame input, using the existing Parse logical plan with PARSE_FORMAT_CSV.
Scala bridge (PythonSQLUtils.scala): Added csvFromDataFrame() method that validates the input DataFrame has a StringType first column, then delegates to DataFrameReader.csv().
Tests: Added 5 test cases each for classic and Connect (basic input, schema override, non-string column error, multiple columns, zero columns).

Why are the changes needed?

spark.read.json() already supports DataFrame input (SPARK-56253), but spark.read.csv() does not. This inconsistency means users who want to parse CSV strings stored in a DataFrame must use workarounds. Adding DataFrame support to csv() makes the API consistent with json() and enables Connect-compatible CSV parsing without sc.parallelize().

Does this PR introduce any user-facing change?

Yes. spark.read.csv() now accepts a DataFrame with a single string column as input, in addition to the existing str, list, and RDD inputs.

csv_df = spark.createDataFrame([("Alice,25",), ("Bob,30",)], schema="value STRING")
spark.read.csv(csv_df, schema="name STRING, age INT").show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 25|
# |  Bob| 30|
# +-----+---+

How was this patch tested?

Added 10 new test cases:

test_csv_with_dataframe_input (classic + connect)
test_csv_with_dataframe_input_and_schema (classic + connect)
test_csv_with_dataframe_input_non_string_column (classic + connect)
test_csv_with_dataframe_input_multiple_columns (classic + connect)
test_csv_with_dataframe_input_zero_columns (classic + connect)

All tests pass locally.

Was this patch authored or co-authored using generative AI tooling?

No.

feat: make spark.read.csv accept DataFrame input

ea8eb05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-56255][PYTHON][CONNECT] Make spark.read.csv accept DataFrame input#55274

[SPARK-56255][PYTHON][CONNECT] Make spark.read.csv accept DataFrame input#55274
Yicong-Huang wants to merge 1 commit intoapache:masterfrom
Yicong-Huang:SPARK-56255

Yicong-Huang commented Apr 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Yicong-Huang commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Yicong-Huang commented Apr 9, 2026 •

edited

Loading