Skip to content

[SPARK-56255][PYTHON][CONNECT] Make spark.read.csv accept DataFrame input#55274

Open
Yicong-Huang wants to merge 1 commit intoapache:masterfrom
Yicong-Huang:SPARK-56255
Open

[SPARK-56255][PYTHON][CONNECT] Make spark.read.csv accept DataFrame input#55274
Yicong-Huang wants to merge 1 commit intoapache:masterfrom
Yicong-Huang:SPARK-56255

Conversation

@Yicong-Huang
Copy link
Copy Markdown
Contributor

@Yicong-Huang Yicong-Huang commented Apr 9, 2026

What changes were proposed in this pull request?

This PR adds support for passing a DataFrame containing CSV strings directly to spark.read.csv(), following the same pattern established by #55097 (SPARK-56253) for spark.read.json().

Changes:

  • Classic PySpark (readwriter.py): Updated csv() to accept DataFrame input, delegating to PythonSQLUtils.csvFromDataFrame().
  • Spark Connect (connect/readwriter.py): Updated csv() to accept DataFrame input, using the existing Parse logical plan with PARSE_FORMAT_CSV.
  • Scala bridge (PythonSQLUtils.scala): Added csvFromDataFrame() method that validates the input DataFrame has a StringType first column, then delegates to DataFrameReader.csv().
  • Tests: Added 5 test cases each for classic and Connect (basic input, schema override, non-string column error, multiple columns, zero columns).

Why are the changes needed?

spark.read.json() already supports DataFrame input (SPARK-56253), but spark.read.csv() does not. This inconsistency means users who want to parse CSV strings stored in a DataFrame must use workarounds. Adding DataFrame support to csv() makes the API consistent with json() and enables Connect-compatible CSV parsing without sc.parallelize().

Does this PR introduce any user-facing change?

Yes. spark.read.csv() now accepts a DataFrame with a single string column as input, in addition to the existing str, list, and RDD inputs.

csv_df = spark.createDataFrame([("Alice,25",), ("Bob,30",)], schema="value STRING")
spark.read.csv(csv_df, schema="name STRING, age INT").show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 25|
# |  Bob| 30|
# +-----+---+

How was this patch tested?

Added 10 new test cases:

  • test_csv_with_dataframe_input (classic + connect)
  • test_csv_with_dataframe_input_and_schema (classic + connect)
  • test_csv_with_dataframe_input_non_string_column (classic + connect)
  • test_csv_with_dataframe_input_multiple_columns (classic + connect)
  • test_csv_with_dataframe_input_zero_columns (classic + connect)

All tests pass locally.

Was this patch authored or co-authored using generative AI tooling?

No.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant