
[SPARK-56414][SQL] Per-write options should take precedence over session config in Parquet and Avro #55280

Open
cloud-fan wants to merge 3 commits into apache:master from
cloud-fan:fix-parquet-write-option-priority

What changes were proposed in this pull request?

In ParquetUtils.prepareWrite and AvroUtils.prepareWrite, several Hadoop configuration keys are unconditionally set from the session-level SQLConf. Since write options are already merged into the Hadoop conf upstream via SessionState.newHadoopConfWithOptions, these unconditional sets silently overwrite any per-write options the user provided.

This PR introduces a shared DataSourceUtils.setConfIfAbsent utility that only sets a key from SQLConf if it is not already present in the conf, allowing per-write options to take precedence.
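The set-if-absent behavior described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the real signature of DataSourceUtils.setConfIfAbsent is not shown in this description, and a mutable map stands in for the Hadoop Configuration so that the example runs without Spark or Hadoop on the classpath. The object name and the `Conf` alias are hypothetical.

```scala
import scala.collection.mutable

object SetConfIfAbsentSketch {
  // Hypothetical stand-in for org.apache.hadoop.conf.Configuration.
  type Conf = mutable.Map[String, String]

  // Assumed shape of the helper: write the session-level value only when
  // the key is not already present (for example, from a per-write option).
  def setConfIfAbsent(conf: Conf, key: String, sessionValue: => String): Unit = {
    if (!conf.contains(key)) {
      conf.update(key, sessionValue)
    }
  }

  def main(args: Array[String]): Unit = {
    val key = "spark.sql.parquet.outputTimestampType"

    // Case 1: a per-write option was already merged into the conf upstream.
    val withOption: Conf = mutable.Map(key -> "TIMESTAMP_MICROS")
    setConfIfAbsent(withOption, key, "INT96")
    println(withOption(key)) // the per-write value is kept

    // Case 2: no per-write option, so the session value fills the gap.
    val withoutOption: Conf = mutable.Map.empty[String, String]
    setConfIfAbsent(withoutOption, key, "INT96")
    println(withoutOption(key)) // the session value is used
  }
}
```

The session value is passed by name here so that it is only evaluated when the key is actually missing; whether the real utility does the same is an assumption.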

Parquet keys fixed:

  • spark.sql.parquet.writeLegacyFormat
  • spark.sql.parquet.outputTimestampType
  • spark.sql.parquet.fieldId.write.enabled
  • spark.sql.legacy.parquet.nanosAsLong
  • spark.sql.parquet.annotateVariantLogicalType

Avro keys fixed:

  • Zstandard buffer pool (avro.output.codec.zstd.bufferpool)
  • Compression levels (avro.mapred.<codec>.level)

Why are the changes needed?

Per-write options (passed via DataFrameWriter.option()) should take precedence over session-level SQLConf defaults. This is already the case for compression codecs in both Parquet and Avro, but other write configuration keys had their per-write values silently overwritten. For example, setting spark.sql.parquet.outputTimestampType as a write option had no effect because prepareWrite always replaced it with the session config value.

Does this PR introduce any user-facing change?

Yes. Per-write options for the listed keys now take effect instead of being silently ignored. Previously, only the session-level SQLConf value was used regardless of what was passed as a write option.
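From the user's side, the change might look like the sketch below. It assumes Spark is on the classpath and runs in local mode; the object name, method name, and output path are hypothetical. The session config asks for INT96 timestamps while the individual write asks for TIMESTAMP_MICROS; with this PR the per-write value is expected to win, whereas before it would have been silently replaced by the session value.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object PerWriteOptionExample {
  // Writes a single timestamp row, overriding the session-level
  // outputTimestampType with a per-write option.
  def writeWithOverride(spark: SparkSession, path: String): Unit = {
    // Session-level default for all Parquet writes in this session.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")

    spark.range(1)
      .selectExpr("current_timestamp() AS ts")
      .write
      .mode("overwrite")
      // Per-write option: intended to take precedence for this write only.
      .option("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
      .parquet(path)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("per-write-option-example")
      .getOrCreate()
    try {
      val path = Files.createTempDirectory("ts-example").resolve("out").toString
      writeWithOverride(spark, path)
      // Reading back confirms the write succeeded; checking the physical
      // timestamp type would require inspecting the Parquet file footer.
      println(spark.read.parquet(path).count())
    } finally {
      spark.stop()
    }
  }
}
```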

How was this patch tested?

Added an integration test in ParquetEncodingSuite that verifies per-write options override session config for outputTimestampType (checks the physical Parquet type in the file footer) and writerVersion (checks delta encoding is used with PARQUET_2_0).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-6)

…on config in Parquet and Avro

Co-authored-by: Isaac
cloud-fan changed the title from "[SPARK-xxxx][SQL] Per-write options should take precedence over session config in Parquet and Avro" to "[SPARK-56414][SQL] Per-write options should take precedence over session config in Parquet and Avro" on Apr 9, 2026