Skip to content

Add quote style and trimming to csv writier#20813

Open
xanderbailey wants to merge 12 commits intoapache:mainfrom
xanderbailey:xb/csv_writer_options
Open

Add quote style and trimming to csv writier#20813
xanderbailey wants to merge 12 commits intoapache:mainfrom
xanderbailey:xb/csv_writer_options

Conversation

@xanderbailey
Copy link
Copy Markdown
Contributor

@xanderbailey xanderbailey commented Mar 8, 2026

Which issue does this PR close?

Related arrow-rs PRs apache/arrow-rs#8960 and apache/arrow-rs#9004

Rationale for this change

The CSV writer was missing support for quote_style, ignore_leading_whitespace, and ignore_trailing_whitespace options that are available on the underlying arrow WriterBuilder. This meant users couldn't control quoting behaviour or whitespace trimming when writing CSV files.

What changes are included in this PR?

Adds three new CSV writer options wired through the full stack:

  • quote_style — controls when fields are quoted (Always, Necessary, NonNumeric, Never). Modelled as a protobuf enum (CsvQuoteStyle).
  • ignore_leading_whitespace — trims leading whitespace from string values on write.
  • ignore_trailing_whitespace — trims trailing whitespace from string values on write.

Are these changes tested?

Yes — sqllogictest coverage added in csv_files.slt

Are there any user-facing changes?

Three new format.* options available in COPY TO and CREATE EXTERNAL TABLE for CSV:

  • format.quote_style (string: Always, Necessary, NonNumeric, Never)
  • format.ignore_leading_whitespace (boolean)
  • format.ignore_trailing_whitespace (boolean)

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate labels Mar 8, 2026
Comment thread datafusion/sqllogictest/test_files/csv_files.slt Outdated
pub quote_style: i32,
/// Whether to ignore leading whitespace in string values
#[prost(bytes = "vec", tag = "21")]
pub ignore_leading_whitespace: ::prost::alloc::vec::Vec<u8>,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the pattern here for other bools being Vec

pub quote_style: i32,
/// Whether to ignore leading whitespace in string values
#[prost(bytes = "vec", tag = "21")]
pub ignore_leading_whitespace: ::prost::alloc::vec::Vec<u8>,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the pattern here for other bools being Vec

Ok(CsvQuoteStyleProto::Always) => CsvQuoteStyle::Always,
Ok(CsvQuoteStyleProto::NonNumeric) => CsvQuoteStyle::NonNumeric,
Ok(CsvQuoteStyleProto::Never) => CsvQuoteStyle::Never,
_ => CsvQuoteStyle::Necessary,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't error on:

            compression: match proto.compression {
                0 => CompressionTypeVariant::GZIP,
                1 => CompressionTypeVariant::BZIP2,
                2 => CompressionTypeVariant::XZ,
                3 => CompressionTypeVariant::ZSTD,
                _ => CompressionTypeVariant::UNCOMPRESSED,
            },

So made the same true here

@xanderbailey
Copy link
Copy Markdown
Contributor Author

@Jefffrey are you maybe able to take a look here?

Copy link
Copy Markdown
Contributor

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done some integration testing of this PR since it solves a problem one of my customers was having as well. I think it's good to merge, with a squashing of course 😈

I ended up backporting it to 52.4.0 which applied cleanly as well. I notice that arrow 57.2 actually introduced QuoteStyle which means this change requires arrow-csv 57.2.0 or later

Comment thread datafusion/common/src/parsers.rs
rtyler added a commit to buoyant-data/oxbow that referenced this pull request Apr 1, 2026
Digging deeper into load failures I discovered that while
arrow-csv and other Rust CSV parsers can handle some characters
properly, some RDMS and C-based CSV parsers (older) cannot, without
quoting.

This relies on a backport of apache/datafusion#20813 to the 52.4.0
release in order to support this. It also requires arrow-csv 57.2.0 or
later since that release is the first that `QuoteStyle` exists.
@xanderbailey
Copy link
Copy Markdown
Contributor Author

@alamb could I bother you for a review here please?

Comment thread datafusion/common/src/file_options/csv_writer.rs Outdated
Comment thread datafusion/common/src/parsers.rs Outdated
Comment thread datafusion/sqllogictest/test_files/csv_files.slt Outdated
Comment thread datafusion/sqllogictest/test_files/csv_files.slt Outdated
Comment thread datafusion/common/src/parsers.rs
Comment on lines +2945 to 2948
pub ignore_leading_whitespace: Option<bool>, default = None
/// Whether to ignore trailing whitespace in string values when writing CSV.
pub ignore_trailing_whitespace: Option<bool>, default = None
/// Specifies whether newlines in (quoted) values are supported.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering why we have this trend of doing Option<bool> instead of just bool 🤔

What is the default in case of None, and how is that different from just true/false (whichever it is)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also curious about this but didn't want to break the pattern...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it some ser-deser quirk? I'll see if I can figure this out.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I can't find a good reason for this... Would you like me to change or keep the pattern?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps its to do with hierarchy of configs and/or sql parsing 🤔

@alamb do you happen to know a reason for having Option<bool> instead of plain bool for cases where it'll end up being true or false anyway? (i.e. None doesn't represent a third state, but eventually maps to either true or false)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fired a question off into the Discord to see if anyone has some more insight; ideally I'd like to get this config right the first time to avoid needing a breaking change later (or forever be stuck with this awkward interface if it turns out to be unnecessary)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fine to leave them as Option<bool>; but would be good to have a docstring explaining what the default is in this case

@xanderbailey xanderbailey requested a review from Jefffrey April 16, 2026 10:00
Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, just want to get a second opinion on Option<bool> and which way is preferable

Comment on lines +2945 to 2948
pub ignore_leading_whitespace: Option<bool>, default = None
/// Whether to ignore trailing whitespace in string values when writing CSV.
pub ignore_trailing_whitespace: Option<bool>, default = None
/// Specifies whether newlines in (quoted) values are supported.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fired a question off into the Discord to see if anyone has some more insight; ideally I'd like to get this config right the first time to avoid needing a breaking change later (or forever be stuck with this awkward interface if it turns out to be unnecessary)

Comment on lines +2945 to 2948
pub ignore_leading_whitespace: Option<bool>, default = None
/// Whether to ignore trailing whitespace in string values when writing CSV.
pub ignore_trailing_whitespace: Option<bool>, default = None
/// Specifies whether newlines in (quoted) values are supported.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fine to leave them as Option<bool>; but would be good to have a docstring explaining what the default is in this case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add quote-style parameter for CSV options

3 participants