Add ignore leading and trailing white space to csv parser by xanderbailey · Pull Request #8960 · apache/arrow-rs

xanderbailey · 2025-12-05T23:15:31Z

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Closes Add spark like ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options to the csv writer #8961

Rationale for this change

Spark's csv writer can do this and it would be nice to have feature parity in projects like datafusion.
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.

What changes are included in this PR?

Adds two new options for ignore_leading_whitespace and ignore_trailing_whitespace to csv writer.
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.

Are these changes tested?

Yes
We typically require tests for all PRs in order to:

Prevent the code from being accidentally broken by subsequent changes
Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

xanderbailey · 2025-12-05T23:46:03Z

+                    if self.ignore_leading_whitespace {
+                        trimmed = trimmed.trim_start();
+                    }
+                    if self.ignore_trailing_whitespace {
+                        trimmed = trimmed.trim_end();
+                    }


Very open to suggestions if this is the right way to go about this. It looks like the other options are passed into FormatOptions but that's used much more widely than just the CSV reader...

alamb · 2025-12-11T21:46:51Z

run benchmark csv_writer

alamb-ghbot · 2025-12-11T21:46:57Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing xb/add_ignore_leading_whitespace (60e3f5a) to a67cd19 diff
BENCH_NAME=csv_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench csv_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=xb_add_ignore_leading_whitespace
Results will be posted here when complete

alamb-ghbot · 2025-12-11T21:48:24Z

🤖: Benchmark completed

Details

group                    main                                   xb_add_ignore_leading_whitespace
-----                    ----                                   --------------------------------
record_batches_to_csv    1.03     17.8±0.15µs        ? ?/sec    1.00     17.3±0.16µs        ? ?/sec

alamb

Thanks @xanderbailey -- I think this is a nice change. CI needs to be fixed and I have a suggestion for expanded type support and improved documentation, but I don't think those are necessary before merging this PR

I also started some benchmarks to make sure this doesn't inadvertently slow things down

alamb · 2025-12-11T21:48:48Z

        self.null_value.as_deref().unwrap_or(DEFAULT_NULL_VALUE)
    }

+    /// Set whether to ignore leading whitespace in string values


I would have personally suggested using "strip leading whitespace" but I see this is consistent with Spark

Perhaps you could add an example to this documentation like

/// /// For example, a string values such as " foo" will be written as "foo"

alamb · 2025-12-11T21:49:11Z

+        self.ignore_leading_whitespace
+    }
+
+    /// Set whether to ignore trailing whitespace in string values


Similarly to above, can we add an example?

/// /// For example, a string values such as "foo. " will be written as "foo"

alamb · 2025-12-11T21:52:45Z

+use arrow_schema::*;
+use std::sync::Arc;
+
+fn main() {


thank you for this example. However, I think this might be hard to find

Would you be willing to move this example into a doc example in https://github.com/apache/arrow-rs/blob/7c6a883302551dde7e89bfed1779c74dac677a0a/arrow-csv/src/writer.rs#L20-L19

Perhaps as a way to show how to use options

Happy to. Would you remove this one? Or keep this too?

Have kept this and added an example to the docs inline

alamb · 2025-12-11T21:54:48Z

-                byte_record.push_field(buffer.as_bytes());
+
+                // Apply whitespace trimming if options are enabled and the column is a string type
+                let field_bytes = if should_trim && batch.column(col_idx).data_type() == &DataType::Utf8 {


There are other string types as well, such as LargeUtf8 and Utf8View. Could you please add support (and a test) for them as well?

Ah yes of course! I'm still used to spark types!

alamb · 2025-12-15T18:18:52Z

Thanks @xanderbailey

# Which issue does this PR close? Following on from #8960, we are now exposing the quote style as a part of the csv writer options which allows users to quote columns similar to Spark's `quoteAll` setting.  - Closes #[9003](#9003). # Rationale for this change  # What changes are included in this PR? Expose `QuoteStyle` in the `WriterBuilder`  # Are these changes tested? Yes with examples and unit tests.  # Are there any user-facing changes?

@xanderbailey

# Which issue does this PR close? # Rationale for this change while reviewing @xanderbailey's PR in #8960, I found that there are examples for arrow-csv and they are hard to find. Also each example add extra binaries and thus slows down CI and tests. For example the `whitespace_handling` example makes a new 2.9MB binary: ```shell cargo run -p arrow-csv --example whitespace_handling ... du -s -h target/debug/examples/whitespace_handling 2.9M target/debug/examples/whitespace_handling ``` Let's consolidate the examples to make them easier to find # What changes are included in this PR? 1. Consolidate the examples 2. Improver other csv docs # Are these changes tested? We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 3. Serve as another way to document the expected behavior of the code # Are there any user-facing changes? Docs only, no functional changes

## Which issue does this PR close? - Closes apache#10669 Related arrow-rs PRs apache/arrow-rs#8960 and apache/arrow-rs#9004 ## Rationale for this change The CSV writer was missing support for `quote_style`, `ignore_leading_whitespace`, and `ignore_trailing_whitespace` options that are available on the underlying arrow `WriterBuilder`. This meant users couldn't control quoting behaviour or whitespace trimming when writing CSV files. ## What changes are included in this PR? Adds three new CSV writer options wired through the full stack: - `quote_style` — controls when fields are quoted (`Always`, `Necessary`, `NonNumeric`, `Never`). Modelled as a protobuf enum (`CsvQuoteStyle`). - `ignore_leading_whitespace` — trims leading whitespace from string values on write. - `ignore_trailing_whitespace` — trims trailing whitespace from string values on write. ## Are these changes tested? Yes — sqllogictest coverage added in `csv_files.slt` ## Are there any user-facing changes? Three new `format.*` options available in COPY TO and CREATE EXTERNAL TABLE for CSV: - `format.quote_style` (string: `Always`, `Necessary`, `NonNumeric`, `Never`) - `format.ignore_leading_whitespace` (boolean) - `format.ignore_trailing_whitespace` (boolean)

Add ignore leading and trailing white space to csv parser

60e3f5a

github-actions Bot added the arrow Changes to the arrow crate label Dec 5, 2025

new line

b7ea804

xanderbailey commented Dec 5, 2025

View reviewed changes

alamb approved these changes Dec 11, 2025

View reviewed changes

xanderbailey added 4 commits December 12, 2025 09:53

Update doc strings

a283737

handle other string types

4d80f63

fmt

9046b23

Add example

83c7b5a

alamb merged commit e3a670e into apache:main Dec 15, 2025
23 checks passed

alamb mentioned this pull request Dec 15, 2025

Move examples in arrow-csv to docstrings, polish up docs #9001

Merged

xanderbailey mentioned this pull request Dec 16, 2025

Add quote style to csv writer #9004

Merged

alamb mentioned this pull request Jan 6, 2026

Add spark like ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options to the csv writer #8961

Closed

xanderbailey mentioned this pull request Mar 8, 2026

Add quote style and trimming to csv writier apache/datafusion#20813

Merged

Conversation

xanderbailey commented Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Dec 11, 2025

Uh oh!

alamb-ghbot commented Dec 11, 2025

Uh oh!

alamb-ghbot commented Dec 11, 2025

Uh oh!

alamb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alamb commented Dec 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xanderbailey commented Dec 5, 2025 •

edited

Loading

alamb left a comment •

edited

Loading