Add ignore leading and trailing white space to csv parser#8960
Add ignore leading and trailing white space to csv parser#8960alamb merged 6 commits intoapache:mainfrom
Conversation
| if self.ignore_leading_whitespace { | ||
| trimmed = trimmed.trim_start(); | ||
| } | ||
| if self.ignore_trailing_whitespace { | ||
| trimmed = trimmed.trim_end(); | ||
| } |
There was a problem hiding this comment.
Very open to suggestions if this is the right way to go about this. It looks like the other options are passed into FormatOptions but that's used much more widely than just the CSV reader...
|
run benchmark csv_writer |
|
🤖 |
|
🤖: Benchmark completed Details
|
There was a problem hiding this comment.
Thanks @xanderbailey -- I think this is a nice change. CI needs to be fixed and I have a suggestion for expanded type support and improved documentation, but I don't think those are necessary before merging this PR
I also started some benchmarks to make sure this doesn't inadvertently slow things down
| self.null_value.as_deref().unwrap_or(DEFAULT_NULL_VALUE) | ||
| } | ||
|
|
||
| /// Set whether to ignore leading whitespace in string values |
There was a problem hiding this comment.
I would have personally suggested using "strip leading whitespace" but I see this is consistent with Spark
Perhaps you could add an example to this documentation like
///
/// For example, a string values such as " foo" will be written as "foo"
| self.ignore_leading_whitespace | ||
| } | ||
|
|
||
| /// Set whether to ignore trailing whitespace in string values |
There was a problem hiding this comment.
Similarly to above, can we add an example?
///
/// For example, a string values such as "foo. " will be written as "foo"
| use arrow_schema::*; | ||
| use std::sync::Arc; | ||
|
|
||
| fn main() { |
There was a problem hiding this comment.
thank you for this example. However, I think this might be hard to find
Would you be willing to move this example into a doc example in https://github.com/apache/arrow-rs/blob/7c6a883302551dde7e89bfed1779c74dac677a0a/arrow-csv/src/writer.rs#L20-L19
Perhaps as a way to show how to use options
There was a problem hiding this comment.
Happy to. Would you remove this one? Or keep this too?
There was a problem hiding this comment.
Have kept this and added an example to the docs inline
| byte_record.push_field(buffer.as_bytes()); | ||
|
|
||
| // Apply whitespace trimming if options are enabled and the column is a string type | ||
| let field_bytes = if should_trim && batch.column(col_idx).data_type() == &DataType::Utf8 { |
There was a problem hiding this comment.
There are other string types as well, such as LargeUtf8 and Utf8View. Could you please add support (and a test) for them as well?
There was a problem hiding this comment.
Ah yes of course! I'm still used to spark types!
|
Thanks @xanderbailey |
# Which issue does this PR close? Following on from #8960, we are now exposing the quote style as a part of the csv writer options which allows users to quote columns similar to Spark's `quoteAll` setting. <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Closes #[9003](#9003). # Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> # What changes are included in this PR? Expose `QuoteStyle` in the `WriterBuilder` <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> # Are these changes tested? Yes with examples and unit tests. <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. -->
# Which issue does this PR close? # Rationale for this change while reviewing @xanderbailey's PR in #8960, I found that there are examples for arrow-csv and they are hard to find. Also each example add extra binaries and thus slows down CI and tests. For example the `whitespace_handling` example makes a new 2.9MB binary: ```shell cargo run -p arrow-csv --example whitespace_handling ... du -s -h target/debug/examples/whitespace_handling 2.9M target/debug/examples/whitespace_handling ``` Let's consolidate the examples to make them easier to find # What changes are included in this PR? 1. Consolidate the examples 2. Improver other csv docs # Are these changes tested? We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 3. Serve as another way to document the expected behavior of the code # Are there any user-facing changes? Docs only, no functional changes
## Which issue does this PR close? - Closes apache#10669 Related arrow-rs PRs apache/arrow-rs#8960 and apache/arrow-rs#9004 ## Rationale for this change The CSV writer was missing support for `quote_style`, `ignore_leading_whitespace`, and `ignore_trailing_whitespace` options that are available on the underlying arrow `WriterBuilder`. This meant users couldn't control quoting behaviour or whitespace trimming when writing CSV files. ## What changes are included in this PR? Adds three new CSV writer options wired through the full stack: - `quote_style` — controls when fields are quoted (`Always`, `Necessary`, `NonNumeric`, `Never`). Modelled as a protobuf enum (`CsvQuoteStyle`). - `ignore_leading_whitespace` — trims leading whitespace from string values on write. - `ignore_trailing_whitespace` — trims trailing whitespace from string values on write. ## Are these changes tested? Yes — sqllogictest coverage added in `csv_files.slt` ## Are there any user-facing changes? Three new `format.*` options available in COPY TO and CREATE EXTERNAL TABLE for CSV: - `format.quote_style` (string: `Always`, `Necessary`, `NonNumeric`, `Never`) - `format.ignore_leading_whitespace` (boolean) - `format.ignore_trailing_whitespace` (boolean)
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
ignoreLeadingWhiteSpaceandignoreTrailingWhiteSpaceoptions to the csv writer #8961Rationale for this change
Spark's csv writer can do this and it would be nice to have feature parity in projects like datafusion.
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
What changes are included in this PR?
Adds two new options for
ignore_leading_whitespaceandignore_trailing_whitespaceto csv writer.There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
Are these changes tested?
Yes
We typically require tests for all PRs in order to:
If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.