Add quote style and trimming to csv writier#20813
Add quote style and trimming to csv writier#20813xanderbailey wants to merge 12 commits intoapache:mainfrom
Conversation
| pub quote_style: i32, | ||
| /// Whether to ignore leading whitespace in string values | ||
| #[prost(bytes = "vec", tag = "21")] | ||
| pub ignore_leading_whitespace: ::prost::alloc::vec::Vec<u8>, |
There was a problem hiding this comment.
Following the pattern here for other bools being Vec
| pub quote_style: i32, | ||
| /// Whether to ignore leading whitespace in string values | ||
| #[prost(bytes = "vec", tag = "21")] | ||
| pub ignore_leading_whitespace: ::prost::alloc::vec::Vec<u8>, |
There was a problem hiding this comment.
Following the pattern here for other bools being Vec
| Ok(CsvQuoteStyleProto::Always) => CsvQuoteStyle::Always, | ||
| Ok(CsvQuoteStyleProto::NonNumeric) => CsvQuoteStyle::NonNumeric, | ||
| Ok(CsvQuoteStyleProto::Never) => CsvQuoteStyle::Never, | ||
| _ => CsvQuoteStyle::Necessary, |
There was a problem hiding this comment.
We don't error on:
compression: match proto.compression {
0 => CompressionTypeVariant::GZIP,
1 => CompressionTypeVariant::BZIP2,
2 => CompressionTypeVariant::XZ,
3 => CompressionTypeVariant::ZSTD,
_ => CompressionTypeVariant::UNCOMPRESSED,
},
So made the same true here
|
@Jefffrey are you maybe able to take a look here? |
rtyler
left a comment
There was a problem hiding this comment.
I have done some integration testing of this PR since it solves a problem one of my customers was having as well. I think it's good to merge, with a squashing of course 😈
I ended up backporting it to 52.4.0 which applied cleanly as well. I notice that arrow 57.2 actually introduced QuoteStyle which means this change requires arrow-csv 57.2.0 or later
Digging deeper into load failures I discovered that while arrow-csv and other Rust CSV parsers can handle some characters properly, some RDMS and C-based CSV parsers (older) cannot, without quoting. This relies on a backport of apache/datafusion#20813 to the 52.4.0 release in order to support this. It also requires arrow-csv 57.2.0 or later since that release is the first that `QuoteStyle` exists.
|
@alamb could I bother you for a review here please? |
| pub ignore_leading_whitespace: Option<bool>, default = None | ||
| /// Whether to ignore trailing whitespace in string values when writing CSV. | ||
| pub ignore_trailing_whitespace: Option<bool>, default = None | ||
| /// Specifies whether newlines in (quoted) values are supported. |
There was a problem hiding this comment.
I'm wondering why we have this trend of doing Option<bool> instead of just bool 🤔
What is the default in case of None, and how is that different from just true/false (whichever it is)?
There was a problem hiding this comment.
I was also curious about this but didn't want to break the pattern...
There was a problem hiding this comment.
Is it some ser-deser quirk? I'll see if I can figure this out.
There was a problem hiding this comment.
No I can't find a good reason for this... Would you like me to change or keep the pattern?
There was a problem hiding this comment.
Perhaps its to do with hierarchy of configs and/or sql parsing 🤔
@alamb do you happen to know a reason for having Option<bool> instead of plain bool for cases where it'll end up being true or false anyway? (i.e. None doesn't represent a third state, but eventually maps to either true or false)
There was a problem hiding this comment.
I fired a question off into the Discord to see if anyone has some more insight; ideally I'd like to get this config right the first time to avoid needing a breaking change later (or forever be stuck with this awkward interface if it turns out to be unnecessary)
There was a problem hiding this comment.
Should be fine to leave them as Option<bool>; but would be good to have a docstring explaining what the default is in this case
Jefffrey
left a comment
There was a problem hiding this comment.
Changes look good to me, just want to get a second opinion on Option<bool> and which way is preferable
| pub ignore_leading_whitespace: Option<bool>, default = None | ||
| /// Whether to ignore trailing whitespace in string values when writing CSV. | ||
| pub ignore_trailing_whitespace: Option<bool>, default = None | ||
| /// Specifies whether newlines in (quoted) values are supported. |
There was a problem hiding this comment.
I fired a question off into the Discord to see if anyone has some more insight; ideally I'd like to get this config right the first time to avoid needing a breaking change later (or forever be stuck with this awkward interface if it turns out to be unnecessary)
| pub ignore_leading_whitespace: Option<bool>, default = None | ||
| /// Whether to ignore trailing whitespace in string values when writing CSV. | ||
| pub ignore_trailing_whitespace: Option<bool>, default = None | ||
| /// Specifies whether newlines in (quoted) values are supported. |
There was a problem hiding this comment.
Should be fine to leave them as Option<bool>; but would be good to have a docstring explaining what the default is in this case
Which issue does this PR close?
Related arrow-rs PRs apache/arrow-rs#8960 and apache/arrow-rs#9004
Rationale for this change
The CSV writer was missing support for
quote_style,ignore_leading_whitespace, andignore_trailing_whitespaceoptions that are available on the underlying arrowWriterBuilder. This meant users couldn't control quoting behaviour or whitespace trimming when writing CSV files.What changes are included in this PR?
Adds three new CSV writer options wired through the full stack:
quote_style— controls when fields are quoted (Always,Necessary,NonNumeric,Never). Modelled as a protobuf enum (CsvQuoteStyle).ignore_leading_whitespace— trims leading whitespace from string values on write.ignore_trailing_whitespace— trims trailing whitespace from string values on write.Are these changes tested?
Yes — sqllogictest coverage added in
csv_files.sltAre there any user-facing changes?
Three new
format.*options available in COPY TO and CREATE EXTERNAL TABLE for CSV:format.quote_style(string:Always,Necessary,NonNumeric,Never)format.ignore_leading_whitespace(boolean)format.ignore_trailing_whitespace(boolean)