Conversation
ozankabak
left a comment
There was a problem hiding this comment.
I'm not sure if we lose generality by doing this once at the listing table factory level (instead of the old way of doing it in create_physical_plan). I'd appreciate thoughts from @devinjdangelo and @alamb
| .has_header | ||
| .unwrap_or(state.config_options().catalog.has_header), | ||
| ) | ||
| .with_header(first_chunk && self.options.has_header) |
There was a problem hiding this comment.
What if there is a catalog option on header? This might not be tested in the code since the SQL tests wouldn't cover here.
There was a problem hiding this comment.
Thanks @metegenez for the review :) I have tried to extract the option from the catalog in the most inclusive way possible. This step seems to be when we create a TableProvider.
Another alternative would be to consult the catalog wherever we read the header from a FileFormat. However, doing this everywhere makes the code cluttered (in many places, there isn't even a catalog object available). Therefore, I think that if we check this once during the execution, where FileFormat is first introduced—that is, at the part where TableProvider is created—we won't need to check it again. Are there any use cases for creating and using FileFormat independently of any TableProvider?
alamb
left a comment
There was a problem hiding this comment.
Thank you for this contribution @berkaysynnada
In general I think we will be able to merge these PRs faster if you can break them up into multiple PRs that are not related. I realize this takes more work on the author's behalf (aka you) but it results in more efficient reviews because it is easier to see the effects of a particular change (rather than have three separate changes intermixed)
The change to serializing proto encoding and cargo.tom looks good to me. I don't fully understand the rationale or implications to change the config value from Option -- the way it currently is (if not explicitly set fall back to value on ConfigOptions) makes more sense to me
| md-5 = { version = "^0.10.0", optional = true } | ||
| rand = { workspace = true } | ||
| regex = { worksapce = true, optional = true } | ||
| regex = { workspace = true, optional = true } |
There was a problem hiding this comment.
Thank you -- that is a nice fix -- it would be easier to merge as a separate PR
| // Options controlling CSV format | ||
| message CsvOptions { | ||
| bytes has_header = 1; // Indicates if the CSV has a header row | ||
| bool has_header = 1; // Indicates if the CSV has a header row |
|
Maybe I can help clarify things a little bit. The purpose of the PR is not to change the current fallback behavior, just simplify its implementation. @berkaysynnada is moving the fallback logic to The upside of the approach in this PR is that it simplifies the code across multiple files (e.g. proto), but the fallback behavior no longer becomes automatic for all custom table providers as far as I can tell -- @berkaysynnada please correct me if I'm wrong. Maybe we can do the following: Instead of making every |
Thank you for the explanation -- what you describe makes sense -- maybe we can update the documentation to add some of this context and that would be good enough. I haven't thought through all the nuances of general fallbacks, but given what you say perhaps it is a corner case we can worry about if/when somone actually hits it. From my perspective the most important thing is for the behavior to be clearly documented. |
This explanation summarizes it very well. Currently, the extraction of catalog information is encapsulated in the |
|
After talking with @ozankabak, we decided to undo the changes related to header options. There's also another place where SessionState values during TableProvider creation might not be safe. I'm not sure if removing SessionState from this API is the right move, but it definitely needs more thought. Rushing to clean up the code could lead to losing functionality. We should discuss these in a different issue.
For now, this PR only makes a small fix to |
ozankabak
left a comment
There was a problem hiding this comment.
SGTM, let's update the PR title and body prior to merging
d83ddf0 to
c47dca6
Compare
c47dca6 to
5e2e023
Compare
alamb
left a comment
There was a problem hiding this comment.
Thank you @berkaysynnada -- this change for cast looks good to me
I also double checked that casting bool to integer preserves ordering:
> create table t(b boolean, x integer);
0 row(s) fetched.
Elapsed 0.006 seconds.
> insert into t values (true, true::int);
+-------+
| count |
+-------+
| 1 |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.
> insert into t values (false, false::int);
+-------+
| count |
+-------+
| 1 |
+-------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select * from t order by b;
+-------+---+
| b | x |
+-------+---+
| false | 0 |
| true | 1 |
+-------+---+
2 row(s) fetched.
Elapsed 0.009 seconds.
> select * from t order by x;
+-------+---+
| b | x |
+-------+---+
| false | 0 |
| true | 1 |
+-------+---+
2 row(s) fetched.* header option removed * Update csv.rs * Update path_partition.rs * Update path_partition.rs * adding test * adding test * Revert
Which issue does this PR close?
Closes #.
Rationale for this change
A minor fix in the
CastExprordering. Boolean to numeric cast needs to preserve ordering.What changes are included in this PR?
Fixed
CastExprordering for Boolean to numeric castAre these changes tested?
with existing tests
Are there any user-facing changes?