Skip to content

Read CSV format text from stdin or memory#54

Merged
alamb merged 1 commit intoapache:masterfrom
heymind:csvread
Apr 26, 2021
Merged

Read CSV format text from stdin or memory#54
alamb merged 1 commit intoapache:masterfrom
heymind:csvread

Conversation

@heymind
Copy link
Copy Markdown
Contributor

@heymind heymind commented Apr 25, 2021

Migrate from apache/arrow#10066.

Which issue does this PR close?

Close Jira issue ARROW-12306. Closes #198

Rationale for this change

Let CsvFile and CsvExec support reading from a reader (not only files)

What changes are included in this PR?

This pr adds the following new pub functions:

  • CsvFile::try_new_from_reader
  • CsvFile::try_new_from_reader_infer_schema
  • CsvExec::try_new_from_reader
  • CsvStream::try_new_from_reader

Are there any user-facing changes?

No.

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is good -- the only thing that is wonky to me is that if you Clone one of these readers it won't work. However, in this case the error will be clear so I think this PR can be merged and if anyone cares about the Clone behavior we can fix it in a follow on PR.

Thanks again @heymind

What do you think @Dandandan / @andygrove ?

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

Merging #54 (10f8c3b) into master (9ba214a) will decrease coverage by 0.04%.
The diff coverage is 73.29%.

❗ Current head 10f8c3b differs from pull request most recent head c938270. Consider uploading reports for the commit c938270 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
- Coverage   76.24%   76.20%   -0.05%     
==========================================
  Files         134      134              
  Lines       23051    23199     +148     
==========================================
+ Hits        17576    17679     +103     
- Misses       5475     5520      +45     
Impacted Files Coverage Δ
datafusion/src/datasource/csv.rs 73.33% <69.86%> (-9.28%) ⬇️
datafusion/src/physical_plan/csv.rs 79.09% <75.72%> (-4.13%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ba214a...c938270. Read the comment docs.

Path(String),

/// Read CSV data from a reader
Reader(Mutex<Option<Box<dyn Read + Send + Sync + 'static>>>),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would this look if we would support multiple readers/partitions?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I am wondering, if we do it like this, we would we need to do the same for json/xml/etc sources that should support essentially the same to avoid reimplementing it for each format.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we file a follow on ticket for this work (supporting Reader input for JSON / XML sources). As you say I don't think it needs to be part of this PR.

@Dandandan
Copy link
Copy Markdown
Contributor

Looks good to me, thanks @heymind .
Seems like a cool addition to integrate with much more sources. I have some small questions about partitioning and re-use across formats, but that shouldn't be a blocker.

@alamb alamb merged commit 3855473 into apache:master Apr 26, 2021
@houqp houqp added api change Changes the API exposed to users of the crate ballista enhancement New feature or request labels Jul 29, 2021
unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
ewgenius pushed a commit to ewgenius/datafusion that referenced this pull request Apr 23, 2025
fix: More dangling references (apache#54)

* fix: More dangling references

* test: Add tests for remove_dangling_identifiers

UPSTREAM NOTE: This PR was attempted to be upstreamed in apache#13405 - but it was not accepted due to the complexity it brought. Phillip needs to figure out what a good solution that solves our problem and can be upstreamed is.
aditanase pushed a commit to hstack/datafusion that referenced this pull request May 18, 2025
fix: More dangling references (apache#54)

* fix: More dangling references

* test: Add tests for remove_dangling_identifiers

UPSTREAM NOTE: This PR was attempted to be upstreamed in apache#13405 - but it was not accepted due to the complexity it brought. Phillip needs to figure out what a good solution that solves our problem and can be upstreamed is.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Read CSV format text from stdin or memory

5 participants