FileStream: Open next file in parallel while decoding #5161
Conversation
alamb
left a comment
Looks good to me @thinkharderdev -- thank you. It would be great to figure out some way to test this PR (mostly to ensure we don't break this behavior in the future), but I don't have any clever ideas on how to do so.
I went through the logic in detail.
I left some suggestions for comments to clarify the intent, which I think would be valuable but are not necessary.
cc @tustvold
```rust
        partition_values,
    }
}
None => return Poll::Ready(None),
```

Suggested change:

```rust
// No more input files
None => return Poll::Ready(None),
```
```rust
Ok(reader) => {
    let partition_values = mem::take(partition_values);

    let next = self.next_file().transpose();
```

Suggested change:

```rust
// begin opening next file
let next = self.next_file().transpose();
```
```rust
/// The reader instance
reader: BoxStream<'static, Result<RecordBatch, ArrowError>>,
/// A [`FileOpenFuture`] for the next file to be processed
next: Option<(FileOpenFuture, Vec<ScalarValue>)>,
```
I wonder if we could make this more future-proof by prefetching n files instead of 1? In cases where file opening is slower than scanning/processing (e.g. many small files), this could make a difference.
Perhaps a follow-on PR could turn this into a stream and use `StreamExt::buffered` or something similar.
Yeah, that seems like a good idea.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Force-pushed from f8a339d to fa60853
Let's file a ticket for the "buffer N items at a time" idea and work on it as a follow-on PR.

Thanks again @thinkharderdev

Benchmark runs are scheduled for baseline = 48732b4 and contender = 816a0f8. 816a0f8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Added #5209
Which issue does this PR close?
Closes #5129
Rationale for this change
File opening is mostly IO (and may involve a fair amount of sequential IO), so it can likely be overlapped well with decoding. We should therefore open the next file in parallel while decoding the current file in `FileStream`.
What changes are included in this PR?
Are these changes tested?
I think this should be covered by existing tests
Are there any user-facing changes?
`FileStreamMetrics.time_opening` is a slightly different metric now: it no longer captures all time spent opening, but rather time spent opening while not concurrently decoding.