Use `filter` (filter_record_batch) instead of `take` to avoid using indices by Dandandan · Pull Request #2218 · apache/datafusion

Dandandan · 2022-04-12T18:49:51Z

Which issue does this PR close?

n/a

Rationale for this change

We can use the more efficient filter kernel rather than building an indices array and calling take.
Referenced by @alamb in apache/arrow-rs#1542

What changes are included in this PR?

Use filter instead of take.

This has the following benefits (besides removing some code):

- It also uses the faster counting and has some extra fast paths.
- It avoids materializing the indices in both a Vec and a UInt64Array.
- The filter kernel itself is faster than take.
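To make the difference concrete, here is a minimal std-only sketch of the two strategies on plain slices. It is not the arrow-rs implementation: `take_by_indices` and `filter_by_mask` are hypothetical names, and a `&[bool]` stands in for a `BooleanArray` without nulls.

```rust
/// Old approach (sketch): materialize the selected row indices, then gather.
fn take_by_indices(values: &[i32], mask: &[bool]) -> Vec<i32> {
    // Extra allocation: one u64 index per selected row.
    let indices: Vec<u64> = mask
        .iter()
        .enumerate()
        .filter(|(_, keep)| **keep)
        .map(|(i, _)| i as u64)
        .collect();
    // Second pass: gather the values at those indices.
    indices.iter().map(|&i| values[i as usize]).collect()
}

/// New approach (sketch): apply the mask directly, with no index buffer.
fn filter_by_mask(values: &[i32], mask: &[bool]) -> Vec<i32> {
    // Count the selected rows up front so the output is allocated once.
    let selected = mask.iter().filter(|keep| **keep).count();
    let mut out = Vec::with_capacity(selected);
    for (value, keep) in values.iter().zip(mask) {
        if *keep {
            out.push(*value);
        }
    }
    out
}
```

Both functions return the same rows; the second simply skips the intermediate index buffer, which is the allocation this PR removes.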

Are there any user-facing changes?

No

tustvold

This makes a lot of sense to me. Historically, filter was significantly slower than the take kernels, which might explain why this code was written this way, but since apache/arrow-rs#1248 it makes sense to use the upstream logic. This will also ensure DataFusion benefits from any further improvements made there.

tustvold · 2022-04-12T19:33:50Z

        selection: &BooleanArray,
    ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
+        let filter_count = selection


Upstream already performs this optimisation (see here), so you can probably elide counting the bits again.

I thought so as well, but the earlier fast path also avoided the scatter path when all values are true.
The filter count cannot easily be reused between this code and the filter kernel.

I am also OK with removing this path, for making the code a bit simpler.

> also avoided the scatter path in case of all values being true.

You could inspect the length of the returned filtered array instead of counting the bits twice?

That is a good suggestion 👍
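The suggestion above can be sketched as follows. This is a simplified std-only illustration, not the DataFusion code: `evaluate_selection` here is a hypothetical stand-in that works on slices, and doubling each value stands in for evaluating an expression on the filtered batch.

```rust
/// Sketch: filter, evaluate on the selected rows, then scatter the results
/// back to the original row positions (None for unselected rows).
fn evaluate_selection(values: &[i32], selection: &[bool]) -> Vec<Option<i32>> {
    // Assumption: one selection flag per input row.
    debug_assert_eq!(values.len(), selection.len());

    // Filter once; no separate pass to count the set bits.
    let filtered: Vec<i32> = values
        .iter()
        .zip(selection)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect();

    // Stand-in for evaluating the expression on the filtered rows.
    let evaluated: Vec<i32> = filtered.iter().map(|v| v * 2).collect();

    // If the filtered output is as long as the input, every row was
    // selected, so the scatter step can be skipped.
    if evaluated.len() == values.len() {
        return evaluated.into_iter().map(Some).collect();
    }

    // Otherwise scatter: selected rows consume the next result in order.
    let mut results = evaluated.into_iter();
    selection
        .iter()
        .map(|keep| if *keep { results.next() } else { None })
        .collect()
}
```

The length comparison reuses work the filter already did, which is why it can replace a second count over the selection bits.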

alamb

cc @yjshen

tustvold

I think we shouldn't be replicating functionality that exists upstream to handle nulls and to special-case full or empty selections. Aside from being slower, it is more code.
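As an illustration of why one shared routine suffices, here is a small std-only sketch. It assumes that a null entry in the selection is treated as "not selected"; `filter_with_nulls` is a hypothetical name, and `Option<bool>` stands in for a nullable boolean array.

```rust
/// Sketch: filter with a nullable mask, keeping only rows whose selection
/// entry is exactly Some(true). Null (None) entries are dropped, which is
/// the assumption stated in the lead-in.
fn filter_with_nulls(values: &[i32], selection: &[Option<bool>]) -> Vec<i32> {
    values
        .iter()
        .zip(selection)
        .filter(|(_, keep)| **keep == Some(true))
        .map(|(v, _)| *v)
        .collect()
}
```

Full and empty selections fall out of the same code path without caller-side branches: an all-`Some(true)` mask returns every row and an all-`None` mask returns none.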

tustvold

Thank you for bearing with me 😄

yjshen

This makes a lot of sense to me. It's great to read both the code and the discussions.

yjshen · 2022-04-13T00:59:32Z

-            .collect::<ArrowResult<Vec<Arc<dyn Array>>>>()?;
-
-        let tmp_batch = RecordBatch::try_new(batch.schema(), tmp_columns)?;
+        let tmp_batch = filter_record_batch(batch, selection)?;


Cool, you do have a great sense of code smell!

TIL about the new filter kernel; it's great to read.

yjshen · 2022-04-13T01:06:31Z

Thanks again for sharing this with me @Dandandan @tustvold @alamb. Amazing team ❤️

alamb

👌 Nice

alamb · 2022-04-13T12:45:42Z

-            .collect::<ArrowResult<Vec<Arc<dyn Array>>>>()?;
-
-        let tmp_batch = RecordBatch::try_new(batch.schema(), tmp_columns)?;
+        let tmp_batch = filter_record_batch(batch, selection)?;


Use filter instead of take to avoid using indices

14590fb

github-actions bot added the datafusion label Apr 12, 2022

Dandandan requested review from alamb and yjshen April 12, 2022 18:54

Simplify / speed up by performing it on a record batch

0cddc19

Dandandan changed the title ~~Use filter instead of take to avoid using indices~~ Use filter (filter_record_batch) instead of take to avoid using indices Apr 12, 2022

Dandandan added 2 commits April 12, 2022 21:05

Fix

c30d822

Bring back special case

3fb7c5d

tustvold approved these changes Apr 12, 2022

alamb approved these changes Apr 12, 2022

Handle selection array with nulls

812ced5

tustvold reviewed Apr 12, 2022

Comment thread datafusion/physical-expr/src/physical_expr.rs Outdated

Simplify

324de1f

tustvold requested changes Apr 12, 2022

Use size of output from filter to check filter count

11f57ce

tustvold reviewed Apr 12, 2022

Comment thread datafusion/physical-expr/src/physical_expr.rs Outdated

Simplify

f595a46

tustvold approved these changes Apr 12, 2022

Dandandan added 2 commits April 12, 2022 22:25

Fix import in tests

a8b6182

Clippy

767dabe

yjshen approved these changes Apr 13, 2022

yjshen merged commit 6d75948 into apache:master Apr 13, 2022

alamb reviewed Apr 13, 2022
