Skip to content

Use filter (filter_record_batch) instead of take to avoid using indices#2218

Merged
yjshen merged 10 commits intoapache:masterfrom
Dandandan:improve_code
Apr 13, 2022
Merged

Use filter (filter_record_batch) instead of take to avoid using indices#2218
yjshen merged 10 commits intoapache:masterfrom
Dandandan:improve_code

Conversation

@Dandandan
Copy link
Copy Markdown
Contributor

@Dandandan Dandandan commented Apr 12, 2022

Which issue does this PR close?

n/a

Rationale for this change

We can use the more efficient filter kernel than building an indices array and using take.
Referenced by @alamb in apache/arrow-rs#1542

What changes are included in this PR?

Use filter instead of take.

This has the following benefits (besides removing some code)

  • This also uses the faster counting, and has some extra fast paths.
  • Avoids materializing the indices both in Vec and UInt64Array
  • The filter kernel itself is faster / than take.

Are there any user-facing changes?

No

No

@Dandandan Dandandan requested review from alamb and yjshen April 12, 2022 18:54
@Dandandan Dandandan changed the title Use filter instead of take to avoid using indices Use filter (filter_record_batch) instead of take to avoid using indices Apr 12, 2022
Copy link
Copy Markdown
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a lot of sense to me, historically filter used to be significantly slower than the take kernels which might explain why this code was like this, but since apache/arrow-rs#1248 it makes sense to use the upstream logic. This will also ensure DataFusion benefits from any further improvements made there

selection: &BooleanArray,
) -> Result<ColumnarValue> {
if selection.iter().all(|b| b == Some(true)) {
let filter_count = selection
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upstream already performs this optimisation, see here and so you can probably elide counting the bits again.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this as well, but the earlier fast path also avoided the scatter path in case of all values being true.
The filter count can not easily be reused between this code and the filter kernel.

I am also OK with removing this path, for making the code a bit simpler.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also avoided the scatter path in case of all values being true.

You could inspect the length of the returned filtered array instead of counting the bits twice?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good suggestion 👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @yjshen

Comment thread datafusion/physical-expr/src/physical_expr.rs Outdated
Copy link
Copy Markdown
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't be replicating functionality that exists upstream to handle nulls and special case full or empty selections. Aside from being slower, it is more code

Comment thread datafusion/physical-expr/src/physical_expr.rs Outdated
Copy link
Copy Markdown
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for bearing with me 😄

Copy link
Copy Markdown
Member

@yjshen yjshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a lot of sense to me. It's great to read both the code and the discussions.

.collect::<ArrowResult<Vec<Arc<dyn Array>>>>()?;

let tmp_batch = RecordBatch::try_new(batch.schema(), tmp_columns)?;
let tmp_batch = filter_record_batch(batch, selection)?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, you do have a great sense of code smell!

TIL the new filter kernel, it's great to read.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👃 👍

@yjshen yjshen merged commit 6d75948 into apache:master Apr 13, 2022
@yjshen
Copy link
Copy Markdown
Member

yjshen commented Apr 13, 2022

Thanks again for sharing this with me @Dandandan @tustvold @alamb. Amazing team ❤️

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌 Nice

.collect::<ArrowResult<Vec<Arc<dyn Array>>>>()?;

let tmp_batch = RecordBatch::try_new(batch.schema(), tmp_columns)?;
let tmp_batch = filter_record_batch(batch, selection)?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👃 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants