Use filter (filter_record_batch) instead of take to avoid using indices #2218
yjshen merged 10 commits into apache:master
tustvold
left a comment
This makes a lot of sense to me. Historically, filter was significantly slower than the take kernels, which might explain why this code was written this way, but since apache/arrow-rs#1248 it makes sense to use the upstream logic. This will also ensure DataFusion benefits from any further improvements made there.
selection: &BooleanArray,
) -> Result<ColumnarValue> {
    if selection.iter().all(|b| b == Some(true)) {
    let filter_count = selection
The upstream kernel already performs this optimisation (see here), so you can probably elide counting the bits again.
I thought this as well, but the earlier fast path also avoided the scatter path in case of all values being true.
The filter count cannot easily be reused between this code and the filter kernel.
I am also OK with removing this path, for making the code a bit simpler.
> also avoided the scatter path in case of all values being true.
You could inspect the length of the returned filtered array instead of counting the bits twice?
That is a good suggestion 👍
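The suggestion in this thread can be sketched without Arrow at all. The snippet below is a minimal plain-Rust model, not the arrow-rs kernel: `filter_values` and `filter_and_detect_all_selected` are hypothetical names, and it assumes that a null mask entry drops the row. The all-true case is detected from the length of the filtered output, so the mask bits are not counted a second time.

```rust
/// Hypothetical stand-in for a filter kernel (not the arrow-rs API):
/// keeps `values[i]` when `mask[i]` is `Some(true)`; `Some(false)` and
/// `None` (a null mask entry) both drop the row.
fn filter_values(values: &[i64], mask: &[Option<bool>]) -> Vec<i64> {
    assert_eq!(values.len(), mask.len(), "mask must match the input length");
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep == Some(true))
        .map(|(v, _)| *v)
        .collect()
}

/// Filters once, then infers "every row was selected" from the output
/// length instead of scanning the mask a second time.
fn filter_and_detect_all_selected(values: &[i64], mask: &[Option<bool>]) -> (Vec<i64>, bool) {
    let filtered = filter_values(values, mask);
    let all_selected = filtered.len() == values.len();
    (filtered, all_selected)
}
```

A caller could use the returned flag to skip the scatter path when nothing was filtered out, which is the fast path discussed above.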
tustvold
left a comment
I think we shouldn't be replicating functionality that exists upstream to handle nulls and special-case full or empty selections. Aside from being slower, it is more code.
tustvold
left a comment
Thank you for bearing with me 😄
yjshen
left a comment
This makes a lot of sense to me. It's great to read both the code and the discussions.
    .collect::<ArrowResult<Vec<Arc<dyn Array>>>>()?;

let tmp_batch = RecordBatch::try_new(batch.schema(), tmp_columns)?;
let tmp_batch = filter_record_batch(batch, selection)?;
Cool, you do have a great sense of code smell!
TIL the new filter kernel, it's great to read.
Thanks again for sharing this with me @Dandandan @tustvold @alamb. Amazing team ❤️
Which issue does this PR close?
n/a
Rationale for this change
We can use the more efficient filter kernel rather than building an indices array and using take.
Referenced by @alamb in apache/arrow-rs#1542
What changes are included in this PR?
Use filter instead of take.
This has the following benefits (besides removing some code):
- The intermediate Vec and UInt64Array of indices no longer need to be built
- The filter kernel itself is faster than take.
Are there any user-facing changes?
No
No
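To make the change described above concrete, here is a minimal sketch in plain Rust that models the two approaches on ordinary slices. It deliberately does not use the arrow-rs API: `take_by_indices` and `filter_by_mask` are hypothetical names, and the mask is assumed to contain no nulls. The point is only that the index-based path allocates an intermediate index buffer, while the mask-based path does not.

```rust
/// Old approach (modelled): materialise the selected row indices first,
/// then gather the values at those indices.
fn take_by_indices(values: &[i64], mask: &[bool]) -> Vec<i64> {
    // Intermediate index buffer, analogous to the Vec / UInt64Array
    // mentioned in the PR description.
    let indices: Vec<u64> = mask
        .iter()
        .enumerate()
        .filter(|(_, keep)| **keep)
        .map(|(i, _)| i as u64)
        .collect();
    indices.iter().map(|&i| values[i as usize]).collect()
}

/// New approach (modelled): apply the boolean mask directly,
/// with no intermediate index buffer.
fn filter_by_mask(values: &[i64], mask: &[bool]) -> Vec<i64> {
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}
```

Both functions return the same rows for the same mask; the difference is the extra allocation and second pass in the index-based version.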