Conversation
|
Performance. This PR: On |
| if x isa PooledVector || allunique(refpool) | ||
| return refpool, refarray | ||
| else | ||
| return nothing, nothing |
There was a problem hiding this comment.
@quinnj - can Arrow.jl give us an example of pooled vector with duplicates in refpool?
There was a problem hiding this comment.
It's technically allowed to have duplicates in the dictionary pool; but we could also call allunique in the DataAPI.refpool method in Arrow.jl to avoid this. Perhaps we should make it formally part of the refpool contract?
There was a problem hiding this comment.
That way we ensure other refpool-enabled array types get the same optimization w/o needing to hard-code PooledVector
There was a problem hiding this comment.
Yes that's a possibility. But are we sure no existing implementations have duplicates and that it couldn't be useful? Another solution would be to add a new function to DataAPI to check whether entries are unique. That may be overkill if having duplicates is not useful.
There was a problem hiding this comment.
I don't know, I can't see duplicate values in a pool being useful at all. It's our API, so I say we enforce it. All people would have to do if they allow duplicates is call allunique on their refpool. I think it just comes back to the fact that all our current uses/designs for refpool revolve around it being a unique pool, so we might as well enforce that.
nalimilan
left a comment
There was a problem hiding this comment.
Looks good at least as a short-term measure until we find a general solution.
|
I think we have a general solution, but @quinnj would have to confirm that it is OK with Arrow.jl.
which implies that in this case The question to @quinnj is:
|
|
@nalimilan - just to be 100% sure - for CategoricalArray.jl both |
|
Good idea! Then we already have everything we need. CategoricalArray indeed just wraps its internal dict, so it's very cheap. And types for which computing |
They can do it lazily and dynamically decide if it is worth to compute it. Let me wait for @quinnj to comment if he is OK with this and if yes then I will merge and make a patch release. |
|
Yes, I'm ok w/ this approach. I think we can figure things out on the Arrow.jl side; I don't think it would be bad even to enforce teh uniqueness in Arrow.jl specifically, since the benefit would be great. We could do a uniqueness check/scrub when we do the initial record batch processing. I'll need to think a bit more about invrefpool; I'll ping @dmbates as well since he was the one who implemented the DataAPI methods in Arrow.jl, though I'm not signing him up for any additional work if he doesn't want it 😛 |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
|
Thank you! |
avoid using
alluniqueforPooledArray.