Optimize completecases to process only missingable columns#2726
Optimize completecases to process only missingable columns#2726bkamins merged 7 commits intoJuliaData:mainfrom
completecases to process only missingable columns#2726Conversation
|
Looks good. The comments are minor. |
|
While benchmarking, I've found a much bigger problem. I suspect that this line is not aware of df[!, i] type and .!ismissing.(df[!, i]) is slow.
Few benchamarks: df = DataFrame(
a = rand(1000000), b = rand(Int, 1000000),
c = allowmissing(rand(1000000)), d = repeat([1:9; missing], 100000)
)main branch timings: Timings for res = trues(size(df, 1))
for i in 1:size(df, 2)
v = df[!, i]
if Missing <: eltype(v)
res .&= .!ismissing.(v)
end
endand function completecases(df::AbstractDataFrame, col::ColumnIndex)
v = df[!, col]
if Missing <: eltype(v)
return .!ismissing.(v)
else
return trues(size(df, 1))
end
endTiming for for i in 1:size(df, 2)
v = df[!, i]
if Missing <: eltype(v)
res .&= aux(v)
end
endand aux(v) = .!ismissing.(v)To sum up, optimization with processing only missingable columns brought expected gain in performance. |
|
Good catch. However, it is not related to function barrier. The core of the problem is the following I think: The most likely reason is that we use here However, then it should be better to allocate an @mbauman - the problem with broadcast fusion performance that we have here probably cannot be helped and we have to resolve it manually - right? |
`df[!, i]` extracted to variable `res .= .!ismissing.(v)` splited for performance into two lines
|
As you suggested I've added and preallocated |
|
Great. Looks good. Could you please add tests to make sure we cover all possible scenarios properly? Thank you! |
|
@pstorozenko - do you plan to work on this? (no rush, but usually stalled PRs get outdated and it is hard to get back to them after some time). Thank you! |
|
Yes, but you're right, thanks for pinging. I had to think for a while, what needs testing. |
|
Looks good. I have edited the tests a bit. |
|
Sure thing, I tried to mimic the design of the rest of tests, but a little refactor is always good. |
I noticed, but these tests were written very long time ago and I thought to clean them up a bit. |
|
I pushed one additional fix that cleans inference issues. |
| res = BitVector(undef, size(df, 1)) | ||
| res .= .!ismissing.(v) | ||
| return res |
There was a problem hiding this comment.
Why not just this?
| res = BitVector(undef, size(df, 1)) | |
| res .= .!ismissing.(v) | |
| return res | |
| return .!ismissing.(v) |
There was a problem hiding this comment.
because it is not type stable. I have just reversed this. The current design (with res) has no performance penalty, but pasess @inferred.
There was a problem hiding this comment.
That's surprising. How about just adding ::BitVector in the function definition? That would make it clearer that the goal is to fix inference.
There was a problem hiding this comment.
I thought you would ask about it :).
The ::BitVector annotation could potentially allocate once more by converting Vector{Bool} to BitVector while what I do should guarantee only that only one allocation happens because broadcasting assignment does copyto! of Broadcasted into target.
Here is an example:
julia> f1(x::AbstractVector)::BitVector = .!ismissing.(x)
f1 (generic function with 1 method)
julia> f2(x::AbstractVector) = (res = BitVector(undef, length(x)); res .= .!ismissing.(x); return x)
f2 (generic function with 1 method)
julia> using SparseArrays
julia> x = sparse(1:10^7);
julia> using BenchmarkTools
julia> f1(x::AbstractVector)::BitVector = .!ismissing.(x)
f1 (generic function with 1 method)
julia> @btime f1($x);
568.808 ms (7 allocations: 87.02 MiB)
julia> @btime f2($x);
525.133 ms (4 allocations: 1.20 MiB)
|
@nalimilan - any more comments on this? |
|
@pstorozenko - please review the PR (as it was updated by me). If you can think of something to change (e.g. some more tests of corner caes) please do add them. Otherwise let me know we are done and I think we are good to merge this. |
|
Thank you! |
Optimization to two
completecasesmethods by only processing missingable columns.Discussed in #2724