Optimize `completecases` to process only missingable columns by pstorozenko · Pull Request #2726 · JuliaData/DataFrames.jl

pstorozenko · 2021-04-17T18:15:42Z

Optimizes two `completecases` methods by processing only missingable columns.
Discussed in #2724.


bkamins · 2021-04-17T18:33:40Z

Looks good. The comments are minor.
Can you please also post some benchmarks and double-check that the tests properly cover all possible cases?


pstorozenko · 2021-04-17T22:29:36Z

While benchmarking, I've found a much bigger problem.

DataFrames.jl/src/abstractdataframe/abstractdataframe.jl

Line 765 in a671a3b

res .&= .!ismissing.(df[!, i])

I suspect that this line is not aware of the type of df[!, i], so .!ismissing.(df[!, i]) is slow.

A few benchmarks:

df = DataFrame(
    a = rand(1000000), b = rand(Int, 1000000),
    c = allowmissing(rand(1000000)), d = repeat([1:9; missing], 100000)
)

main branch timings:

julia> @btime completecases(df);
  5.570 ms (32 allocations: 139.50 KiB)
julia> @btime completecases(df, :a);
  147.885 μs (9 allocations: 126.52 KiB)
julia> @btime completecases(df, :b);
  147.793 μs (9 allocations: 126.52 KiB)
julia> @btime completecases(df, :c);
  158.253 μs (9 allocations: 126.52 KiB)
julia> @btime completecases(df, :d);
  158.959 μs (9 allocations: 126.52 KiB)

Timings for

res = trues(size(df, 1))
for i in 1:size(df, 2)
    v = df[!, i]
    if Missing <: eltype(v)
        res .&= .!ismissing.(v)
    end
end

and

function completecases(df::AbstractDataFrame, col::ColumnIndex)
    v = df[!, col]
    if Missing <: eltype(v)
        return .!ismissing.(v)
    else
        return trues(size(df, 1))
    end
end

julia> @btime completecases(df);
  3.148 ms (18 allocations: 130.88 KiB)
julia> @btime completecases(df, :a);
  12.735 μs (4 allocations: 122.25 KiB)
julia> @btime completecases(df, :b);
  12.269 μs (4 allocations: 122.25 KiB)
julia> @btime completecases(df, :c);
  158.393 μs (9 allocations: 126.52 KiB)
julia> @btime completecases(df, :d);
  159.116 μs (9 allocations: 126.52 KiB)
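The check `Missing <: eltype(v)` used above looks only at the column's element type, so it costs nothing per row. Below is a minimal sketch of that idea on plain vectors, without DataFrames.jl; the helper names `maycontainmissing` and `notmissing_mask` are hypothetical and not part of the PR.

```julia
# Hypothetical helper: true when the column's element type admits `missing`.
# This is a type-level check; it never inspects the stored values.
maycontainmissing(v::AbstractVector) = Missing <: eltype(v)

# Hypothetical single-column mask that skips the scan when it cannot matter.
function notmissing_mask(v::AbstractVector)
    maycontainmissing(v) || return trues(length(v))
    return .!ismissing.(v)
end

plain   = [1.0, 2.0, 3.0]                         # Vector{Float64}
widened = Union{Missing, Float64}[1.0, 2.0, 3.0]  # allows missing, holds none
holey   = [1.0, missing, 3.0]                     # Vector{Union{Missing, Float64}}

@show maycontainmissing(plain) maycontainmissing(widened)
@show notmissing_mask(holey)
```

Note that a column such as `widened` is still scanned even though it holds no `missing` values, which matches the unchanged timings for `:c` above.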

Timing for

for i in 1:size(df, 2)
    v = df[!, i]
    if Missing <: eltype(v)
        res .&= aux(v)
    end
end

and

aux(v) = .!ismissing.(v)

julia> @btime completecases(df);
  378.161 μs (16 allocations: 375.22 KiB)

To sum up, processing only missingable columns brought the expected performance gain.
More importantly, .!ismissing.(v) appears not to be aware of the type of v and is slow.
Adding a simple function barrier saves a lot of time, at the cost of some extra allocated memory.
My suspicion may be wrong, but there is a problem nevertheless.
Do you have any ideas on how to remove these allocations?

bkamins · 2021-04-18T07:17:23Z

Good catch. However, I think it is not related to the function barrier. The core of the problem is the following:

julia> function f(x)
       r = trues(length(x))
       r .&= .!ismissing.(x)
       return r
       end
f (generic function with 1 method)

julia> function g(x)
       r = trues(length(x))
       y = .!ismissing.(x)
       r .&= y
       return r
       end
g (generic function with 1 method)

julia> x = rand([1, missing], 10^6);

julia> @benchmark f($x)
BenchmarkTools.Trial:
  memory estimate:  126.42 KiB
  allocs estimate:  4
  --------------
  minimum time:     1.214 ms (0.00% GC)
  median time:      1.247 ms (0.00% GC)
  mean time:        1.339 ms (0.19% GC)
  maximum time:     5.126 ms (75.65% GC)
  --------------
  samples:          3722
  evals/sample:     1

julia> @benchmark g($x)
BenchmarkTools.Trial:
  memory estimate:  248.66 KiB
  allocs estimate:  7
  --------------
  minimum time:     136.100 μs (0.00% GC)
  median time:      146.400 μs (0.00% GC)
  mean time:        166.911 μs (3.43% GC)
  maximum time:     8.322 ms (97.25% GC)
  --------------
  samples:          10000
  evals/sample:     1

The most likely reason is that we use a BitVector here, which is not handled optimally in such loops. We could switch to Vector{Bool}, but I think it would be slower, so it is better to use your approach of splitting the operation.

However, it would then be better to allocate the aux vector once and update it in place (so we allocate both res and aux only once).

@mbauman - the problem with broadcast fusion performance that we have here probably cannot be helped and we have to resolve it manually - right?
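The suggestion above (split the fused broadcast into two steps and reuse one preallocated `aux` buffer) can be sketched on plain vectors as follows. This is an illustration under assumptions, not the code merged in the PR: the function name `completecases_sketch` and its varargs signature are hypothetical.

```julia
# Hypothetical sketch: row mask over several columns given as plain vectors.
function completecases_sketch(cols::AbstractVector...)
    isempty(cols) && throw(ArgumentError("at least one column is required"))
    n = length(first(cols))
    res = trues(n)              # final mask, allocated once
    aux = BitVector(undef, n)   # scratch buffer, allocated once and reused
    for v in cols
        length(v) == n || throw(DimensionMismatch("columns differ in length"))
        if Missing <: eltype(v)      # skip columns that cannot hold `missing`
            aux .= .!ismissing.(v)   # step 1: materialize the per-column mask
            res .&= aux              # step 2: BitVector-to-BitVector update
        end
    end
    return res
end

a = [1.0, 2.0, 3.0, 4.0]
c = [1.0, missing, 3.0, 4.0]
d = [1, 2, missing, 4]
mask = completecases_sketch(a, c, d)
@show mask
```

The design choice is the one discussed in the thread: two allocations in total, regardless of how many missingable columns are processed.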


pstorozenko · 2021-04-18T10:02:00Z

As you suggested, I've added a preallocated aux vector.
Same benchmarks as yesterday.
The first is for res .&= .!ismissing.(v), the second for aux .= .!ismissing.(v); res .&= aux.

julia> @btime completecases(df); 
  3.154 ms (18 allocations: 130.88 KiB)
  
julia> @btime completecases(df);
  375.694 μs (18 allocations: 253.00 KiB)

bkamins · 2021-04-18T19:23:55Z

Great. Looks good. Could you please add tests to make sure we cover all possible scenarios properly? Thank you!


bkamins · 2021-05-02T12:55:10Z

@pstorozenko - do you plan to work on this? (no rush, but usually stalled PRs get outdated and it is hard to get back to them after some time). Thank you!

pstorozenko · 2021-05-02T21:02:51Z

Yes, and you're right, thanks for pinging. I had to think for a while about what needs testing.
I added tests for the cases I could think of.


bkamins · 2021-05-02T21:48:37Z

Looks good. I have edited the tests a bit.

pstorozenko · 2021-05-02T21:52:30Z

Sure thing. I tried to mimic the design of the rest of tests, but a little refactoring is always good.
Thanks!

bkamins · 2021-05-02T21:59:01Z

> I tried to mimic the design of the rest of tests

I noticed, but these tests were written a very long time ago, so I thought I would clean them up a bit.


bkamins · 2021-05-03T08:06:43Z

I pushed one additional fix that cleans up inference issues.

nalimilan · 2021-05-03T08:08:05Z

src/abstractdataframe/abstractdataframe.jl

+        res = BitVector(undef, size(df, 1))
+        res .= .!ismissing.(v)
+        return res

Why not just this?

Suggested change

-        res = BitVector(undef, size(df, 1))
-        res .= .!ismissing.(v)
-        return res
+        return .!ismissing.(v)

bkamins · 2021-05-03

Because it is not type stable. I have just reversed this. The current design (with res) has no performance penalty, but passes @inferred.

nalimilan · 2021-05-03

That's surprising. How about just adding ::BitVector in the function definition? That would make it clearer that the goal is to fix inference.

bkamins · 2021-05-03

I thought you would ask about it :).

The ::BitVector annotation could potentially allocate once more by converting Vector{Bool} to BitVector, while what I do should guarantee that only one allocation happens, because broadcasting assignment does copyto! of Broadcasted into the target.

Here is an example:

julia> f1(x::AbstractVector)::BitVector = .!ismissing.(x)
f1 (generic function with 1 method)

julia> f2(x::AbstractVector) = (res = BitVector(undef, length(x)); res .= .!ismissing.(x); return x)
f2 (generic function with 1 method)

julia> using SparseArrays

julia> x = sparse(1:10^7);

julia> using BenchmarkTools

julia> f1(x::AbstractVector)::BitVector = .!ismissing.(x)
f1 (generic function with 1 method)

julia> @btime f1($x);
  568.808 ms (7 allocations: 87.02 MiB)

julia> @btime f2($x);
  525.133 ms (4 allocations: 1.20 MiB)
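The two variants compared in this thread can be sketched side by side as below. This is a small illustration under assumptions rather than the merged code: the names `notmissing_annotated` and `notmissing_prealloc` are hypothetical, the input is a tiny sparse vector instead of the benchmark's `sparse(1:10^7)`, and the second function returns the mask `res`.

```julia
using SparseArrays

# Variant 1: broadcast first, then let the return-type annotation convert.
# The intermediate array type is whatever broadcasting picks for the input.
notmissing_annotated(x::AbstractVector)::BitVector = .!ismissing.(x)

# Variant 2: allocate the BitVector up front and broadcast into it in place,
# so no intermediate array needs to be materialized.
function notmissing_prealloc(x::AbstractVector)
    res = BitVector(undef, length(x))
    res .= .!ismissing.(x)
    return res
end

x = sparse([1, 0, 3, 0])
@show typeof(.!ismissing.(x))   # the intermediate type variant 1 converts from
a = notmissing_annotated(x)
b = notmissing_prealloc(x)
@show typeof(a) typeof(b)
```

Both variants produce the same mask; they differ in whether an intermediate array is built and then converted, which is where the extra allocation in the benchmark comes from.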


bkamins · 2021-05-05T14:51:45Z

@nalimilan - any more comments on this?

bkamins · 2021-05-05T20:45:39Z

@pstorozenko - please review the PR (as it was updated by me). If you can think of something to change (e.g. some more tests of corner cases), please do add them. Otherwise, let me know we are done; I think we are good to merge this.

bkamins · 2021-05-07T06:41:03Z

Thank you!

Commit e642415: completecases process only missingable cols

pstorozenko mentioned this pull request Apr 17, 2021: Matchmissing == :notequal #2724 (Merged)

bkamins reviewed Apr 17, 2021: src/abstractdataframe/abstractdataframe.jl

bkamins reviewed Apr 17, 2021: src/abstractdataframe/abstractdataframe.jl

bkamins reviewed Apr 17, 2021: src/abstractdataframe/abstractdataframe.jl

Commit c4e07f7: Split oprations in completecases
`df[!, i]` extracted to variable `res .= .!ismissing.(v)` splited for performance into two lines

pstorozenko mentioned this pull request Apr 18, 2021: Run findall(rows) only if rows are not all true #2727 (Merged)

bkamins added the performance label Apr 18, 2021

bkamins added this to the 1.x milestone Apr 18, 2021

nalimilan reviewed Apr 20, 2021: src/abstractdataframe/abstractdataframe.jl

pstorozenko added 2 commits May 2, 2021 22:54

Commit 79e34df: Some tests for completecases added

Commit 7bb1ee5: Comment added

bkamins reviewed May 2, 2021: test/data.jl

Commit 0cbf724: Update test/data.jl

bkamins approved these changes May 2, 2021

bkamins reviewed May 3, 2021: test/data.jl

bkamins reviewed May 3, 2021: src/abstractdataframe/abstractdataframe.jl

Commit 1f2fb7f: Apply suggestions from code review

nalimilan reviewed May 3, 2021

bkamins reviewed May 5, 2021: test/data.jl

Commit 920bf79: Update test/data.jl

nalimilan approved these changes May 5, 2021

bkamins approved these changes May 7, 2021

bkamins merged commit bcaa2e5 into JuliaData:main May 7, 2021

bkamins mentioned this pull request May 20, 2021: Fast row aggregation in DataFrames.jl #2768 (Closed)

pstorozenko deleted the ps/ccases_onlymissing branch May 25, 2021 14:08

3 participants
