Run `findall(rows)` only if `rows` are not all true by pstorozenko · Pull Request #2727 · JuliaData/DataFrames.jl

pstorozenko · 2021-04-18T11:40:47Z

PR that resulted from #2724 disussion.
Addition of _findall(rows::AbstractVector{Bool})
to not run findall(rows) if all(rows .== true) in
function SubDataFrame(parent::DataFrame, rows::AbstractVector{Bool}, cols)

I'm not sure if src/other/utils.jl is a good location for this function.

Benchmarks:

df = DataFrame(
    a = rand(1000000),
    b = allowmissing(rand(1000000)),
    c = [allowmissing(rand(1000000 - 1)); missing],
    d = repeat([1:9; missing], 100000)
)

Main branch

 julia> @btime dropmissing(df, :a, view=true);
  2.040 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :b, view=true);
  2.071 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :c, view=true);
  2.098 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :d, view=true);
  1.911 ms (15 allocations: 6.99 MiB)

julia> @btime dropmissing(df, view=true);
  7.677 ms (37 allocations: 7.00 MiB)

Main + this fix

julia> @btime dropmissing(df, :a, view=true);
  388.342 μs (13 allocations: 126.64 KiB)

julia> @btime dropmissing(df, :b, view=true);
  399.634 μs (13 allocations: 126.64 KiB)

julia> @btime dropmissing(df, :c, view=true);
  2.103 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :d, view=true);
  1.897 ms (15 allocations: 6.99 MiB)

julia> @btime dropmissing(df, view=true);
  7.679 ms (37 allocations: 7.00 MiB)

After completecases fix #2726

julia> @btime dropmissing(df, :a, view=true);
  1.883 ms (10 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :b, view=true);
  2.075 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :c, view=true);
  2.095 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :d, view=true);
  1.905 ms (15 allocations: 6.99 MiB)

julia> @btime dropmissing(df, view=true);
  2.351 ms (28 allocations: 7.12 MiB)

After completecases fix #2726 + this fix

julia> @btime dropmissing(df, :a, view=true);
  251.012 μs (8 allocations: 122.38 KiB)

julia> @btime dropmissing(df, :b, view=true);
  399.373 μs (13 allocations: 126.64 KiB)

julia> @btime dropmissing(df, :c, view=true);
  2.109 ms (15 allocations: 7.75 MiB)

julia> @btime dropmissing(df, :d, view=true);
  1.903 ms (15 allocations: 6.99 MiB)

julia> @btime dropmissing(df, view=true);
  2.355 ms (28 allocations: 7.12 MiB)

Summary:

Optimize completecases to process only missingable columns #2726 Reduces time of dropmissing(df, view=true)
This fix reduces average time for dropmissing(df, :a, view=true). In the worst case scenario (as with :c column) where there is only one missing in the end it is a bit slower, but I think that it is very rarely case. Usually we'll find missings/falses earlier.

Addition of _findall(rows::AbstractVector{Bool}) to not run `findall(rows)` if `all(rows .== true)`

src/other/utils.jl

Tests for this function added.

pstorozenko · 2021-05-09T19:50:31Z

I think this is more what you were talking about :)
I took Base.findall(B::BitArray) function and rewrote/adapted it for our needs.

In the first while true we process B in the optimized way.
After it we either returned solution or materialized result I.

In the second while true we process B in almost the same way as in Base.findall(B::BitArray).

I've tried to comment code a lot as operations are not obvious.
I've used one line ifs like Bi == length(Bc) && return start:length(B) because it makes code much cleaner, but I don't know if you like it.
Tell me what do you think about this whole approach.

In general this solution is sometimes a bit slower in case of random B.
This is because Base.findall(B::BitArray) does not try to jump over blocks with trues only.
In case of bigger blocks of trues it is faster.
Of course in case of single, continuous block of trues it is much, much faster.

Bellow I attach some benchmarks:

BD = OrderedDict(
    "T Big" => trues(100000),
    "F Big" => falses(100000),
    "T64 F64" => [trues(64); falses(64)],
    "F64 T64" => [falses(64); trues(64)],
    "F80 T100" => [falses(85); trues(100)],
    "F256 T32" => [falses(256); trues(32)],
    "F260 T32" => [falses(260); trues(32)],
    "TF Big" => [trues(100000); falses(100000)],
    "FT Big" => [falses(100000);trues(100000)] ,

    # some edge cases
    "TFT small" => [trues(85); falses(100); trues(85)],
    "FTFFT small" => [falses(64 + 32); trues(32); falses(128); trues(32)],
    "TFTF small" => [falses(64); trues(64); falses(64); trues(64)],
    "TFT small" => [trues(64); falses(10); trues(100)],

    "FTF Big" => [falses(8500); trues(100000); falses(65000)],
    "TFT Big" => [trues(8500); falses(100000); trues(65000)],
    "FTFTFTF Big" => [falses(65000); trues(65000); falses(65000); trues(65000); falses(65000); trues(65000); falses(65000)],

    "FTFR small" => [falses(85); trues(100); falses(65); rand([true, false], 20)],
    "R Big" => BitVector(rand([true, false], 200000)),
    "RF Big" => [BitVector(rand([true, false], 100000)) ; falses(100000)],
    "RT Big" => [BitVector(rand([true, false], 100000)) ; trues(100000)],
    "FTFR Big" => [falses(65000);  trues(65000);  falses(65000); rand([true, false], 20000)],
    "T256 R100" => [trues(256);  rand([true, false], 100)],
    "F256 R100" => [falses(256); rand([true, false], 100)],
)

for (l, B) in BD
    println("Processing $l")
    @show Base.findall(B) == DataFrames._findall(B)
    @btime Base.findall($B)
    @btime DataFrames._findall($B)
end

Processing T Big
Base.findall(B) == DataFrames._findall(B) = true
  109.098 μs (2 allocations: 781.33 KiB)
  398.537 ns (0 allocations: 0 bytes)
Processing F Big
Base.findall(B) == DataFrames._findall(B) = true
  430.095 ns (1 allocation: 80 bytes)
  421.095 ns (1 allocation: 80 bytes)
Processing T64 F64
Base.findall(B) == DataFrames._findall(B) = true
  104.586 ns (1 allocation: 624 bytes)
  41.959 ns (1 allocation: 80 bytes)
Processing F64 T64
Base.findall(B) == DataFrames._findall(B) = true
  104.734 ns (1 allocation: 624 bytes)
  41.498 ns (1 allocation: 80 bytes)
Processing F80 T100
Base.findall(B) == DataFrames._findall(B) = true
  143.548 ns (1 allocation: 896 bytes)
  46.941 ns (1 allocation: 80 bytes)
Processing F256 T32
Base.findall(B) == DataFrames._findall(B) = true
  73.466 ns (1 allocation: 336 bytes)
  45.835 ns (1 allocation: 80 bytes)
Processing F260 T32
Base.findall(B) == DataFrames._findall(B) = true
  73.387 ns (1 allocation: 336 bytes)
  45.847 ns (1 allocation: 80 bytes)
Processing TF Big
Base.findall(B) == DataFrames._findall(B) = true
  105.403 μs (2 allocations: 781.33 KiB)
  1.467 μs (1 allocation: 80 bytes)
Processing FT Big
Base.findall(B) == DataFrames._findall(B) = true
  105.608 μs (2 allocations: 781.33 KiB)
  1.987 μs (1 allocation: 80 bytes)
Processing TFT small
Base.findall(B) == DataFrames._findall(B) = true
  217.728 ns (1 allocation: 1.45 KiB)
  188.086 ns (2 allocations: 1.41 KiB)
Processing FTFFT small
Base.findall(B) == DataFrames._findall(B) = true
  106.897 ns (1 allocation: 624 bytes)
  112.246 ns (2 allocations: 624 bytes)
Processing TFTF small
Base.findall(B) == DataFrames._findall(B) = true
  215.025 ns (1 allocation: 1.14 KiB)
  160.212 ns (2 allocations: 1.14 KiB)
Processing FTF Big
Base.findall(B) == DataFrames._findall(B) = true
  127.988 μs (2 allocations: 781.33 KiB)
  1.767 μs (1 allocation: 80 bytes)
Processing TFT Big
Base.findall(B) == DataFrames._findall(B) = true
  95.470 μs (2 allocations: 574.33 KiB)
  66.360 μs (2 allocations: 574.39 KiB)
Processing FTFTFTF Big
Base.findall(B) == DataFrames._findall(B) = true
  253.018 μs (2 allocations: 1.49 MiB)
  194.054 μs (2 allocations: 1.49 MiB)
Processing FTFR small
Base.findall(B) == DataFrames._findall(B) = true
  189.834 ns (1 allocation: 1008 bytes)
  153.568 ns (2 allocations: 976 bytes)
Processing R Big
Base.findall(B) == DataFrames._findall(B) = true
  153.825 μs (2 allocations: 778.89 KiB)
  119.013 μs (2 allocations: 778.95 KiB)
Processing RF Big
Base.findall(B) == DataFrames._findall(B) = true
  78.821 μs (2 allocations: 389.64 KiB)
  59.672 μs (2 allocations: 389.70 KiB)
Processing RT Big
Base.findall(B) == DataFrames._findall(B) = true
  204.643 μs (2 allocations: 1.15 MiB)
  165.110 μs (2 allocations: 1.15 MiB)
Processing FTFR Big
Base.findall(B) == DataFrames._findall(B) = true
  100.556 μs (2 allocations: 586.64 KiB)
  72.579 μs (2 allocations: 586.70 KiB)
Processing T256 R100
Base.findall(B) == DataFrames._findall(B) = true
  659.301 ns (1 allocation: 2.56 KiB)
  519.550 ns (2 allocations: 2.58 KiB)
Processing F256 R100
Base.findall(B) == DataFrames._findall(B) = true
  120.717 ns (1 allocation: 544 bytes)
  131.407 ns (2 allocations: 528 bytes)

test/utils.jl

Random tests made bigger

bkamins · 2021-05-09T20:57:16Z

Could you please also resolve merge conflicts so that we can run the CI? Thank you!

With small number of random true/false generated, there is probability of getting `UnitRange{Int}` insted of `Vector{Int}`

pstorozenko · 2021-05-10T08:33:02Z

B.chunks is always a Vector{UInt64} in both 32bit and 64bit julia

pstorozenko · 2021-05-10T09:21:00Z

I don't know how it works but I've checked coverage of line 223 and it is covered in multiple tests with test T64 F64 being the prime example.

bkamins · 2021-05-10T10:55:08Z

Yes - I have checked this also :). It seems that probably the compiler somehow optimizes these lines out (not sure though).

src/other/utils.jl

bkamins · 2021-05-10T18:33:44Z

src/subdataframe/subdataframe.jl

                            "(got $(length(rows)), expected $(nrow(parent)))"))
    end
-    return SubDataFrame(parent, findall(rows), cols)
+    return SubDataFrame(parent, _findall(rows), cols)


Can you also test performance and corrected of view (it should be OK, but just to be sure as this is the point of the change we make so I prefer to double check). Thank you!

Sure, I'll check it with some more scenario than in the beginning of this PR

Fortunately it is faster :)

df = DataFrame( a = rand(1000000), b = allowmissing(rand(1000000)), c = [allowmissing(rand(1000000 - 1)); missing], d = repeat([1:9; missing], 100000), e = repeat([rand(50000); missings(50000)], 10), f = [missings(300000); ones(400000); missings(300000)], g = missings(1000000) )

Main:

julia> @btime dropmissing(df, :a, view=true); 1.872 ms (9 allocations: 7.75 MiB) julia> @btime dropmissing(df, :b, view=true); 2.075 ms (14 allocations: 7.75 MiB) julia> @btime dropmissing(df, :c, view=true); 2.111 ms (14 allocations: 7.75 MiB) julia> @btime dropmissing(df, :d, view=true); 1.913 ms (14 allocations: 6.99 MiB) julia> @btime dropmissing(df, :e, view=true); 1.117 ms (14 allocations: 3.94 MiB) julia> @btime dropmissing(df, :f, view=true); 909.713 μs (14 allocations: 3.18 MiB) julia> @btime dropmissing(df, :g, view=true); 152.216 μs (13 allocations: 126.67 KiB) julia> @btime dropmissing(df, view=true); 1.432 ms (38 allocations: 1.63 MiB)

Main + this PR:

julia> @btime dropmissing(df, :a, view=true); 17.093 μs (7 allocations: 122.34 KiB) julia> @btime dropmissing(df, :b, view=true); 162.656 μs (12 allocations: 126.59 KiB) julia> @btime dropmissing(df, :c, view=true); 168.988 μs (12 allocations: 126.59 KiB) julia> @btime dropmissing(df, :d, view=true); 1.890 ms (14 allocations: 6.99 MiB) julia> @btime dropmissing(df, :e, view=true); 1.100 ms (14 allocations: 3.94 MiB) julia> @btime dropmissing(df, :f, view=true); 166.613 μs (12 allocations: 126.59 KiB) julia> @btime dropmissing(df, :g, view=true); 188.653 μs (13 allocations: 126.67 KiB) julia> @btime dropmissing(df, view=true); 1.410 ms (38 allocations: 1.63 MiB)

src/other/utils.jl

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

test/utils.jl

bkamins · 2021-05-10T20:12:56Z

@nalimilan - after all my comments are resolved and CI passes could you please have a look (I have checked the logic of the whole code in detail and we have good test coverage so it is rather safe, but you always have a keen eye for problems). I am OK to merge the PR.

@pstorozenko - a very good PR

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

test/utils.jl

pstorozenko · 2021-05-11T16:08:50Z

Thanks!

src/other/utils.jl

nalimilan

Thanks! I haven't checked the logic as the body of the function is scary. Can you just fix the uncovered line?

src/other/utils.jl

pstorozenko

Missclick

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

pstorozenko · 2021-05-14T12:43:33Z

@nalimilan What do you mean by fix the uncovered line?
Do you mean code coverage? I'm sure all lines (except for @assert false "should not be reached") are covered by tests (I tested with good old @show), but for some reason codecov is showing differently.

bkamins · 2021-05-14T12:45:05Z

but for some reason codecov is showing differently.

I have also checked the logic that the important lines are covered.

nalimilan

Thanks!

bkamins · 2021-05-16T17:08:23Z

Thank you!

Run findall(rows) only if rows are not .==true

f0a5ea7

Addition of _findall(rows::AbstractVector{Bool}) to not run `findall(rows)` if `all(rows .== true)`

bkamins reviewed Apr 18, 2021

View reviewed changes

src/other/utils.jl Outdated Show resolved Hide resolved

bkamins added the performance label Apr 18, 2021

bkamins added this to the 1.x milestone Apr 18, 2021

Custom, optimized _findall(B::BitVector)

92f943e

Tests for this function added.

Test for type added

577dc85

bkamins reviewed May 9, 2021

View reviewed changes

test/utils.jl Outdated Show resolved Hide resolved

Some more random tests added

704e2c7

Random tests made bigger

pstorozenko and others added 4 commits May 9, 2021 23:00

Merge branch 'main' into ps/sdf_continuous

01ba69c

Added empty test and some tweeks to random tests

e3ea0be

With small number of random true/false generated, there is probability of getting `UnitRange{Int}` insted of `Vector{Int}`

B.chunks is always Vector{UInt64} even on x86

5c60e2c

Ret Vector{Int32} on x86 and Vector{Int64} on x64

422eb0b