Run findall(rows) only if rows are not all true#2727
Run findall(rows) only if rows are not all true#2727bkamins merged 17 commits intoJuliaData:mainfrom
findall(rows) only if rows are not all true#2727Conversation
Addition of _findall(rows::AbstractVector{Bool})
to not run `findall(rows)` if `all(rows .== true)`
Tests for this function added.
|
I think this is more what you were talking about :) In the first In the second I've tried to comment code a lot as operations are not obvious. In general this solution is sometimes a bit slower in case of random Bellow I attach some benchmarks: BD = OrderedDict(
"T Big" => trues(100000),
"F Big" => falses(100000),
"T64 F64" => [trues(64); falses(64)],
"F64 T64" => [falses(64); trues(64)],
"F80 T100" => [falses(85); trues(100)],
"F256 T32" => [falses(256); trues(32)],
"F260 T32" => [falses(260); trues(32)],
"TF Big" => [trues(100000); falses(100000)],
"FT Big" => [falses(100000);trues(100000)] ,
# some edge cases
"TFT small" => [trues(85); falses(100); trues(85)],
"FTFFT small" => [falses(64 + 32); trues(32); falses(128); trues(32)],
"TFTF small" => [falses(64); trues(64); falses(64); trues(64)],
"TFT small" => [trues(64); falses(10); trues(100)],
"FTF Big" => [falses(8500); trues(100000); falses(65000)],
"TFT Big" => [trues(8500); falses(100000); trues(65000)],
"FTFTFTF Big" => [falses(65000); trues(65000); falses(65000); trues(65000); falses(65000); trues(65000); falses(65000)],
"FTFR small" => [falses(85); trues(100); falses(65); rand([true, false], 20)],
"R Big" => BitVector(rand([true, false], 200000)),
"RF Big" => [BitVector(rand([true, false], 100000)) ; falses(100000)],
"RT Big" => [BitVector(rand([true, false], 100000)) ; trues(100000)],
"FTFR Big" => [falses(65000); trues(65000); falses(65000); rand([true, false], 20000)],
"T256 R100" => [trues(256); rand([true, false], 100)],
"F256 R100" => [falses(256); rand([true, false], 100)],
)
for (l, B) in BD
println("Processing $l")
@show Base.findall(B) == DataFrames._findall(B)
@btime Base.findall($B)
@btime DataFrames._findall($B)
end |
Random tests made bigger
|
Could you please also resolve merge conflicts so that we can run the CI? Thank you! |
With small number of random true/false generated, there is probability
of getting `UnitRange{Int}` insted of `Vector{Int}`
|
|
|
I don't know how it works but I've checked coverage of line 223 and it is covered in multiple tests with test |
|
Yes - I have checked this also :). It seems that probably the compiler somehow optimizes these lines out (not sure though). |
| "(got $(length(rows)), expected $(nrow(parent)))")) | ||
| end | ||
| return SubDataFrame(parent, findall(rows), cols) | ||
| return SubDataFrame(parent, _findall(rows), cols) |
There was a problem hiding this comment.
Can you also test performance and corrected of view (it should be OK, but just to be sure as this is the point of the change we make so I prefer to double check). Thank you!
There was a problem hiding this comment.
Sure, I'll check it with some more scenario than in the beginning of this PR
There was a problem hiding this comment.
Fortunately it is faster :)
df = DataFrame(
a = rand(1000000),
b = allowmissing(rand(1000000)),
c = [allowmissing(rand(1000000 - 1)); missing],
d = repeat([1:9; missing], 100000),
e = repeat([rand(50000); missings(50000)], 10),
f = [missings(300000); ones(400000); missings(300000)],
g = missings(1000000)
)
Main:
julia> @btime dropmissing(df, :a, view=true);
1.872 ms (9 allocations: 7.75 MiB)
julia> @btime dropmissing(df, :b, view=true);
2.075 ms (14 allocations: 7.75 MiB)
julia> @btime dropmissing(df, :c, view=true);
2.111 ms (14 allocations: 7.75 MiB)
julia> @btime dropmissing(df, :d, view=true);
1.913 ms (14 allocations: 6.99 MiB)
julia> @btime dropmissing(df, :e, view=true);
1.117 ms (14 allocations: 3.94 MiB)
julia> @btime dropmissing(df, :f, view=true);
909.713 μs (14 allocations: 3.18 MiB)
julia> @btime dropmissing(df, :g, view=true);
152.216 μs (13 allocations: 126.67 KiB)
julia> @btime dropmissing(df, view=true);
1.432 ms (38 allocations: 1.63 MiB)
Main + this PR:
julia> @btime dropmissing(df, :a, view=true);
17.093 μs (7 allocations: 122.34 KiB)
julia> @btime dropmissing(df, :b, view=true);
162.656 μs (12 allocations: 126.59 KiB)
julia> @btime dropmissing(df, :c, view=true);
168.988 μs (12 allocations: 126.59 KiB)
julia> @btime dropmissing(df, :d, view=true);
1.890 ms (14 allocations: 6.99 MiB)
julia> @btime dropmissing(df, :e, view=true);
1.100 ms (14 allocations: 3.94 MiB)
julia> @btime dropmissing(df, :f, view=true);
166.613 μs (12 allocations: 126.59 KiB)
julia> @btime dropmissing(df, :g, view=true);
188.653 μs (13 allocations: 126.67 KiB)
julia> @btime dropmissing(df, view=true);
1.410 ms (38 allocations: 1.63 MiB)
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
|
@nalimilan - after all my comments are resolved and CI passes could you please have a look (I have checked the logic of the whole code in detail and we have good test coverage so it is rather safe, but you always have a keen eye for problems). I am OK to merge the PR. @pstorozenko - a very good PR |
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
|
Thanks! |
nalimilan
left a comment
There was a problem hiding this comment.
Thanks! I haven't checked the logic as the body of the function is scary. Can you just fix the uncovered line?
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
|
@nalimilan What do you mean by fix the uncovered line? |
I have also checked the logic that the important lines are covered. |
|
Thank you! |
PR that resulted from #2724 disussion.
Addition of
_findall(rows::AbstractVector{Bool})to not run
findall(rows)ifall(rows .== true)infunction SubDataFrame(parent::DataFrame, rows::AbstractVector{Bool}, cols)I'm not sure if
src/other/utils.jlis a good location for this function.Benchmarks:
Main branch
Main + this fix
After
completecasesfix #2726After
completecasesfix #2726 + this fixSummary:
completecasesto process only missingable columns #2726 Reduces time ofdropmissing(df, view=true)dropmissing(df, :a, view=true). In the worst case scenario (as with:ccolumn) where there is only one missing in the end it is a bit slower, but I think that it is very rarely case. Usually we'll find missings/falses earlier.