Fix type instability in sort for few columns case and fix issorted bug by bkamins · Pull Request #2746 · JuliaData/DataFrames.jl

bkamins · 2021-05-02T19:44:14Z

Fixes #2745 by loop unrolling.

It is probably possible to do this in a smarter way using metaprogramming (but maybe not), so I am marking it as a draft.
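As a rough illustration of what "loop unrolling" could mean for a two-column sort, here is a minimal sketch. The function name `lt_unrolled2` is hypothetical and this is not the code from the PR; it only shows the idea of accessing each ordering/column pair at a fixed tuple position so that every call to `lt` sees concrete types.

```julia
using Base.Order: Ordering, Forward, Reverse, lt

# Hypothetical hand-unrolled row comparison for exactly two sort columns.
# Orderings and columns are read from fixed tuple positions, so the compiler
# knows the concrete types involved in every call to `lt`.
function lt_unrolled2(ords::Tuple{Ordering, Ordering},
                      cols::Tuple{AbstractVector, AbstractVector},
                      a::Integer, b::Integer)
    o1, o2 = ords
    c1, c2 = cols
    lt(o1, c1[a], c1[b]) && return true    # row a is smaller on column 1
    lt(o1, c1[b], c1[a]) && return false   # row b is smaller on column 1
    return lt(o2, c2[a], c2[b])            # tie on column 1: decide on column 2
end

x = [1, 1, 2]
y = [3.0, 5.0, 0.0]

# Rows 1 and 2 tie on x; with y in descending order, row 2 sorts first.
lt_unrolled2((Forward, Reverse), (x, y), 2, 1)
```

Writing one such method per column count by hand does not scale, which is presumably why metaprogramming or recursion comes up as an alternative.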

src/abstractdataframe/sort.jl

bkamins · 2021-05-02T22:15:40Z

We have a bad design of sorting in general:

julia> df = DataFrame(x=rand(10^7), y=rand(10^7));

julia> @time sort(df, [:x, :y]);
  5.751711 seconds (1.03 M allocations: 1021.159 MiB, 4.52% gc time)

julia> @time sort(df, [:x, order(:y, rev=true)]);
 18.962380 seconds (695.47 M allocations: 11.345 GiB, 5.67% gc time)

I will push an update (and then the code will be simplified with recursion)
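One plausible reading of the gap between the two timings above (an assumption based on the `DFPerm` signature change discussed later in this thread, not something stated here) is that mixing orderings in a `Vector` gives a container with an abstract element type, while a tuple keeps each element's concrete type. A small check using only `Base.Order`:

```julia
using Base.Order: Ordering, Forward, Reverse

same  = [Forward, Forward]   # both elements share one concrete type
mixed = [Forward, Reverse]   # two different Ordering subtypes

isconcretetype(eltype(same))    # true
isconcretetype(eltype(mixed))   # false: the element type is the abstract Ordering

mixed_tuple = (Forward, Reverse)
isconcretetype(typeof(mixed_tuple))  # true: each position keeps its own type
```

With an abstract element type, each `lt` call on an element of `mixed` has to be resolved at run time, which is consistent with the very large allocation count in the second timing.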

bkamins · 2021-05-02T22:44:37Z

@nalimilan - this should be good to have a look at. The only issue is that in general sorting is expensive to compile. I will have to think how to reduce this cost (though maybe it is not easy to do).

bkamins · 2021-05-03T08:58:46Z

Timings after this PR:

julia> using DataFrames, Random, Dates, BenchmarkTools

julia> df = DataFrame(x=rand(10^7), y=rand(10^7));

julia> @btime sort($df, [:x, :y]);
  5.501 s (1029745 allocations: 1021.14 MiB)

julia> @btime sort($df, [:x, order(:y, rev=true)]);
  5.426 s (1029758 allocations: 1021.14 MiB)

and

julia> function mwedates()
         #build the sample
         dts = reduce(vcat, [[Date(2011,11,11) + Day(i) for j in 1:10^4] for i in 1:100])
         mdts = dts |> Vector{Union{Date, Missing}}
         id = reduce(vcat, [[j for j in 1:10^4] for i in 1:100])
         df = DataFrame(date=dts, mdate = mdts, id=id)

         #shuffle
         df = df[randperm(10^6), :]

         print("sort date and id: ")
         @btime sort($df, [:date, :id])
         print("sort date(with missings) and id: ")
         @btime sort($df, [:mdate, :id])
         print("work around performance: ")
         @btime begin
           $df.mdateconverted = $df.mdate |> Vector{Date}
           sort($df, [:mdateconverted, :id])
         end
       end
mwedates (generic function with 1 method)

julia> mwedates()
sort date and id:   260.934 ms (64997 allocations: 92.79 MiB)
sort date(with missings) and id:   329.115 ms (64997 allocations: 92.79 MiB)
work around performance:   272.197 ms (65004 allocations: 108.05 MiB)

so all is OK (the penalty of missing has to be accepted I think as work-around uses knowledge of the data)

src/other/precompile.jl

clintonTE · 2021-05-03T19:27:36Z

so all is OK (the penalty of missing has to be accepted I think as work-around uses knowledge of the data)

Yeah, that little overhead is fantastic. Thank you!

nalimilan

Thanks! I wonder how this went unnoticed for so long.

Maybe one way to limit the compilation cost would be to sort column-wise rather than row-wise (starting with the last column)? Not sure whether that would be fast. Anyway that would require a deeper refactoring so this PR is useful even if we later change the approach.
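The column-wise idea can be sketched as follows: sort by the last key first, then by each earlier key with a stable algorithm, so that earlier keys take precedence while ties keep the order established by later keys. The helper `columnwise_sortperm` is hypothetical and works on plain vectors, not on a data frame; it says nothing about whether this approach would be faster.

```julia
# Hypothetical helper: compute a row permutation by sorting one column at a
# time, from the last sort key to the first, using a stable algorithm.
# Assumes at least one column is passed.
function columnwise_sortperm(cols::Tuple{Vararg{AbstractVector}})
    perm = collect(1:length(first(cols)))
    for col in reverse(cols)
        # Stable sort of the current permutation by this column's values.
        perm = perm[sortperm(col[perm]; alg=MergeSort)]
    end
    return perm
end

x = [2, 1, 2, 1]
y = [0.5, 0.9, 0.1, 0.3]

# Same result as a row-wise (lexicographic) sort on (x, y).
columnwise_sortperm((x, y)) == sortperm(collect(zip(x, y)))
```

Each pass only ever sorts a single homogeneous vector, so no per-combination comparison code has to be compiled.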

src/abstractdataframe/sort.jl

nalimilan · 2021-05-03T20:33:23Z

src/abstractdataframe/sort.jl

 #         sort the original (presumably larger) DataFrame

-struct DFPerm{O<:Union{Ordering, AbstractVector}, T<:Tuple} <: Ordering
+struct DFPerm{O<:Union{Ordering, Tuple{Vararg{Ordering}}},


Would it make sense to make O always a Tuple{Vararg{Ordering}}? That would avoid the need for ord isa Ordering below.

The problem is that if you do sort(df) you want a single Ordering that is reused to avoid excessive compilation. I would assume that ord isa Ordering check should be optimized out by the compiler so it should have no performance penalty.

I tried simplifying it but I always ended up with compiler allocating.

I don't get it. Why would wrapping the Ordering in a one-element tuple trigger more compilation or allocations?

Ah - a one-element Tuple is not a problem, but in this case we would still have to branch on whether the tuple has one element or matches the length of the column vector.

What allocates is a design with one tuple in which each element is itself a tuple holding an order and a column.

OK. Anyway always wrapping Ordering in a tuple sounds simpler conceptually.

I will change it if I can make it work fast.

What's the conclusion regarding this?

I decided to leave the union (as in the original design) as in the end the logic was simpler (through dispatch).
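To make the "union plus dispatch" design concrete, here is a minimal sketch. `RowOrder`, `ordering_at` and `row_lt` are hypothetical stand-ins, not the PR's `DFPerm` code; the loop is kept simple for readability and is not meant to be type stable for heterogeneous columns.

```julia
using Base.Order: Ordering, Forward, Reverse, lt

# Hypothetical stand-in for `DFPerm`: `ord` is either one shared Ordering
# or a tuple with one Ordering per column.
struct RowOrder{O<:Union{Ordering, Tuple{Vararg{Ordering}}}, T<:Tuple}
    ord::O
    cols::T
end

# Dispatch picks the ordering for column i; no `isa` branch is needed.
ordering_at(ord::Ordering, i::Integer) = ord
ordering_at(ord::Tuple, i::Integer) = ord[i]

function row_lt(o::RowOrder, a::Integer, b::Integer)
    for i in 1:length(o.cols)
        ord = ordering_at(o.ord, i)
        col = o.cols[i]
        lt(ord, col[a], col[b]) && return true
        lt(ord, col[b], col[a]) && return false
    end
    return false   # rows are equal on all columns
end

x = [1, 1, 2]
y = [3, 5, 0]

row_lt(RowOrder(Forward, (x, y)), 1, 2)              # one ordering reused for all columns
row_lt(RowOrder((Forward, Reverse), (x, y)), 2, 1)   # per-column orderings
```

With a single shared `Ordering`, the type of `ord` does not depend on the number of columns, which matches the stated goal of reusing one ordering for `sort(df)`.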

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2021-05-03T21:43:11Z

Not sure whether that would be fast.

This would be slower AFAICT, because now we are short-circuiting (so in typical cases only one comparison, on the first column, is needed).
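The short-circuiting argument can be checked with a small counter. The sketch below uses plain vectors and a hypothetical helper `count_second_column_checks`; it counts how often a row-wise comparison has to look at the second column at all.

```julia
# Hypothetical helper: sort row indices lexicographically on (x, y) and count
# how many comparisons had to fall through to the second column.
function count_second_column_checks(x::AbstractVector, y::AbstractVector)
    checks = Ref(0)
    function rowless(a, b)
        a == b && return false             # same row: nothing to compare
        isless(x[a], x[b]) && return true
        isless(x[b], x[a]) && return false
        checks[] += 1                      # tie on the first column
        return isless(y[a], y[b])
    end
    sort!(collect(eachindex(x)); lt=rowless)
    return checks[]
end

# No ties in the first column: the second column is never inspected.
count_second_column_checks(collect(1:1000), zeros(1000))

# All values tie in the first column: every comparison needs the second column.
count_second_column_checks(ones(1000), collect(1000:-1:1))
```

For a first column like `rand(10^7)` ties are rare, so a row-wise comparison usually stops after the first column, whereas a column-wise approach always performs a full pass per sort key.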

bkamins · 2021-05-03T23:12:53Z

We cannot use recursion, as it will break on very wide data frames. I will fix it tomorrow.
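For context, a recursive formulation over tuples could look like the sketch below (hypothetical `lt_rec`, not the PR's code). Each recursion level is specialized on a shorter tuple type, so a sort on N columns needs on the order of N specializations; that is presumably what becomes a problem for very wide data frames.

```julia
using Base.Order: Forward, Reverse, lt

# Base case: no columns left, so the rows are equal on all sort keys.
lt_rec(::Tuple{}, ::Tuple{}, a, b) = false

# Recursive case: compare on the first column, recurse on the rest on a tie.
function lt_rec(ords::Tuple, cols::Tuple, a, b)
    o, c = first(ords), first(cols)
    lt(o, c[a], c[b]) && return true
    lt(o, c[b], c[a]) && return false
    return lt_rec(Base.tail(ords), Base.tail(cols), a, b)
end

x = [1, 1, 2]
y = [3.0, 5.0, 0.0]

lt_rec((Forward, Reverse), (x, y), 2, 1)   # tie on x, y descending: row 2 first
```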

nalimilan · 2021-05-04T10:02:43Z

This would be slower AFAICT, because now we are short-circuiting (so in typical cases only one comparison, on the first column, is needed).

Good point. But I guess that depends on the types of sorted columns. When sorting on bitstype columns which can use optimized algorithms, I imagine that sorting one column at a time could be faster. I'm thinking about integer columns with small ranges which use counting sort in Base, floating point columns which use a special quick sort IIRC, or other types which could use radix sort (only via SortingAlgorithms currently). But we would have to be very careful to do that only for cases known to be fast anyway and that's tricky to get right.
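As an illustration of why a single integer column with a small range can be sorted much faster than with comparison sorting, here is a minimal counting sort sketch. It is a textbook version written for this example, not the implementation in Base, and it assumes every value lies in `lo:hi`.

```julia
# Hypothetical counting sort for integers known to lie in lo:hi.
# Runs in time proportional to length(v) + (hi - lo + 1), with no comparisons.
function counting_sort(v::AbstractVector{<:Integer}, lo::Integer, hi::Integer)
    counts = zeros(Int, hi - lo + 1)
    for x in v
        counts[x - lo + 1] += 1        # tally each value
    end
    out = similar(v)
    k = firstindex(out)
    for (offset, c) in enumerate(counts)
        for _ in 1:c
            out[k] = lo + offset - 1   # write each value as often as it occurred
            k += 1
        end
    end
    return out
end

counting_sort([3, 1, 2, 3, 1], 1, 3)
```

A row-wise comparison function cannot take advantage of this kind of algorithm, which is the trade-off being discussed.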

bkamins · 2021-05-04T13:50:47Z

When sorting on bitstype columns which can use optimized algorithms

I agree, but currently we have:

julia> using DataFrames

julia> using BenchmarkTools

julia> df = DataFrame(rand(1:10, 10^6, 2), :auto);

julia> @btime sort($df, :x1);
  82.997 ms (239220 allocations: 182.06 MiB)

julia> @btime sort($df, [:x1, :x2]);
  126.545 ms (239142 allocations: 170.64 MiB)

julia> @btime sort($df.x1);
  2.694 ms (3 allocations: 7.63 MiB)

julia> t = collect(zip(df.x1, df.x2)); @btime sort($t);
  65.424 ms (4 allocations: 22.89 MiB)

so I think that the first step should be to make sorting on a single column fast (which we will do at some point; in general, sorting and reshaping are things to work on in the near future, as these are the areas where nothing new has been added for a long time). But for this PR I would concentrate the design on fixing type instability issues.

bkamins · 2021-05-04T17:29:33Z

Here are the benchmarks after the recursion fix.
The conclusion: in general we are faster, unless someone sorts a very wide table on all columns (or, more generally, sorts on very many columns). In that case we are slower, although I use essentially the same code as before, so it seems the compiler is not yet able to optimize this case well. I could fix it, but I think it is not worth it, as sorting on very many columns is not very useful anyway.

Additionally I have discovered a bug in issorted that is fixed now.

this PR

julia> using DataFrames, Random, StatsBase, Dates, BenchmarkTools

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> df = DataFrame(x=rand(10^6), y=rand(10^6));

julia> @time sort(df, [:x, :y]);
  1.554144 seconds (3.50 M allocations: 274.667 MiB, 3.56% gc time, 80.51% compilation time)

julia> @btime sort($df, [:x, :y]);
  275.927 ms (64998 allocations: 84.21 MiB)

julia> @time sort(df, [:x, order(:y, rev=true)]);
  0.823464 seconds (769.23 k allocations: 123.305 MiB, 1.69% gc time, 65.93% compilation time)

julia> @btime sort($df, [:x, order(:y, rev=true)]);
  279.058 ms (65005 allocations: 84.21 MiB)

julia> function mwedates()
                #build the sample
                dts = reduce(vcat, [[Date(2011,11,11) + Day(i) for j in 1:10^4] for i in 1:100])
                mdts = dts |> Vector{Union{Date, Missing}}
                id = reduce(vcat, [[j for j in 1:10^4] for i in 1:100])
                df = DataFrame(date=dts, mdate = mdts, id=id)

                #shuffle
                df = df[randperm(10^6), :]

                print("sort date and id: ")
                @btime sort($df, [:date, :id])
                print("sort date(with missings) and id: ")
                @btime sort($df, [:mdate, :id])
                print("work around performance: ")
                @btime begin
                  $df.mdateconverted = $df.mdate |> Vector{Date}
                  sort($df, [:mdateconverted, :id])
                end
              end
mwedates (generic function with 1 method)

julia> mwedates();
sort date and id:   276.888 ms (64997 allocations: 92.79 MiB)
sort date(with missings) and id:   345.657 ms (64997 allocations: 92.79 MiB)
work around performance:   279.926 ms (65005 allocations: 108.05 MiB)

julia> df = DataFrame(ones(10,1000), :auto);

julia> @time sort(df);
  0.603286 seconds (156.87 k allocations: 9.166 MiB, 99.74% compilation time)

julia> @btime sort(df);
  364.711 μs (6400 allocations: 322.86 KiB)

julia> df = DataFrame(ones(Int, 10000, 100), :auto);

julia> @time sort(df);
  0.502271 seconds (403.55 k allocations: 29.421 MiB, 1.04% gc time, 95.94% compilation time)

julia> @btime sort(df);
  15.511 ms (326 allocations: 7.72 MiB)

julia> df = DataFrame(ones(Bool, 100000, 15), :auto);

julia> @time sort(df);
  0.607865 seconds (735.55 k allocations: 43.240 MiB, 99.39% compilation time)

julia> @btime sort(df);
  3.011 ms (71 allocations: 2.20 MiB)

current main

julia> using DataFrames, Random, StatsBase, Dates, BenchmarkTools

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> df = DataFrame(x=rand(10^6), y=rand(10^6));

julia> @time sort(df, [:x, :y]);
  1.624158 seconds (3.52 M allocations: 276.283 MiB, 7.45% gc time, 81.75% compilation time)

julia> @btime sort($df, [:x, :y]);
  272.999 ms (64998 allocations: 84.21 MiB)

julia> @time sort(df, [:x, order(:y, rev=true)]);
  1.499732 seconds (58.89 M allocations: 1008.693 MiB, 3.29% gc time, 33.07% compilation time)

julia> @btime sort($df, [:x, order(:y, rev=true)]);
  1.031 s (58215802 allocations: 971.52 MiB)

julia> function mwedates()
                #build the sample
                dts = reduce(vcat, [[Date(2011,11,11) + Day(i) for j in 1:10^4] for i in 1:100])
                mdts = dts |> Vector{Union{Date, Missing}}
                id = reduce(vcat, [[j for j in 1:10^4] for i in 1:100])
                df = DataFrame(date=dts, mdate = mdts, id=id)

                #shuffle
                df = df[randperm(10^6), :]

                print("sort date and id: ")
                @btime sort($df, [:date, :id])
                print("sort date(with missings) and id: ")
                @btime sort($df, [:mdate, :id])
                print("work around performance: ")
                @btime begin
                  $df.mdateconverted = $df.mdate |> Vector{Date}
                  sort($df, [:mdateconverted, :id])
                end
              end
mwedates (generic function with 1 method)

julia> mwedates();
sort date and id:   379.087 ms (64997 allocations: 92.79 MiB)
sort date(with missings) and id:   2.419 s (105424568 allocations: 1.66 GiB)
work around performance:   385.658 ms (65005 allocations: 108.05 MiB)

julia> df = DataFrame(ones(10,1000), :auto);

julia> @time sort(df);
  0.606793 seconds (152.38 k allocations: 9.070 MiB, 99.87% compilation time)

julia> @btime sort(df);
  258.662 μs (1999 allocations: 254.09 KiB)

julia> df = DataFrame(ones(Int, 10000, 100), :auto);

julia> @time sort(df);
  0.431671 seconds (419.29 k allocations: 30.190 MiB, 1.32% gc time, 98.73% compilation time)

julia> @btime sort(df);
  3.117 ms (326 allocations: 7.72 MiB)

julia> df = DataFrame(ones(Bool, 100000, 15), :auto);

julia> @time sort(df);
  0.452559 seconds (474.43 k allocations: 27.729 MiB, 1.04% gc time, 99.14% compilation time)

julia> @btime sort(df);
  3.281 ms (74 allocations: 2.20 MiB)

bkamins · 2021-05-16T16:58:22Z

@nalimilan - no rush, but it would be good to review and merge this, as it fixes a bug in issorted.

nalimilan · 2021-05-16T17:23:26Z

src/abstractdataframe/sort.jl

+function Sort.lt(o::DFPerm{<:Any, <:Tuple}, a, b)
+    ord = o.ord
+    cols = o.cols
+    length(cols) > 16 && return  unstable_lt(ord, cols, a, b)


Add a comment explaining how the 16 threshold was chosen?

Suggested change

-    length(cols) > 16 && return unstable_lt(ord, cols, a, b)
+    # if there are too many columns fall back to type unstable mode to avoid high compilation cost
+    # it is expected that in practice users sort data frames on only few columns
+    length(cols) > 16 && return unstable_lt(ord, cols, a, b)

Added. I have not tuned 16 specifically; I just assume that 16 is a safe threshold. We could probably use a higher number here, but I think that normally one does not sort on more than something like 4 columns.
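The threshold pattern from the diff can be sketched in isolation. Everything except the value 16 is an assumption made for illustration: the names `MAX_UNROLLED_COLS`, `lt_stable`, `lt_fallback` and `row_lt` are hypothetical, and the type-stable path is written with tuple recursion only to keep the example short.

```julia
using Base.Order: Forward, lt

const MAX_UNROLLED_COLS = 16   # hypothetical constant name; 16 is the value in the diff

# Type-stable path (illustrative): specialized on the tuple types.
lt_stable(::Tuple{}, ::Tuple{}, a, b) = false
function lt_stable(ords::Tuple, cols::Tuple, a, b)
    o, c = first(ords), first(cols)
    lt(o, c[a], c[b]) && return true
    lt(o, c[b], c[a]) && return false
    return lt_stable(Base.tail(ords), Base.tail(cols), a, b)
end

# Fallback path (illustrative): a plain loop. With heterogeneous columns the
# element types are only known at run time, but compilation stays cheap.
function lt_fallback(ords::Tuple, cols::Tuple, a, b)
    for i in 1:length(cols)
        o, c = ords[i], cols[i]
        lt(o, c[a], c[b]) && return true
        lt(o, c[b], c[a]) && return false
    end
    return false
end

# Use the specialized path only for a small number of sort columns.
row_lt(ords::Tuple, cols::Tuple, a, b) =
    length(cols) > MAX_UNROLLED_COLS ? lt_fallback(ords, cols, a, b) :
                                       lt_stable(ords, cols, a, b)

# 20 columns that tie everywhere except in the last column.
wide_cols = ntuple(i -> [1, 1, i == 20 ? 2 : 1], 20)
wide_ords = ntuple(_ -> Forward, 20)
row_lt(wide_ords, wide_cols, 1, 3)   # decided by column 20 via the fallback
```

The cut-off trades run-time speed on wide sorts for bounded compilation cost, which fits the expectation stated above that users rarely sort on more than a handful of columns.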

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

NEWS.md

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2021-05-30T21:36:47Z

Thank you!

Fix type instability in sort for few columns case

ee56676

bkamins added the performance label May 2, 2021

bkamins added this to the patch milestone May 2, 2021

bkamins mentioned this pull request May 2, 2021

Slow sorts in columns with Union{<:Any, missing} even if no missing values in the column #2745

Closed

Apply suggestions from code review

b9ac3ec

use recursion

607b89b

add tests

ce18e13

bkamins marked this pull request as ready for review May 3, 2021 08:58

Update src/other/precompile.jl

88a99bc

nalimilan reviewed May 3, 2021

Update src/abstractdataframe/sort.jl

9489491

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

refactor sort to limit recursion and fix bug in issorted

e1241d1

bkamins changed the title from "Fix type instability in sort for few columns case" to "Fix type instability in sort for few columns case and fix issorted bug" May 4, 2021

bkamins added the bug label May 4, 2021

nalimilan reviewed May 16, 2021

bkamins and others added 2 commits May 16, 2021 19:40

Update src/abstractdataframe/sort.jl

786db70

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Merge branch 'main' into bkamins-patch-1-1

a16741d

add NEWS.md mention

d53e6ae

nalimilan approved these changes May 27, 2021

Update NEWS.md

38ffe35

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins merged commit 4389c04 into main May 30, 2021

bkamins deleted the bkamins-patch-1-1 branch May 30, 2021 21:36
