[BREAKING] remove median and nunique from describe by default by bkamins · Pull Request #2339 · JuliaData/DataFrames.jl

bkamins · 2020-07-31T11:16:07Z

As decided there we just drop computing :median and :nunique by default.
This is simplest to change, and if someone wants them it is easy to opt-in.

Just to get a relative impact on performance of dropping this consider:

julia> x = rand(10^8);

julia> @time mean(x)
  0.055976 seconds (1 allocation: 16 bytes)
0.5000065333980973

julia> @time median(x)
  2.046998 seconds (3 allocations: 762.940 MiB, 8.01% gc time)
0.5000076592923275

julia> @time length(unique(x))
 24.142629 seconds (112 allocations: 4.903 GiB, 0.49% gc time)
100000000

which when you have even several dozens of variables only starts to be prohibitive.

bkamins · 2020-07-31T11:18:03Z

(in particular note that even if we start adding support to threading in DataFrames.jl - which is a plan according to the responses in https://discourse.julialang.org/t/dataframes-jl-development-survey/44022) still :median and :nunique will be problematic because of their memory usage (but this is another issue).

bkamins · 2020-07-31T11:29:40Z

I have also switched unique to Set, which is faster, but still prohibitive:

julia> @time length(Set(x));
 13.259684 seconds (14 allocations: 3.375 GiB, 1.79% gc time)

nalimilan · 2020-07-31T16:11:30Z

Dropping the number of unique values sounds fine, but I'm a bit more reluctant to drop the median. It's a more robust indicator than the mean and it would be too bad not to report it by default just because in some cases it will be slow. I assume in most cases it should be fast enough.

FWIW, the carefully designed skimr R package has an interesting approach:

n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist

Minimum and maximum are reported as being the 0% and 100% percentiles, so the median is between these two. Additionally, the 25% and 75% quantiles are reported. (The rate of complete observations is somewhat redundant IMO.) What do you think?

bkamins · 2020-07-31T17:21:45Z

The problem is (for the same size of data):

julia> @time quantile(x, 0:0.25:1);
 20.659685 seconds (4 allocations: 762.940 MiB, 0.42% gc time)

so I would not include it for performance reasons.

I will revert the median as you propose, however it will mean that for 10^8 rows and 100 columns (probably the maximum size normally we should consider) it adds 5 minutes to computing time of the results.

nalimilan · 2020-07-31T17:30:34Z

0% and 100% quantiles can be computed using minimum and maximum as currently, so I wouldn't include them in the timing. Also, it turns out that quantile(x, [0.25, 0.5, 0.75]) is twice slower than quantile(x, 0.25) + quantile(x, 0.5) + quantile(x, 0.75), which is completely absurd and needs to be investigated. But yes, computing two more quantiles would be somewhat slower.

bkamins · 2020-07-31T17:45:23Z

Actually I thought that computing quantile(x, 0:0.25:1) would be fastest and then you would not compute min and max as you would get it for free. It definitely should be investigated why this slows down.

Anyway in this PR I would focus on dropping things (and I understand we agree to drop nunique and keep median by default). Let us come back to adding things when we understand better the issues with quantile in Statistics. OK?

bkamins · 2020-07-31T19:56:47Z

re-introduced :median

bkamins · 2020-08-02T07:24:17Z

Thank you!

remove median and nunique from describe by default

a2281f0

bkamins added this to the 1.0 milestone Jul 31, 2020

bkamins added breaking The proposed change is breaking. performance labels Jul 31, 2020

change unique to Set

424bf39

bkamins mentioned this pull request Jul 31, 2020

describe on columns with e.g. Integers #2269

Closed

bkamins added 2 commits July 31, 2020 21:05

reintroduce median

bf7f78a

add comma

8f81d15

nalimilan approved these changes Jul 31, 2020

View reviewed changes

bkamins merged commit 20f1125 into JuliaData:master Aug 2, 2020

bkamins deleted the faster_describe branch August 2, 2020 07:24

bkamins changed the title ~~remove median and nunique from describe by default~~ [BREAKING] remove median and nunique from describe by default Aug 7, 2020

JuliaRegistrator mentioned this pull request Nov 15, 2020

New version: DataFrames v0.22.0 JuliaRegistries/General#24650

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BREAKING] remove median and nunique from describe by default#2339

[BREAKING] remove median and nunique from describe by default#2339
bkamins merged 4 commits intoJuliaData:masterfrom
bkamins:faster_describe

bkamins commented Jul 31, 2020

Uh oh!

bkamins commented Jul 31, 2020

Uh oh!

bkamins commented Jul 31, 2020

Uh oh!

nalimilan commented Jul 31, 2020

Uh oh!

bkamins commented Jul 31, 2020

Uh oh!

nalimilan commented Jul 31, 2020

Uh oh!

bkamins commented Jul 31, 2020

Uh oh!

bkamins commented Jul 31, 2020

Uh oh!

bkamins commented Aug 2, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bkamins commented Jul 31, 2020

Uh oh!

bkamins commented Jul 31, 2020

Uh oh!

bkamins commented Jul 31, 2020

Uh oh!

nalimilan commented Jul 31, 2020

Uh oh!

bkamins commented Jul 31, 2020

Uh oh!

nalimilan commented Jul 31, 2020

Uh oh!

bkamins commented Jul 31, 2020

Uh oh!

bkamins commented Jul 31, 2020

Uh oh!

bkamins commented Aug 2, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants