[BREAKING] remove median and nunique from describe by default #2339
bkamins merged 4 commits into JuliaData:master
|
(in particular note that even if we start adding support to threading in DataFrames.jl - which is a plan according to the responses in https://discourse.julialang.org/t/dataframes-jl-development-survey/44022) still |
|
I have also switched |
|
Dropping the number of unique values sounds fine, but I'm a bit more reluctant to drop the median. It's a more robust indicator than the mean and it would be too bad not to report it by default just because in some cases it will be slow. I assume in most cases it should be fast enough. FWIW, the carefully designed skimr R package has an interesting approach: Minimum and maximum are reported as being the 0% and 100% percentiles, so the median is between these two. Additionally, the 25% and 75% quantiles are reported. (The rate of complete observations is somewhat redundant IMO.) What do you think? |
|
The problem is (for the same size of data): so I would not include it for performance reasons. I will revert the |
|
0% and 100% quantiles can be computed using minimum and maximum as currently, so I wouldn't include them in the timing. Also, it turns out that |
|
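A minimal sketch of the point about extreme quantiles, using only the Statistics standard library: the 0% and 100% quantiles coincide with the minimum and maximum, so only the interior quantiles need the more expensive `quantile` call. The vector `x` is made-up illustration data, not taken from this thread.

```julia
using Statistics

# Made-up illustration data (an assumption, not from the discussion)
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

# The extreme quantiles are just the minimum and maximum
lo, hi = extrema(x)

# Interior quantiles: 25%, 50% (the median) and 75%
q25, q50, q75 = quantile(x, [0.25, 0.5, 0.75])

println((min = lo, q25 = q25, median = q50, q75 = q75, max = hi))
```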
Actually I thought that computing Anyway in this PR I would focus on dropping things (and I understand we agree to drop |
|
re-introduced |
|
Thank you! |
Fixes #2269
As decided there, we just drop computing :median and :nunique by default. This is the simplest change to make, and if someone wants them it is easy to opt in.
To get a sense of the relative performance impact of dropping these, consider:
which starts to become prohibitive once you have even a few dozen variables.
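For readers who still want these statistics, opting in could look like the sketch below. It assumes a DataFrames.jl version in which `describe` accepts statistic names as positional Symbol arguments; the toy data frame `df` is hypothetical and only there for illustration.

```julia
using DataFrames

# Hypothetical toy data, for illustration only
df = DataFrame(a = [1, 2, 3, 4], b = ["x", "y", "x", "z"])

# Request the statistics explicitly instead of relying on the defaults
stats = describe(df, :mean, :median, :nunique)

# Or ask for every statistic that describe supports
all_stats = describe(df, :all)

println(stats)
```

Naming the statistics explicitly keeps the call independent of whatever the default set is in a given release.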