Add standard deviation and 25% and 75% quantiles to describe :detailed #2459
|
Can you please update the NEWS.md entry (I think we already have something about
bkamins
left a comment
What is the impact on performance here? Especially of computing quantiles?
In general I am 100% OK with adding std. As for Q25 and Q75, the question is how often they are really needed by default.
|
Performance benchmarks are a bit weird, as calling
# master
julia> df = DataFrame(rand(10, 1_000_000));
julia> @btime describe(df);
4.183 s (14999664 allocations: 1.24 GiB)
julia> @btime describe(df, :mean, :min, :median, :max, :nmissing, :eltype);
3.557 s (14999720 allocations: 1.24 GiB)
julia> @btime describe(df, :mean, :std, :min, :median, :max, :nmissing, :eltype);
3.826 s (19999734 allocations: 1.32 GiB)
julia> @btime describe(df, :mean, :std, :min, :q25, :median, :q75, :max, :nmissing, :eltype);
5.041 s (19999765 allocations: 1.33 GiB)
# This PR
julia> df = DataFrame(rand(10, 1_000_000));
julia> @btime describe(df);
5.356 s (20999682 allocations: 1.35 GiB)
julia> @btime describe(df, :mean, :min, :median, :max, :nmissing, :eltype);
3.103 s (13999720 allocations: 1.21 GiB)
julia> @btime describe(df, :mean, :std, :min, :median, :max, :nmissing, :eltype);
3.495 s (18999734 allocations: 1.29 GiB)
julia> @btime describe(df, :mean, :std, :min, :q25, :median, :q75, :max, :nmissing, :eltype);
4.226 s (20999765 allocations: 1.35 GiB)
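
To make concrete what the extra columns compute, here is a minimal sketch using only Julia's Statistics standard library. The vector `x` and the variable names are illustrative and not taken from the PR, and this is not necessarily how `describe` computes these values internally. Passing several probabilities to `quantile` in one call avoids sorting the data once per quantile.

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # illustrative data, not from the PR

# Sample standard deviation (divides by n - 1 by default)
s = std(x)

# 25%, 50% and 75% quantiles computed in a single call
q25, q50, q75 = quantile(x, [0.25, 0.5, 0.75])

println((std = s, q25 = q25, median = q50, q75 = q75))
```

On a real data set the quantiles are the expensive part, since they need the data (partially) sorted, whereas `std` only needs a pass over the values.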
|
OK - then let us go for it.
|
I like std but generally don’t look at q25/q75 by default. Performance is more important to me.
|
This is fine. Stata keeps
|
@nalimilan - so what do we do: keep or drop Q25 and Q75? (I am slightly leaning towards dropping them, but I am OK with either.)
|
bump - we can leave it out of the 0.22 release, but I would prefer to merge it - whatever you prefer to keep here.
|
@nalimilan - please also update manual, as we showe
|
Actually I'm not sure it's a good idea. Maybe a better solution would be to support
|
we already have
|
OK - so this is non-breaking then and can be post 0.22. I understand that you will make the changes? Thank you!
|
Just looked at this and I don't think it's a good idea. Summary statistics with the new printing are sitting at 77 characters. I think the addition of
What do you refer to exactly? To summarize, the conclusion from the discussion is:
It seems from your comment that you would be OK with this approach.
|
Yeah sorry, I am happy with
bkamins
left a comment
This PR requires resolving merge conflicts.
@nalimilan - let us decide whether we want this change for the 1.3 release, so that the PR does not stay open too long.
If we want to merge it, then a NEWS.md entry is also required.
This prints all statistics except the first and last value.
955acbf to 3ce9cb1
|
I've pushed a new version which supports
|
Thank you!
Dispersion statistics are essential to assess the distribution of a variable.
This is inspired by the skimr R package. Note that Stata's summarize includes the standard deviation too (but no quantiles, not even the median).

I've been wanting to try this for some time. The drawback is that this results in a wider table (~120 characters). skimr only uses one or two decimals and it doesn't print column types, which gives a significantly narrower result (note that the example printed in the doctest below is narrower than the standard case since quantiles are exactly represented with only two decimals). This problem will be alleviated a bit if we stop printing vertical bars (saving up to 18 chars).
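
For readers who want to try the behaviour discussed here, the following is a minimal sketch. It assumes a DataFrames.jl version that includes this change; the `:detailed` argument is taken from the PR title, and the data frame and variable names are illustrative only. The explicit statistic names (`:std`, `:q25`, `:q75`) are the ones used in the benchmarks above.

```julia
using DataFrames

# Illustrative data, not from the PR
df = DataFrame(a = [1.0, 2.0, 3.0, 4.0, 5.0], b = [10, 20, 30, 40, 50])

# Default summary
default_summary = describe(df)

# Detailed summary: assumed to add the standard deviation and the
# 25% and 75% quantiles, per this PR's title
detailed = describe(df, :detailed)

# The same statistics can also be requested explicitly by name
explicit = describe(df, :mean, :std, :q25, :median, :q75)
```

Requesting statistics by name keeps the default output narrow while still making the dispersion statistics available when needed.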