Add standard deviation and 25% and 75% quantiles to `describe` :detailed by nalimilan · Pull Request #2459 · JuliaData/DataFrames.jl

nalimilan · 2020-09-30T17:01:41Z

Dispersion statistics are essential to assess the distribution of a variable.
This is inspired by the skimr R package. Note that Stata's summarize includes the standard deviation too (but no quantiles, not even the median).

I've been wanting to try this for some time. The drawback is that this results in a wider table (~120 characters). skimr only uses one or two decimals and it doesn't print column types, which gives a significantly narrower result (note that the example printed in the doctest below is narrower than the standard case, since the quantiles are exactly represented with only two decimals). This problem will be alleviated a bit if we stop printing vertical bars (saving up to 18 chars).

bkamins · 2020-09-30T17:07:57Z

Can you please update the NEWS.md entry (I think we already have something about describe).

bkamins

What is the impact on performance here? Especially of computing quantiles?
In general I am 100% OK with adding std. As for Q25 and Q75, the question is how often they are really needed by default.

nalimilan · 2020-09-30T19:40:54Z

Performance benchmarks are a bit weird, as calling describe without arguments is slower than calling it with the same statistics passed explicitly, both on master and on this PR. Also, I've made it a bit faster to compute only the median, as before we always computed q25 and q75 even when they were not used. But overall the impact isn't that large, so I'd say it's mostly a design decision.

# master
julia> df = DataFrame(rand(10, 1_000_000));

julia> @btime describe(df);
  4.183 s (14999664 allocations: 1.24 GiB)

julia> @btime describe(df, :mean, :min, :median, :max, :nmissing, :eltype);
  3.557 s (14999720 allocations: 1.24 GiB)

julia> @btime describe(df, :mean, :std, :min, :median, :max, :nmissing, :eltype);
  3.826 s (19999734 allocations: 1.32 GiB)

julia> @btime describe(df, :mean, :std, :min, :q25, :median, :q75, :max, :nmissing, :eltype);
  5.041 s (19999765 allocations: 1.33 GiB)

# This PR
julia> df = DataFrame(rand(10, 1_000_000));

julia> @btime describe(df);
  5.356 s (20999682 allocations: 1.35 GiB)

julia> @btime describe(df, :mean, :min, :median, :max, :nmissing, :eltype);
  3.103 s (13999720 allocations: 1.21 GiB)

julia> @btime describe(df, :mean, :std, :min, :median, :max, :nmissing, :eltype);
  3.495 s (18999734 allocations: 1.29 GiB)

julia> @btime describe(df, :mean, :std, :min, :q25, :median, :q75, :max, :nmissing, :eltype);
  4.226 s (20999765 allocations: 1.35 GiB)

bkamins · 2020-09-30T19:48:07Z

OK - then let us go for it.

tk3369 · 2020-09-30T20:18:53Z

I like std but generally don’t look at q25/q75 by default. Performance is more important to me.

pdeffebach · 2020-09-30T20:19:07Z

This is fine. Stata keeps :q25 and :q75 behind the detail option, so it's natural that we do as well.

bkamins · 2020-10-29T20:40:23Z

@nalimilan - so what do we do? Keep or drop Q25 and Q75? (I am slightly leaning towards dropping them, but I am OK with both.)

bkamins · 2020-11-07T15:26:53Z

bump - we can leave it out of 0.22 release, but I would prefer to merge it - whatever you prefer to keep here.

bkamins · 2020-11-07T21:54:32Z

@nalimilan - please also update the manual, as we show describe there.

nalimilan · 2020-11-08T17:14:00Z

Actually I'm not sure it's a good idea. Maybe a better solution would be to support :detailed as a shortcut for adding standard deviation and quartiles, and keep the default output relatively short.

bkamins · 2020-11-08T17:34:10Z

we already have :all. So :detailed would be "in the middle"?

bkamins · 2020-11-08T22:17:05Z

OK - so this is non-breaking then and can be post 0.22. I understand that you will make the changes? Thank you!

pdeffebach · 2020-11-13T14:50:51Z

Just looked at this and I don't think it's a good idea.

Summary statistics with the new printing are sitting at 77 characters. I think the addition of :q25 and :q75 would make it too wide and harder to read.

bkamins · 2020-11-13T15:53:08Z

> Just looked at this and I don't think it's a good idea.

What do you refer to exactly?

To summarize, the conclusion from the discussion is:

- leave the default as is;
- add a :detailed option (similar to :all) that would add std, Q25 and Q75 to the printed statistics.

It seems from your comment that you would be OK with this approach.

pdeffebach · 2020-11-13T15:53:48Z

Yeah sorry, I am happy with detail (though I hope Stata doesn't get mad, à la Google v. Oracle). Just not as the default.

bkamins

This PR requires resolving merge conflicts.

@nalimilan - let us make a decision if we want this change or not for the 1.3 release so that the PR does not just stand open too long.

If we want to merge it, then a NEWS.md entry is also required.

nalimilan · 2021-11-05T18:39:58Z

I've pushed a new version which supports :detailed. The only difference with :all is that the first and last value are not reported, but these are likely to be less interesting as often datasets are not ordered.
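For readers following along, here is a minimal sketch of how the option described above can be used. It assumes DataFrames.jl 1.3 or later (the release this PR was targeted at); the toy data frame and variable names are made up for illustration and do not come from the PR.

```julia
using DataFrames

# Hypothetical toy data, not taken from the PR.
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0], g = ["a", "b", "a", "b"])

# Default summary: left as it was, so no :std, :q25 or :q75 columns.
default_stats = describe(df)

# :detailed adds dispersion statistics such as :std, :q25 and :q75,
# but leaves out :first and :last (which :all would report).
detailed_stats = describe(df, :detailed)

# Individual statistics can still be requested explicitly.
custom_stats = describe(df, :mean, :std, :q25, :median, :q75)

println(names(detailed_stats))
```

For the non-numeric column `g`, statistics such as the standard deviation are not defined, so the corresponding cells should simply be empty (`nothing`).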

bkamins · 2021-11-05T20:46:17Z

Thank you!

bkamins reviewed Sep 30, 2020

bkamins approved these changes Sep 30, 2020

bkamins added the breaking label ("The proposed change is breaking.") Nov 7, 2020

bkamins added this to the 1.0 milestone Nov 7, 2020

bkamins changed the title ~~RFC: Add standard deviation and 25% and 75% quantiles to describe default~~ [BREAKING] RFC: Add standard deviation and 25% and 75% quantiles to describe default Nov 7, 2020

bkamins mentioned this pull request Nov 7, 2020

Release 0.22 tracking #2484

Closed

20 tasks

bkamins changed the title ~~[BREAKING] RFC: Add standard deviation and 25% and 75% quantiles to describe default~~ RFC: Add standard deviation and 25% and 75% quantiles to describe :detailed Nov 8, 2020

bkamins added the non-breaking label ("The proposed change is not breaking") and removed the breaking label ("The proposed change is breaking.") Nov 8, 2020

bkamins modified the milestones: 1.0, 1.x Jan 8, 2021

bkamins requested changes Nov 4, 2021

bkamins modified the milestones: 1.x, 1.3 Nov 4, 2021

bkamins added the breaking label ("The proposed change is breaking.") and removed the non-breaking label ("The proposed change is not breaking") Nov 4, 2021

Support :detailed in describe

3ce9cb1

This prints all statistics except the first and last value.

nalimilan force-pushed the nl/describe branch from 955acbf to 3ce9cb1 November 5, 2021 18:37

nalimilan changed the base branch from master to main November 5, 2021 18:40

nalimilan changed the title ~~RFC: Add standard deviation and 25% and 75% quantiles to describe :detailed~~ Add standard deviation and 25% and 75% quantiles to describe :detailed Nov 5, 2021

nalimilan closed this Nov 5, 2021

nalimilan reopened this Nov 5, 2021

bkamins approved these changes Nov 5, 2021

bkamins merged commit 69a7b4c into main Nov 5, 2021

bkamins deleted the nl/describe branch November 5, 2021 20:46

nalimilan mentioned this pull request Mar 26, 2023

[Feature Request] Add Standard Deviation to summarystats()? JuliaStats/StatsBase.jl#693

Open
