
Add standard deviation and 25% and 75% quantiles to describe :detailed #2459

Merged
bkamins merged 1 commit into main from nl/describe on Nov 5, 2021


@nalimilan
Member

Dispersion statistics are essential to assess the distribution of a variable.
This is inspired by the skimr R package. Note that Stata's summarize includes the standard deviation too (but no quantiles, not even the median).

I've been wanting to try this for some time. The drawback is that this results in a wider table (~120 characters). skimr only uses one or two decimals and it doesn't print column types, which gives a significantly narrower result (note that the example printed in the doctest below is narrower than the standard case since quantiles are exactly represented with only two decimals). This problem will be alleviated a bit if we stop printing vertical bars (saving up to 18 chars).

@bkamins
Member

bkamins commented Sep 30, 2020

Can you please update the NEWS.md entry (I think we already have something about describe).

Member

@bkamins bkamins left a comment

What is the impact on performance here, especially of computing quantiles?
In general I am 100% OK with adding std. As for Q25 and Q75, the question is how often they are really needed by default.

@nalimilan
Member Author

Performance benchmarks are a bit weird, as calling describe without arguments is slower than calling it with the same arguments, both on master and on this PR. Also I've made it a bit faster to compute only the median, as before we always computed q25 and q75 even when not used. But overall the impact isn't that large, so I'd say it's mostly a design decision.

# master
julia> df = DataFrame(rand(10, 1_000_000));

julia> @btime describe(df);
  4.183 s (14999664 allocations: 1.24 GiB)

julia> @btime describe(df, :mean, :min, :median, :max, :nmissing, :eltype);
  3.557 s (14999720 allocations: 1.24 GiB)

julia> @btime describe(df, :mean, :std, :min, :median, :max, :nmissing, :eltype);
  3.826 s (19999734 allocations: 1.32 GiB)

julia> @btime describe(df, :mean, :std, :min, :q25, :median, :q75, :max, :nmissing, :eltype);
  5.041 s (19999765 allocations: 1.33 GiB)

# This PR
julia> df = DataFrame(rand(10, 1_000_000));

julia> @btime describe(df);
  5.356 s (20999682 allocations: 1.35 GiB)

julia> @btime describe(df, :mean, :min, :median, :max, :nmissing, :eltype);
  3.103 s (13999720 allocations: 1.21 GiB)

julia> @btime describe(df, :mean, :std, :min, :median, :max, :nmissing, :eltype);
  3.495 s (18999734 allocations: 1.29 GiB)

julia> @btime describe(df, :mean, :std, :min, :q25, :median, :q75, :max, :nmissing, :eltype);
  4.226 s (20999765 allocations: 1.35 GiB)
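
A comparison of this kind can be sketched on a current DataFrames.jl release as follows. This is only a rough approximation of the runs above: it assumes a version where the `:auto` argument is needed to build a data frame from a matrix, it uses a much smaller matrix so that it finishes quickly, and it relies on Base's `@elapsed` rather than `@btime`. The names `stats_default` and `stats_quartile` are made up for the example.

```julia
using DataFrames

# 10 rows × 1_000 columns, much smaller than the 10 × 1_000_000 matrix above.
# `:auto` generates the column names x1, x2, ... (assumes a recent release).
df = DataFrame(rand(10, 1_000), :auto)

# Hypothetical names for the two sets of statistics being compared
stats_default  = (:mean, :min, :median, :max, :nmissing, :eltype)
stats_quartile = (:mean, :std, :min, :q25, :median, :q75, :max, :nmissing, :eltype)

# Warm up once so that compilation time is not part of the measurement
describe(df, stats_default...)
describe(df, stats_quartile...)

t_default  = @elapsed describe(df, stats_default...)
t_quartile = @elapsed describe(df, stats_quartile...)

println("median only:    ", t_default, " s")
println("with quartiles: ", t_quartile, " s")
```

A single `@elapsed` measurement is noisy, so repeated runs (or BenchmarkTools, as used above) give a more reliable picture.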

@bkamins
Member

bkamins commented Sep 30, 2020

OK - then let us go for it.

@tk3369
Contributor

tk3369 commented Sep 30, 2020

I like std but generally don’t look at q25/q75 by default. Performance is more important to me.

@pdeffebach
Contributor

This is fine. Stata keeps :q25 and :q75 behind the detail option, so it's natural we do as well.

@bkamins
Member

bkamins commented Oct 29, 2020

@nalimilan - so what do we do? Keep or drop Q25 and Q75? (I am slightly leaning towards dropping them, but I am OK with both)

@bkamins
Member

bkamins commented Nov 7, 2020

bump - we can leave it out of 0.22 release, but I would prefer to merge it - whatever you prefer to keep here.

@bkamins bkamins added the breaking The proposed change is breaking. label Nov 7, 2020
@bkamins bkamins added this to the 1.0 milestone Nov 7, 2020
@bkamins bkamins changed the title from "RFC: Add standard deviation and 25% and 75% quantiles to describe default" to "[BREAKING] RFC: Add standard deviation and 25% and 75% quantiles to describe default" Nov 7, 2020
@bkamins bkamins mentioned this pull request Nov 7, 2020
20 tasks
@bkamins
Member

bkamins commented Nov 7, 2020

@nalimilan - please also update the manual, as we show describe there.

@nalimilan
Member Author

Actually I'm not sure it's a good idea. Maybe a better solution would be to support :detailed as a shortcut for adding standard deviation and quartiles, and keep the default output relatively short.

@bkamins
Member

bkamins commented Nov 8, 2020

we already have :all. So :detailed would be "in the middle"?

@bkamins
Member

bkamins commented Nov 8, 2020

OK - so this is non-breaking then and can be post 0.22. I understand that you will make the changes? Thank you!

@bkamins bkamins changed the title from "[BREAKING] RFC: Add standard deviation and 25% and 75% quantiles to describe default" to "RFC: Add standard deviation and 25% and 75% quantiles to describe :detailed" Nov 8, 2020
@bkamins bkamins added non-breaking The proposed change is not breaking and removed breaking The proposed change is breaking. labels Nov 8, 2020
@pdeffebach
Contributor

Just looked at this and I don't think it's a good idea.

Summary statistics with the new printing are sitting at 77 characters. I think the addition of :q25 and :q75 would make it too wide and harder to read.

@bkamins
Member

bkamins commented Nov 13, 2020

> Just looked at this and I don't think it's a good idea.

What do you refer to exactly?

To summarize, the conclusion from the discussion is:

  1. leave the default as is.
  2. add a :detailed option (similar to :all) that would add std, Q25 and Q75 to the printed statistics

It seems from your comment that you would be OK with this approach.
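
As a minimal sketch of the first point, assuming DataFrames.jl 1.3 or later: the default call keeps the short set of statistics, while the dispersion statistics can still be requested explicitly by name. The data frame and the variable names `default_stats` and `dispersion` are made up for illustration.

```julia
using DataFrames

# Small illustrative data frame (hypothetical data)
df = DataFrame(a = 1:10, b = collect(0.1:0.1:1.0))

# Default call: the short set of statistics stays unchanged
default_stats = describe(df)

# Standard deviation and quartiles requested explicitly
dispersion = describe(df, :std, :q25, :q75)

println(names(default_stats))
println(names(dispersion))
```

The default result has no std, q25 or q75 columns, so its width is unaffected by the new statistics.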

@pdeffebach
Contributor

Yeah sorry, I am happy with detail (though hope Stata doesn't get mad a la google v oracle). Just not as default.

@bkamins bkamins modified the milestones: 1.0, 1.x Jan 8, 2021
Member

@bkamins bkamins left a comment

This PR requires resolving merge conflicts.

@nalimilan - let us make a decision if we want this change or not for the 1.3 release so that the PR does not just stand open too long.

If we want to merge it, then a NEWS.md entry is also required.

@bkamins bkamins modified the milestones: 1.x, 1.3 Nov 4, 2021
@bkamins bkamins added breaking The proposed change is breaking. and removed non-breaking The proposed change is not breaking labels Nov 4, 2021
This prints all statistics except the first and last value.
@nalimilan
Member Author

I've pushed a new version which supports :detailed. The only difference with :all is that the first and last value are not reported, but these are likely to be less interesting as often datasets are not ordered.
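
A short sketch of how the two shortcuts compare, assuming DataFrames.jl 1.3 or later and a made-up data frame. The set of statistics that :all reports beyond :detailed may differ between releases, so the example only checks for the first and last values mentioned above.

```julia
using DataFrames

# Small illustrative data frame (hypothetical data)
df = DataFrame(a = 1:10, b = collect(0.1:0.1:1.0))

detailed   = describe(df, :detailed)
everything = describe(df, :all)

# Statistics reported by :all but left out of :detailed
extra = setdiff(names(everything), names(detailed))

println(names(detailed))
println(extra)
```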

@nalimilan nalimilan changed the base branch from master to main November 5, 2021 18:40
@nalimilan nalimilan changed the title from "RFC: Add standard deviation and 25% and 75% quantiles to describe :detailed" to "Add standard deviation and 25% and 75% quantiles to describe :detailed" Nov 5, 2021
@nalimilan nalimilan closed this Nov 5, 2021
@nalimilan nalimilan reopened this Nov 5, 2021
@bkamins bkamins merged commit 69a7b4c into main Nov 5, 2021
@bkamins bkamins deleted the nl/describe branch November 5, 2021 20:46
@bkamins
Member

bkamins commented Nov 5, 2021

Thank you!


Labels

breaking The proposed change is breaking.


4 participants