Conversation
|
I think the logic is fine, though it's not easy to check without a docstring or tests. :-) I'm not sure about the name. It's true that |
|
@nalimilan I have added tests. The PR is good for review now. |
| # levels drops missing, handle the case where missing values are present | ||
| # All levels are retained, missing is added only if present | ||
| if any(ismissing, df[!, col]) | ||
| tempcol = vcat(levels(df[!, col]), missing) |
There was a problem hiding this comment.
I think this has come up several times before: it would make sense to add an argument to levels to preserve missing so that we don't have to do this, which is suboptimal for performance.
test/data.jl
Outdated
| @test isequal_coltyped(completecombinations(df1, [:a, :b], allcols=true, fill="X", allowduplicates=ad), | ||
| DataFrame(a=[1, 2, missing, 1, 2, missing], | ||
| b=[1, 1, 1, 2, 2, 2], | ||
| c=[11, 12, "X", "X", "X", missing], | ||
| d=[111, 112, "X", "X", "X", 113])) |
There was a problem hiding this comment.
By default dplyr also replaces missing values in the input with fill (explicit=TRUE). But I find this weird so I agree what you do is better.
There was a problem hiding this comment.
Yes - I made this choice intentionally. The same behavior of fill is in unstack.
test/data.jl
Outdated
| @test isequal_coltyped(completecombinations(df1, :c, allowduplicates=ad), | ||
| DataFrame(c=categorical([12, 11, 10, missing]))) | ||
| @test isequal_coltyped(completecombinations(df1, [:c, :b], allowduplicates=ad), | ||
| DataFrame(c=categorical([12, 11, 10, missing, 12, 11, 10, missing]), |
There was a problem hiding this comment.
Test that levels are preserved?
| duplicates are allowed. They are not repeated if `allcols` is `false` | ||
| only unique combinations are produced then, but if `allcols` is `true` | ||
| the duplicates are included. |
There was a problem hiding this comment.
This behavior differs from dplyr's complete, which retains duplicates. Any particular reason to drop them? Or should we allow three options about how duplicates should be handled?
There was a problem hiding this comment.
It is a consequence of the current design. I.e. it is natural to drop them in the way it is implemented.
But we can discuss what to do.
Now I am thinking that maybe we do not need the allcols kwarg and can simplify the API. Maybe we should have allcols=true always and expect from the user to pass a data frame only with columns that are wanted in the output?
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
|
If duplicates are allowed they are always kept. also since we always keep all cols the column order of the source data frame is retained (earlier I put the expanded columns first, but since now we always keep all columns I think it makes more sense to keep their original order). This should be good for a review. |
|
@nalimilan - no rush, but it would be good to finalize this PR before we forget what we wanted. Thank you! |
nalimilan
left a comment
There was a problem hiding this comment.
Looks good! Just minor comments.
Maybe we should check that the name sounds OK by asking people on Slack?
test/data.jl
Outdated
| @test isequal_coltyped(completecombinations(df1, [:c, :b], allowduplicates=ad), | ||
| DataFrame(a=[2; 1; fill(missing, 6)], | ||
| b=[1, 1, 1, 1, 2, 2, 2, 2], | ||
| c=categorical([12, 11, 10, missing, 12, 11, 10, missing]), | ||
| d=[112; 111; fill(missing, 5); 113])) |
There was a problem hiding this comment.
I don't see the test for levels. Am I missing it?
There was a problem hiding this comment.
I have added the tests - as usual tricky cases are present.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
|
Proposal for the name from Slack I like best is Other: |
|
@nalimilan - so what do you think about the best name for the function given the discussion? |
|
I also prefer |
|
OK - changed to |
|
Thank you! |
Update of #1864.