Skip to content

[BREAKING] Multicolumn transformations for GoupedDataFrame#2481

Merged
bkamins merged 28 commits intoJuliaData:masterfrom
bkamins:multicolumn_grouped_transform
Nov 1, 2020
Merged

[BREAKING] Multicolumn transformations for GoupedDataFrame#2481
bkamins merged 28 commits intoJuliaData:masterfrom
bkamins:multicolumn_grouped_transform

Conversation

@bkamins
Copy link
Copy Markdown
Member

@bkamins bkamins commented Oct 12, 2020

Fixes #2410

I am opening now to make sure that 0.22 branch of a13db50 does not get lost.

This PR is not finished. I have implemented all (in particular allowing returning multiple columns from functions) except handling AsTable and multipe Symbols as destination columns (this still needs to be implemented as it requires new logic)

@bkamins bkamins added this to the 1.0 milestone Oct 12, 2020
@bkamins bkamins added breaking The proposed change is breaking. grouping priority labels Oct 12, 2020
@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Oct 13, 2020

OK - this PR is not cleaned, tested, nor documented yet, but the promised functionality seems (remember - not tested 😄)) to work, so please feel free to experiment with it and comment if you catch something surprising. Thank you!

@bkamins bkamins mentioned this pull request Oct 13, 2020
20 tasks
@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Oct 15, 2020

@pdeffebach - with this PR we ensure the following invariant (which I think is relevant for DataFramesMeta.jl design). The result of:

transform(df, args...)

is always the same as the result of

transform(groupby(df, cols), args...)

(also up to errors - if one errors the other also errors)

if args is ByRow transformation.

@pdeffebach
Copy link
Copy Markdown
Contributor

Wait, why would we want that?

What if I want to perform the operation x .- mean(x)? We put a lot of effort into ensuring transform works by groups.

@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Oct 15, 2020

Right - I forgotten to add that args must be ByRow. I will update.

@pdeffebach
Copy link
Copy Markdown
Contributor

Oh good. You scared me! Yes having the same with ByRow is the correct behavior. Thanks!

@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Oct 15, 2020

I still have a lot of tests to write (but now the PR should pass the tests) and documentation to update.
However, @nalimilan - can you please review the changes in docs/src/man/split_apply_combine.md as correcting the documentation is usually a lengthy process, and probably we can reuse things we work-out there in the docstrings later. Thank you!

(all others: feel free to have a look at docs/src/man/split_apply_combine.md if you are interested as it specifies how new rules work)

@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Oct 18, 2020

@nalimilan - documentation updates part 2 (and hopefully final) is pushed here. I made adjustments to the manual per the comments given + I have refactored the docstrings. Now they are consistent with the manual and reuse the same template. This has three benefits I believe:

  1. it will be easier to maintain
  2. the docstrings take less space (which is relevant as the files are very large anyway)
  3. the docstrings show a comparison of similarities and differences between available functions

Another review of documentation would be appreciated (after it is done I will review the tests of the functionality to make sure we cover everything and the PR then will be good for a final review).

@bkamins bkamins marked this pull request as ready for review October 19, 2020 15:39
@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Oct 19, 2020

I have added additional tests of correctness of the functionality we expose. This should be good for a whole code review.

@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Oct 22, 2020

This PR is holding implementation of where PR.

EDIT: I managed to get it without needing this, see #2496

@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Oct 28, 2020

@nalimilan I have incorportated all the recommended changes apart from syncing manual and docs - I will do it when manual is finalized. It should be ready for another round of reviews. Thank you!

Comment on lines +49 to +52
is undefined (a typical case is that they follow the order of appreance of
respecive values in the grouping columns, but a notable exception is when the
columns are `PooledVector`s, in which case they are ordered accoring to the `pool`
field in these vectors)
Copy link
Copy Markdown
Member

@nalimilan nalimilan Oct 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's true: in general the order is really undefined due to the dict-like grouping fallback. And the pool of PooledDataArray isn't exposed to users so it's not really useful to tell give them this information.

EDIT: I was wrong, for some reason I hadn't realize that the fallback grouping method uses the order of appearance.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the description to make it more precise.

Comment on lines +143 to +145
(not all keyword arguments are supported in all cases; in general they are allowed
in situations when they are meaningful, see the documentation of the specific functions
for details):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how about making this part specific to each function's docstring so that we never mention an argument that a function doesn't support?

(BTW I don't see the change to "signature".)

bkamins and others added 2 commits October 30, 2020 15:06
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Oct 30, 2020

@nalimilan - I have incorporated the comments. Let me know if it is OK to move the descriptions from the manual to the docstring. Thank you!

@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Nov 1, 2020

TODO: add NEWS.md entry (when @nalimilan approves moving the manual text into the docstrings).

@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Nov 1, 2020

All suggestions are applied, manual and dosstrings are synchronized, and NEWS.md is updated.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit 5224175 into JuliaData:master Nov 1, 2020
@bkamins bkamins deleted the multicolumn_grouped_transform branch November 1, 2020 18:22
@bkamins
Copy link
Copy Markdown
Member Author

bkamins commented Nov 1, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking The proposed change is breaking. grouping priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BREAKING] Handling of ByRow

3 participants