[BREAKING] fix isagg to correctly use a fast path by bkamins · Pull Request #2357 · JuliaData/DataFrames.jl

bkamins · 2020-08-09T10:47:24Z

This is also related to https://github.com/JuliaLang/Statistics.jl/issues/50 and JuliaLang/julia#36978, but I work around it.

@pdeffebach - this PR uncovered many corner cases, so a close look at what I propose to do would be welcome.

nalimilan

I guess I've been quite overoptimistic when I added this fast path... :-D

nalimilan · 2020-08-10T09:24:55Z

src/groupeddataframe/splitapplycombine.jl

@@ -761,16 +765,37 @@ end
 Reduce(f, condf=nothing, adjust=nothing) = Reduce(f, condf, adjust, false)

 check_aggregate(f::Any) = f


Could you combine validate_aggregate with check_aggregate? Basically, we have a fallback which returns f, and for some combinations of function and type we return an optimized object.

I agree. And I did it now, but it was much easier to handle them separately when developing changes :)

nalimilan · 2020-08-10T09:28:20Z

src/groupeddataframe/splitapplycombine.jl

+validate_aggregate(::typeof(prod∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = true
+
 check_aggregate(::typeof(maximum)) = Reduce(max)
+validate_aggregate(::typeof(maximum), ::AbstractVector{<:Union{Missing, Real}}) = true


I think these methods work for any type. We just have a faster path when isconcretetype(S) && hasmethod($initf, Tuple{S}), but we don't require typemin and typemax in general. In particular, we have code for CategoricalArray.

I think these methods work for any type.

I think we need this restriction:

julia> df = DataFrame(g=1, x=[[1,2,3], [1,2,3]])
2×2 DataFrame
│ Row │ g     │ x         │
│     │ Int64 │ Array…    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ [1, 2, 3] │
│ 2   │ 1     │ [1, 2, 3] │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 1 group based on key: g
First Group (2 rows): g = 1
│ Row │ g     │ x         │
│     │ Int64 │ Array…    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ [1, 2, 3] │
│ 2   │ 1     │ [1, 2, 3] │

julia> combine(gdf, :x => maximum)
1×2 DataFrame
│ Row │ g     │ x_maximum │
│     │ Int64 │ Array…    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ [1, 2, 3] │

julia> combine(gdf, :x => x -> maximum(x))
3×2 DataFrame
│ Row │ g     │ x_function │
│     │ Int64 │ Int64      │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 1          │
│ 2   │ 1     │ 2          │
│ 3   │ 1     │ 3          │

but I will make the signature looser.

Damn, again that multi-column issue... So yeah, the fast path needs to be disabled when the column can contain MULTI_COLS_TYPE or AbstractVector.
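The restriction discussed in this thread can be illustrated with a minimal sketch in plain Julia. The name `fastpath_ok` is hypothetical and is not the DataFrames.jl API; the sketch only shows how dispatch on the column's element type could keep a fast path away from columns whose cells may hold vectors:

```julia
# Hypothetical helper (an assumption for illustration, not DataFrames.jl code).
# Fallback: any function/column combination that was not vetted uses the slow path.
fastpath_ok(::Any, ::AbstractVector) = false

# Opt in only for element types where the reduction yields one scalar per group.
fastpath_ok(::typeof(maximum), ::AbstractVector{<:Union{Missing, Real}}) = true
fastpath_ok(::typeof(sum), ::AbstractVector{<:Union{Missing, Number}}) = true

fastpath_ok(maximum, [1, 2, 3])               # true
fastpath_ok(maximum, [[1, 2, 3], [1, 2, 3]])  # false: cells are vectors
fastpath_ok(maximum, Any[1, 2, 3])            # false: eltype Any may hide vectors
```

With a fallback that returns `false`, new reductions stay on the slow path until someone explicitly opts them in, which is the conservative direction for this kind of check.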

nalimilan · 2020-08-10T09:33:45Z

src/groupeddataframe/splitapplycombine.jl

+
 check_aggregate(::typeof(first)) = Aggregate(first)
+validate_aggregate(::typeof(first), v::AbstractVector) = eltype(v) === Any ? false : true
+validate_aggregate(::typeof(first), ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = false


Indeed this case is a bit ugly, as code that generates a single column when a data frame stores only scalars will suddenly create multiple columns when it stores e.g. named tuples. Of course this is a more general problem than first and optimized reductions. I wonder whether we should require explicitly mentioning that you want the result to be destructured into multiple columns, e.g. via :x => MultiCol(x -> ...). Without this, we basically consider that only scalars should be stored in data frames, or many things may break.

We can indeed discuss this; it relates to what @pdeffebach wants, namely that we allow destructuring into multiple columns.

Just to be clear the MULTI_COLS_TYPE part is not really problematic in my opinion - it just makes sure we throw an error (as we disallow it now), so in the future we can process it correctly. This is what we try to do consistently everywhere, so that post 1.0 changes here will be non-breaking, as they will currently throw an error (this is your pet trick to handle SemVer AFAICT 😄).

A more problematic case is AbstractVector, as it was simply inconsistent between the slow and fast paths (because the fast path was not expanding it, while the slow path does pseudo-broadcasting).

So there are actually two questions:

Do we want some MultiCol wrapper in the future, or do we just allow multiple columns to be returned and silently accept it? (I thought @pdeffebach wanted to silently accept this.)

Do we require opting in to pseudo-broadcasting? We never required it in the past, and it was the default; I think we should keep it.

Although this is a legacy from the very distant past. If I were designing it now, I would never expand anything, either in rows or in columns; I would just store one result per group in combine and then tell users to use flatten to flatten it. But this is a completely different design from what we have now, so this is just a side comment.

Actually I had forgotten that we don't allow returning MULTI_COLS_TYPE currently. So at least that's safe.

AbstractVector remains problematic though. I wonder whether there are strong use cases for pseudo-broadcasting. It would be safer to make this opt-in for 1.0, otherwise we don't allow working with data frames whose cells contain vectors.

(BTW, instead of MultiCol, maybe we could reuse AsTable, but to wrap the returned value this time.)

I have thought about it when cleaning up the rules of pseudo-broadcasting recently. Here is what we have on 0.21:

julia> df = DataFrame(g=[1,2,3], x=[1:3, 4:6, 7:9])
3×2 DataFrame
│ Row │ g     │ x        │
│     │ Int64 │ UnitRan… │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 1:3      │
│ 2   │ 2     │ 4:6      │
│ 3   │ 3     │ 7:9      │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 3 groups based on key: g
First Group (1 row): g = 1
│ Row │ g     │ x        │
│     │ Int64 │ UnitRan… │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 1:3      │
⋮
Last Group (1 row): g = 3
│ Row │ g     │ x        │
│     │ Int64 │ UnitRan… │
├─────┼───────┼──────────┤
│ 1   │ 3     │ 7:9      │

julia> combine(gdf, :x => first) # this is a wrong result and will be fixed by this PR to match what we have below
3×2 DataFrame
│ Row │ g     │ x_first  │
│     │ Int64 │ UnitRan… │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 1:3      │
│ 2   │ 2     │ 4:6      │
│ 3   │ 3     │ 7:9      │

julia> combine(gdf, :x => x -> first(x)) # this is the intended output
9×2 DataFrame
│ Row │ g     │ x_function │
│     │ Int64 │ Int64      │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 1          │
│ 2   │ 1     │ 2          │
│ 3   │ 1     │ 3          │
│ 4   │ 2     │ 4          │
│ 5   │ 2     │ 5          │
│ 6   │ 2     │ 6          │
│ 7   │ 3     │ 7          │
│ 8   │ 3     │ 8          │
│ 9   │ 3     │ 9          │

julia> combine(gdf, :x => Ref∘first) # this is a currently intended method to protect from unwrapping - just like in standard broadcasting
3×2 DataFrame
│ Row │ g     │ x_function │
│     │ Int64 │ UnitRange… │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 1:3        │
│ 2   │ 2     │ 4:6        │
│ 3   │ 3     │ 7:9        │

So in short, we do allow working with data frames that contain vectors:

normally vectors of vectors will be returned, and this is not a problem

if the user unwraps the vector (as above, with first), then Ref can be used, as in Base, to protect the result
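The analogy with standard broadcasting can be seen in plain Julia, without DataFrames.jl. This is only an illustration of how `Ref` protects a collection from being iterated by Base broadcasting; the variable name `r` is arbitrary:

```julia
r = 1:3

# Unprotected: `r` is broadcast element by element (2 in 1, 5 in 2, 3 in 3).
in.([2, 5, 3], r)       # [false, false, true]

# Protected: `Ref(r)` is treated as a scalar, so each value is tested against all of `r`.
in.([2, 5, 3], Ref(r))  # [true, false, true]
```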

Let's continue this discussion in a separate issue?

OK - can you please open the issue explaining what you would want to change? (given the three key considerations: 1) currently we unwrap vectors, 2) Ref protects, 3) in the future we want to add support for multiple columns passing)

nalimilan · 2020-08-10T09:45:24Z

src/groupeddataframe/splitapplycombine.jl

        initv = op(tmpv, tmpv)
-        x = adjust isa Nothing ? initv : adjust(initv, 1)
+        if adjust isa Nothing
+            x = Tnm <: AbstractIrrational ? float(initv) : initv


It's too bad that irrationals don't define zero and one so that zero(x) + zero(x) has the same type as x + x... I don't remember offhand why they return Bool but that may have to do with the fact that irrationals promote to whatever type they are combined with.

I have not designed it 😄, but actually it is inconsistent:

julia> zero(pi) + pi
3.141592653589793

and you get Float64. I have commented in JuliaLang/julia#36978 to keep track of it.
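The type mismatch described in this thread can be checked directly. This is a small sketch that assumes a Julia version where `zero` is defined for irrationals, as in the REPL session quoted in this thread:

```julia
typeof(zero(pi))             # Bool
typeof(zero(pi) + zero(pi))  # Int, because Bool + Bool promotes to Int
typeof(pi + pi)              # Float64
typeof(zero(pi) + pi)        # Float64
```

So an accumulator seeded from `zero(x)` does not have the type that `x + x` produces, which is why the diff above converts the initial value with `float` for `AbstractIrrational` element types.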


nalimilan · 2020-08-10T09:58:31Z

test/grouping.jl

+                Union{Missing,Number}[1, 1.5, missing], Any[1, 1.5, missing])
+        gdf = groupby_checked(DataFrame(g=[1, 1, 1], x=col), :g)
+        if fun isa typeof(last∘skipmissing)
+            # this is another hard corner case


Why this special case? last(skipmissing(x)) doesn't work outside DataFrames (maybe it should though), and it doesn't seem to work with any type in DataFrames either.

This is a special case because it works in the fast path but fails in the slow path. I have changed the comment.

Hmm, it works only with GroupedDataFrame input, not with a DataFrame input. I don't get why.

Also, I thought the goal of this PR was to make the slow and fast path behave the same, so shouldn't this always throw an error whatever the eltype?

I did not see the benefit of last∘skipmissing throwing an error in cases when we can efficiently produce a correct result. The fact that last∘skipmissing errors in Base is only due to the O(1) restriction in the last docstring; e.g. first∘skipmissing works in Base because its docstring does not require O(1). I think we should change Base so that last∘skipmissing works.

it works only with GroupedDataFrame input, not with a DataFrame input. I don't get why.

It should work in combine both for GroupedDataFrame and DataFrame and with select on GroupedDataFrame (and pseudo-broadcasting is applied in this case). It is expected to fail for select on DataFrame. Do you observe a different behaviour?

Yeah, last(skipmissing(x)) and lastindex(skipmissing(x)) should probably be defined when x is an AbstractArray, since they don't require going over the whole collection.
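As a rough illustration of why this does not require going over the whole collection, the last non-missing element of a vector can be found by searching backwards from the end. The helper name `last_nonmissing` is hypothetical; it is not part of Base or DataFrames.jl:

```julia
# Hypothetical helper (an assumption for illustration only).
# `findlast` searches backwards from the last index, so it stops at the
# first non-missing element it meets when coming from the end.
function last_nonmissing(x::AbstractVector)
    i = findlast(!ismissing, x)
    i === nothing && throw(ArgumentError("collection has no non-missing elements"))
    return x[i]
end

last_nonmissing([1, missing, 3, missing])  # 3
first(skipmissing([missing, 2, 3]))        # 2, the forward counterpart in Base
```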

It should work in combine both for GroupedDataFrame and DataFrame and with select on GroupedDataFrame (and pseudo-broadcasting is applied in this case). It is expected to fail for select on DataFrame. Do you observe a different behaviour?

I get this

julia> df = DataFrame(x=[1])
1×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

julia> combine(df, :x => last ∘ skipmissing)
ERROR: MethodError: no method matching lastindex(::Base.SkipMissing{Array{Int64,1}})

Yeah, I had forgotten about this path. So you have:

julia> df = DataFrame(x=[1])
1×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

julia> combine(df, :x => last ∘ skipmissing)
ERROR: MethodError: no method matching lastindex(::Base.SkipMissing{Array{Int64,1}})

julia> combine(:x => last ∘ skipmissing, df)
1×1 DataFrame
│ Row │ x_last_skipmissing │
│     │ Int64              │
├─────┼────────────────────┤
│ 1   │ 1                  │

which is unfortunate.

I would not touch it now, though, as it is not easy to fix.

In the long term we should rewrite the whole engine for select/transform/combine to be unified. Unfortunately, I designed it in stages, and when initially implementing select and transform we did not envision the full ecosystem that we have now, so we essentially have two processing engines: one for combine on GroupedDataFrame and the other for select on DataFrame.

This should be implemented in a way that select/transform/combine on DataFrame are always essentially calls to GroupedDataFrame on a group formed by no columns (single group). Additional things to remember when implementing it:

add better support for 0 groups

add possibility for a group to have zero rows

add requested extensions to the syntax of source_columns => function => target_col_name

OK. No hurry, maybe just copy your comment to an issue to keep your analysis available.

All this is tracked in separate issues.


Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2020-08-10T19:00:10Z

And adding the checks you required caught another inconsistency, this time in std for Rational :)

bkamins · 2020-08-10T22:05:50Z

Tests have shown another bug in the fast path on Julia 1.0. I handle it through type piracy, but I think it is OK, as the method is present in current releases of Julia, and it will be removed when Julia 1.6 becomes LTS.

src/groupeddataframe/splitapplycombine.jl

nalimilan · 2020-08-11T12:51:37Z

src/groupeddataframe/splitapplycombine.jl


nalimilan · 2020-08-11T13:03:12Z

Maybe it would be simpler to disable the fast path for Irrational each time we would need to special-case it? Just a thought.

src/groupeddataframe/splitapplycombine.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2020-08-12T15:26:11Z

would be simpler to disable the fast path for Irrational each time we would need to special-case it?

I can do it if you prefer. It will be a bit simpler code, but actually more lines of code (as I have to add special methods for this case, currently this is just 2 additional conditions)

nalimilan · 2020-08-13T07:58:03Z

I can do it if you prefer. It will be a bit simpler code, but actually more lines of code (as I have to add special methods for this case, currently this is just 2 additional conditions)

Whatever you think is the cleanest.

bkamins · 2020-08-13T18:26:57Z

Thank you!

fix isagg to correctly use a fast path

dad6118

bkamins added bug priority breaking The proposed change is breaking. labels Aug 9, 2020

bkamins added this to the 1.0 milestone Aug 9, 2020

bkamins mentioned this pull request Aug 9, 2020

[WIP] Started working on fixing fastpath combine #2352

Closed

nalimilan reviewed Aug 10, 2020


bkamins and others added 4 commits August 10, 2020 20:11

Apply suggestions from code review

d11e358

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

fixes after code review - first pass

bca411d

Merge remote-tracking branch 'origin/fix_isagg' into fix_isagg

f481296

add missing f

e5ac5bb

bkamins added 2 commits August 10, 2020 21:00

fix some additional cases

063a5e4

fix Julia 1.0

695b237

nalimilan reviewed Aug 11, 2020


Apply suggestions from code review

90d79a7

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

nalimilan approved these changes Aug 13, 2020


bkamins changed the title ~~fix isagg to correctly use a fast path~~ [BREAKING] fix isagg to correctly use a fast path Aug 13, 2020

bkamins merged commit 4c601bc into JuliaData:master Aug 13, 2020

bkamins deleted the fix_isagg branch August 13, 2020 18:26

JuliaRegistrator mentioned this pull request Nov 15, 2020

New version: DataFrames v0.22.0 JuliaRegistries/General#24650

Merged

nalimilan mentioned this pull request Sep 25, 2022

Avoid method dispatch ambiguities in DataFrames.jl #3179

Merged
