GH-35166: [C++][Compute] Increase precision of decimals in sum aggregates #44184
zanmato1984 merged 38 commits into apache:main
Conversation
@zeroshade I noticed in #43957 that you were adding in Decimal32/64 types, which I think will have the same problem that this PR addresses. I was curious if you might have interest in reviewing this PR?
@khwilson Sure thing, I'll try to take a look at this in the next day or so
Hi @zeroshade, just checking in! Thanks again for taking a look.
mapleFU left a comment
This method is interesting; however, before doing that, do you think a user-side "cast" is OK?
Like:
cast(origin to decimal(large-enough)) then avg
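Not part of the PR, but a minimal pyarrow sketch of what that user-side cast might look like; the data and the target precision here are made up for illustration:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Hypothetical input: two-digit decimals whose sum needs three digits.
a = pa.array([99, 99]).cast(pa.decimal128(2, 0))

# Widen the precision before aggregating so the result has room to grow;
# the aggregate's type then follows the widened input type.
widened = a.cast(pa.decimal128(38, 0))
print(pc.sum(widened))  # decimal128(38, 0) scalar holding 198
```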
Thanks for the review! By a user-side cast, do you mean that users should essentially do select avg(cast(blah as decimal(big-precision))) instead of select avg(blah), or do you mean that this code should "inject" a cast on the "user" side? If you mean putting the cast onto the user, then I would think you'd want to add an error if the answer can't fit into the default precision, but that seems like it would be more disruptive (and out of step with how other systems handle decimal aggregates). If you mean "injecting" the cast on the user side, would that end up creating a copy of the array?
No problem! Hope your travels were fun!
Generally this method is OK for me, but I'm not so familiar with the "common solutions" here. I'll dive into Presto/ClickHouse to see the common pattern.
I enumerated several here: #35166 (comment). ClickHouse, for instance, just ignores precision.
Would you mind making this Ready for review?
Sure!
@mapleFU I believe this is done now. Some notes on the diff:
And a note that quite a few tests are failing for what appears to be the same reason as #41390. Happy to address them if you'd like.
I'm lukewarm about the approach here. Silently casting to the max precision discards metadata about the input; it also risks producing errors further down the line (if e.g. the max precision is deemed too large for other operations). It also doesn't automatically eliminate any potential overflow, for example:

```python
>>> a = pa.array([789.3] * 20).cast(pa.decimal128(38, 35))
>>> a
<pyarrow.lib.Decimal128Array object at 0x7f0f103ca7a0>
[
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440,
789.29999999999995452526491135358810440
]
>>> pc.sum(a)
<pyarrow.Decimal128Scalar: Decimal('-1228.11834604692408266343214451664848480')>
```

We should instead check that the result of an aggregate fits into the resulting Decimal type, while overflows currently pass silently:

```python
>>> a = pa.array([123., 456., 789.]).cast(pa.decimal128(4, 1))
>>> a
<pyarrow.lib.Decimal128Array object at 0x7f0ed06261a0>
[
123.0,
456.0,
789.0
]
>>> pc.sum(a)
<pyarrow.Decimal128Scalar: Decimal('1368.0')>
>>> pc.sum(a).validate(full=True)
Traceback (most recent call last):
...
ArrowInvalid: Decimal value 13680 does not fit in precision of decimal128(4, 1)
```
Two problems with just validating afterward: First, I'd expect the validation to fail in quite reasonable cases. A sum of 1M decimals of approximately the same magnitude should need about 6 more digits of precision (roughly log10(N) extra digits for N values of similar size). I assume this is why all the DBMSs I looked at increase the precision by default. Second, just checking for overflow doesn't solve the underlying problem. Consider:

```python
a = pa.array([789.3] * 18).cast(pa.decimal128(38, 35))
print(pc.sum(a))
pc.sum(a).validate(full=True)  # passes
```

In duckdb, they implement an intermediate check to make sure that there's not an internal overflow:

```python
tab = pa.Table.from_pydict({"a": a})
duckdb.query("select sum(a) from tab")
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# duckdb.duckdb.OutOfRangeException: Out of Range Error: Overflow in HUGEINT addition:
# 157859999999999990905052982270717620880 + 78929999999999995452526491135358810440
```

Notably, this lack of overflow checking also applies to integer sums in arrow:

```python
>>> pa.array([9223372036854775800] * 2, type=pa.int64())
<pyarrow.lib.Int64Array object at 0x10c1d8b80>
[
9223372036854775800,
9223372036854775800
]
>>> pc.sum(pa.array([9223372036854775800] * 2, type=pa.int64()))
<pyarrow.Int64Scalar: -16>
>>> pc.sum(pa.array([9223372036854775800] * 2, type=pa.int64())).validate(full=True)
```
It obviously depends on whether all the decimals are of the same sign, and on their actual magnitude.
In the example above, I used a decimal128(38, 35), which leaves very little headroom before overflowing.
Yes, and there's already a bug open for it: #37090
Nice! I'm excited for the checked variants of sum and product! With the integer overflow example, I only meant to point out that the compute module currently allows overflows, so I think it would be unexpected for decimal sums to behave differently. Still, I do think that users would find it unexpected to get an error if the sum fits in the underlying storage, since this is how all the databases I've used (and the four I surveyed in #35166) have operated.
OK, all in all:
Would this change be part of Arrow 21?
@pitrou did you get a chance to look at this again? |
pitrou left a comment
LGTM on the principle. I posted a couple of comments that you might want to act upon; I'll let @zanmato1984 make the final call.
CI is good, merging. Thanks a lot @khwilson for contributing this! @wirable23 Please wait for our release manager's response about inclusion in 21.0.0. Thanks.
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit bf56a95. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Rationale for this change
As documented in #35166, when Arrow performs a sum, product, or mean of an array of type decimalX(P, S), it returns a scalar of type decimalX(P, S). This is true even if the aggregate does not fit in the specified precision. For instance, a sum of two decimal128(1, 0)'s such as 1 + 9 is a decimal128(2, 0), yet the returned scalar keeps the input type (in Python, see the sketch below). This is recognized in the rules for binary addition and multiplication of decimals (see footnote 1 in this section), but it does not apply to array aggregates.
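The original snippet that illustrated this appears to have been dropped during formatting; the following is a minimal reconstruction of the pre-change behavior (assuming pyarrow imported as pa and pyarrow.compute as pc), not necessarily the author's exact code:

```python
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, 9]).cast(pa.decimal128(1, 0))

s = pc.sum(a)
print(s.type)  # decimal128(1, 0): the input type, even though 10 needs two digits
# Before this change, s.validate(full=True) therefore raises ArrowInvalid.
```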
In #35166 I did a bit of research following a question from @westonpace, and it seems that there's no standard approach to this across DBMSs, but a common solution is to set the precision of the result of a sum to the maximum possible precision of the underlying type. That is, a sum of decimal128(1, 0)'s becomes a decimal128(38, 0). However, products and means differ further. For instance, in both cases DuckDB converts the decimal to a double, which makes sense, as the precision of the product of an array of decimals would likely be huge; e.g., the product of an array of N decimals of precision 2 could need on the order of 2N digits of precision.
This PR implements the minimum possible change: replace the return type of the sum aggregate of decimal128(P, S) with decimal128(38, S), decimal256(P, S) with decimal256(76, S), decimal32(P, S) with decimal32(9, S), and decimal64(P, S) with decimal64(18, S).
What changes are included in this PR?
Are these changes tested?
They are tested in the following implementations:
Are there any user-facing changes?
Yes. This changes the return type of a scalar aggregate of decimals.
This PR includes breaking changes to public APIs.
Specifically, the return type of a scalar aggregate of decimals changes. This is unlikely to break downstream applications as the underlying data has not changed, but if an application relies on the (incorrect!) type information for some reason, it would break.
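For illustration, a rough sketch of the type change downstream code would observe; the values are made up and the expected types are shown in comments:

```python
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, 9]).cast(pa.decimal128(3, 1))

s = pc.sum(a)
print(s.type)
# Before this PR: decimal128(3, 1) -- the input type.
# After this PR:  decimal128(38, 1) -- maximum precision, same scale.
```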