GH-35166: [C++][Compute] Increase precision of decimals in sum aggregates by khwilson · Pull Request #44184 · apache/arrow

khwilson · 2024-09-23T02:05:02Z

Rationale for this change

As documented in #35166, when Arrow performs a sum, product, or mean of an array of type decimalX(P, S), it returns a scalar of type decimalX(P, S). This is true even if the aggregate does not fit in the specified precision. For instance, a sum of two decimal128(1, 0)'s such as 1 + 9 is a decimal128(2, 0). But (in Python):

import pyarrow as pa
from decimal import Decimal

arr = pa.array([Decimal("1"), Decimal("9")], type=pa.decimal128(1, 0))
assert arr.sum().type == pa.decimal128(1, 0)

This is recognized in the rules for binary addition and multiplication of decimals (see footnote 1 in this section), but this does not apply to array aggregates.

In #35166 I did a bit of research following a question from @westonpace , and it seems that there's no standard approach to this across DBMS's, but a common solution is to set the precision of the result of a sum to the maximum possible precision of the underlying type. That is, a sum of decimal128(1, 0)'s becomes a decimal128(38, 0).

However, products and means differ further. For instance, in both instances, duckdb converts a decimal to a double, which makes sense as the precision of the product of an array of decimals would likely be huge, e.g., an array of size N with precision 2 decimals would have precision at least 2^N.

This PR implements the minimum possible change: replace the return types of the sum aggregate of decimal128(P, S) to decimal128(38, S), decimal256(P, S) to decimal256(76, S), decimal32(P, S) to decimal32(9, S), and decimal64(P, S) to decimal64(18, S).

What changes are included in this PR?

Update C++ kernels to support the change
Update docs to reflect the change
Fix tests in the languages that depend on the C++ engine
Determine if there are other languages which do not depend on the C++ engine which should also be updated

Are these changes tested?

They are tested in the following implementations:

Python
C++

Are there any user-facing changes?

Yes. This changes the return type of a scalar aggregate of decimals.

This PR includes breaking changes to public APIs.

Specifically, the return type of a scalar aggregate of decimals changes. This is unlikely to break downstream applications as the underlying data has not changed, but if an application relies on the (incorrect!) type information for some reason, it would break.

GitHub Issue: pa.compute.sum result for decimal128 doesn't fit into precision/scale #35166

github-actions · 2024-09-23T02:05:29Z

⚠️ GitHub issue #35166 has been automatically assigned in GitHub to PR creator.

khwilson · 2024-09-30T14:40:16Z

@zeroshade I noticed in #43957 that you were adding in Decimal32/64 types, which I think will have the same problem that this PR addresses. I was curious if you might have interest in reviewing this PR?

zeroshade · 2024-09-30T15:29:25Z

@khwilson Sure thing, i'll try to take a look at this in the next day or so

khwilson · 2024-10-08T23:43:27Z

Hi @zeroshade just checking in! Thanks again for taking a look

mapleFU

This method is interesting, however, before doing that, do you think a user-side "cast" is ok?
Like:

cast(origin to decimal(large-enough)) then avg

khwilson · 2024-10-09T13:49:01Z

Thanks for the review!

By a user-side cast, do you mean that users should essentially do:

select avg(cast(blah as decimal(big-precision)))

instead of

select avg(blah)

or do you mean that this code should "inject" a cast on the "user" side?

If you mean putting the cast onto the user, then I would think you'd want to add an error if the answer can't fit into the default precision, but that seems like it would be more disruptive (and out of step with how other systems handle decimal aggregates).

If you mean "injecting" the cast on the user side, would that end up creating a copy of the array?

zeroshade · 2024-10-09T16:51:29Z

@khwilson Hey sorry for the delay from me here, I've been traveling a lot lately for work and have been at ASF Community Over Code this week. I promise i'll get to this soon. In the meantime, you're in the very capable hands of @mapleFU

khwilson · 2024-10-09T18:48:59Z

No problem! Hope your travels were fun!

mapleFU · 2024-10-11T04:57:02Z

Generally this method is ok for me, but I'm not so familiar with the "common solutions" here. I'll dive into Presto/ClickHouse to see the common pattern here

khwilson · 2024-10-11T12:27:21Z

I enumerated several here: #35166 (comment)

Clickhouse for instance just ignores precision.

mapleFU · 2024-10-11T12:51:34Z

Would you mind making this Ready for review?

khwilson · 2024-10-11T13:33:52Z

Sure!

khwilson · 2024-10-13T20:33:47Z

@mapleFU I believe this is done now. Some notes on the diff:

The hash aggregates had to be updated as well (missed them in the first pass)
I've also added in Decimal32/64 support the basic aggregates (sum, product, mean, min/max, index). However, there's quite a lot of missing support for these types still in compute (most notably in casts)
Docs are updated to reflect the change

And a note that quite a few tests are failing for what appears to be the same reason as #41390. Happy to address them if you'd like.

mapleFU

Also cc @pitrou @bkietz

pitrou · 2024-10-15T16:05:57Z

I'm lukewarm about the approach here. Silently casting to the max precision discards metadata about the input; it also risks producing errors further down the line (if e.g. the max precision is deemed too large for other operations). It also doesn't automatically eliminate any potential overflow, for example:

>>> a = pa.array([789.3] * 20).cast(pa.decimal128(38, 35))
>>> a
<pyarrow.lib.Decimal128Array object at 0x7f0f103ca7a0>
[
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440,
  789.29999999999995452526491135358810440
]
>>> pc.sum(a)
<pyarrow.Decimal128Scalar: Decimal('-1228.11834604692408266343214451664848480')>

We should instead check that the result of an aggregate fits into the resulting Decimal type, while overflows currently pass silently:

>>> a = pa.array([123., 456., 789.]).cast(pa.decimal128(4, 1))
>>> a
<pyarrow.lib.Decimal128Array object at 0x7f0ed06261a0>
[
  123.0,
  456.0,
  789.0
]
>>> pc.sum(a)
<pyarrow.Decimal128Scalar: Decimal('1368.0')>
>>> pc.sum(a).validate(full=True)
Traceback (most recent call last):
  ...
ArrowInvalid: Decimal value 13680 does not fit in precision of decimal128(4, 1)

khwilson · 2024-10-15T19:24:46Z

Two problems with just validating afterward: First, I'd expect in reasonable cases for the validation to fail. A sum of 1m decimals of approximately the same size you'd expect to have 6 more digits of precision. I assume this is why all the DBMSs I looked at increase the precision by default.

Second, just checking for overflow doesn't solve the underlying problem. Consider:

a = pa.array([789.3] * 18).cast(pa.decimal128(38, 35))
print(pc.sum(a))
pc.sum(a).validate(full=True)  # passes

In duckdb, they implement an intermediate check to make sure that there's not an internal overflow:

tab = pa.Table.from_pydict({"a": a})
duckdb.query("select sum(a) from tab")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# duckdb.duckdb.OutOfRangeException: Out of Range Error: Overflow in HUGEINT addition: 
# 157859999999999990905052982270717620880 + 78929999999999995452526491135358810440

Notably, this lack of overflow checking also applies to integer sums in arrow:

>>> pa.array([9223372036854775800] * 2, type=pa.int64())
<pyarrow.lib.Int64Array object at 0x10c1d8b80>
[
  9223372036854775800,
  9223372036854775800
]
>>> pc.sum(pa.array([9223372036854775800] * 2, type=pa.int64()))
<pyarrow.Int64Scalar: -16>
>>> pc.sum(pa.array([9223372036854775800] * 2, type=pa.int64())).validate(full=True)

pitrou · 2024-10-15T20:57:22Z

Two problems with just validating afterward: First, I'd expect in reasonable cases for the validation to fail. A sum of 1m decimals of approximately the same size you'd expect to have 6 more digits of precision.

It depends obviously if all decimals are of the same sign, and what their actual magnitude is.

Second, just checking for overflow doesn't solve the underlying problem.

In the example above, I used a validate call simply to show that the result was indeed erroneous. I didn't mean we should actually call validation afterwards. We should instead check for overflow at each individual aggregation step (for each add or multiply, for example). This is required even if we were to bump the result's precision to the max.

pitrou · 2024-10-15T20:58:57Z

Notably, this lack of overflow checking also applies to integer sums in arrow:

Yes, and there's already a bug open for it: #37090

khwilson · 2024-10-16T02:18:54Z

Nice! I'm excited for the checked variants of sum and product!

With the integer overflow example, I only meant to point out that the compute module currently allows overflows, so I think it would be unexpected for sum to complain about an overflow only if the underlying type was a decimal. But if the goal with #37090 is to replace sum with a checked version, then the solution of erroring makes a lot of sense, and I'd be happy to implement it when #37536 gets merged. :-)

Still, I do think that users would find it unexpected to get an error if the sum fit in the underlying storage since this is how all the databases I've used (and the four I surveyed in #35166) have operated.

khwilson · 2024-11-23T17:04:20Z

Hi @pitrou sorry for dropping the ball on this a bit earlier. Are you still interested in merging this at some point? I'm happy to explore some alternative options for how to do the decimal expansion, or alternatively am happy to push #37536 forward

Co-authored-by: Antoine Pitrou <pitrou@free.fr>

khwilson · 2025-06-15T18:33:18Z

OK, all in all:

@pitrou removed one of pc.sum and arr.sum in the python test
Added an example of an overflow in the sum tests
Made a small modification to the docs to clarify that current changes apply only to sums.
Rebased on main as there was some code migration

wirable23 · 2025-06-29T12:06:57Z

Would this change be part of arrow 21?

zanmato1984 · 2025-07-02T13:48:19Z

@pitrou did you get a chance to look at this again?

pitrou

LGTM on the principle, I posted a couple comments that you might want to act upon. I'll let @zanmato1984 make the final call.

zanmato1984 · 2025-07-03T08:47:59Z

I pushed several commits to address the latest comments from @pitrou and me. Those are quite trivial so I'll merge after CI is green.

@amoeba do you think we can include this in the 21.0.0 release?

zanmato1984 · 2025-07-03T09:54:52Z

CI is good, merging.

Thanks a lot @khwilson for contributing this!

@wirable23 Please wait for our release manager's response about into 21.0.0. Thanks.

conbench-apache-arrow · 2025-07-03T15:48:41Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit bf56a95.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

github-actions Bot added Component: C++ Component: Python awaiting review Awaiting review labels Sep 23, 2024

mapleFU reviewed Oct 9, 2024

View reviewed changes

Comment thread cpp/src/arrow/compute/kernels/codegen_internal.h Outdated

Comment thread cpp/src/arrow/compute/kernels/codegen_internal.cc Outdated

Comment thread cpp/src/arrow/compute/kernels/codegen_internal.h Outdated

github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 9, 2024

khwilson force-pushed the increase-precision-of-decimals branch from 89f1ae9 to 01e53b3 Compare October 13, 2024 17:03

khwilson marked this pull request as ready for review October 13, 2024 20:23

khwilson requested a review from westonpace as a code owner October 13, 2024 20:23

mapleFU reviewed Oct 14, 2024

View reviewed changes

Comment thread cpp/src/arrow/compute/kernels/codegen_internal.h Outdated

github-actions Bot added the Component: Documentation label Oct 14, 2024

khwilson mentioned this pull request Apr 23, 2025

[Python] Error thrown when multiplying decimal numbers #43252

Open

zanmato1984 and others added 7 commits June 15, 2025 14:20

Format

6e0551e

Update cpp/src/arrow/compute/kernels/aggregate_test.cc

8e3da1f

Co-authored-by: Antoine Pitrou <pitrou@free.fr>

Update docs/source/cpp/compute.rst

bdc4dbf

Co-authored-by: Antoine Pitrou <pitrou@free.fr>

remove extraneous template argument

55a2f46

format

432c57d

add overflow test

be4a043

fix merge conflict with main

7502951

khwilson force-pushed the increase-precision-of-decimals branch from 74161e2 to 7502951 Compare June 15, 2025 18:21

khwilson added 2 commits June 15, 2025 14:29

update submodule

540fb19

slight doc modification

5b666f8

pitrou reviewed Jul 2, 2025

View reviewed changes

Comment thread cpp/src/arrow/compute/kernels/hash_aggregate_numeric.cc Outdated

pitrou reviewed Jul 2, 2025

View reviewed changes

Comment thread docs/source/cpp/compute.rst Outdated

pitrou reviewed Jul 2, 2025

View reviewed changes

Comment thread docs/source/cpp/compute.rst

pitrou approved these changes Jul 2, 2025

View reviewed changes

zanmato1984 reviewed Jul 3, 2025

View reviewed changes

Comment thread docs/source/cpp/compute.rst Outdated

zanmato1984 added 4 commits July 3, 2025 14:19

Update docs/source/cpp/compute.rst

cc20e96

Use // to replace ///

8a0acde

Refine bullet points for decimal promotion for sum in doc

8d7e1b3

Use // to replace ///

d97e8b2

zanmato1984 changed the title ~~GH-35166: [C++] Increase precision of decimals in sum aggregates~~ GH-35166: [C++][Compute] Increase precision of decimals in sum aggregates Jul 3, 2025

zanmato1984 merged commit bf56a95 into apache:main Jul 3, 2025
39 checks passed

zanmato1984 removed the awaiting committer review Awaiting committer review label Jul 3, 2025

zanmato1984 mentioned this pull request Jul 3, 2025

pa.compute.sum result for decimal128 doesn't fit into precision/scale #35166

Closed

jorisvandenbossche mentioned this pull request Jul 7, 2025

TST: update expected dtype for sum of decimals with pyarrow 21+ pandas-dev/pandas#61799

Merged

Conversation

khwilson commented Sep 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

github-actions Bot commented Sep 23, 2024

Uh oh!

khwilson commented Sep 30, 2024

Uh oh!

zeroshade commented Sep 30, 2024

Uh oh!

khwilson commented Oct 8, 2024

Uh oh!

mapleFU left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

khwilson commented Oct 9, 2024

Uh oh!

zeroshade commented Oct 9, 2024

Uh oh!

khwilson commented Oct 9, 2024

Uh oh!

mapleFU commented Oct 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

khwilson commented Oct 11, 2024

Uh oh!

mapleFU commented Oct 11, 2024

Uh oh!

khwilson commented Oct 11, 2024

Uh oh!

khwilson commented Oct 13, 2024

Uh oh!

mapleFU left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pitrou commented Oct 15, 2024

Uh oh!

khwilson commented Oct 15, 2024

Uh oh!

pitrou commented Oct 15, 2024

Uh oh!

pitrou commented Oct 15, 2024

Uh oh!

khwilson commented Oct 16, 2024

Uh oh!

khwilson commented Nov 23, 2024

Uh oh!

khwilson commented Jun 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wirable23 commented Jun 29, 2025

Uh oh!

zanmato1984 commented Jul 2, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pitrou left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

zanmato1984 commented Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zanmato1984 commented Jul 3, 2025

Uh oh!

Uh oh!

conbench-apache-arrow Bot commented Jul 3, 2025

Uh oh!

Reviewers

Assignees

Labels

khwilson commented Sep 23, 2024 •

edited

Loading

mapleFU commented Oct 11, 2024 •

edited

Loading

khwilson commented Jun 15, 2025 •

edited

Loading

zanmato1984 commented Jul 3, 2025 •

edited

Loading