
Optimize MULMOD worst cases#10088

Merged

siladu merged 8 commits into besu-eth:main from siladu:opt-mulmod-wide-operands-fast-path on Mar 30, 2026

Conversation

@siladu commented Mar 23, 2026 (Contributor)

Optimize MULMOD worst cases via multiply-then-reduce (M-R) for UInt192 and UInt128

Replace reduce-multiply-reduce (R-M-R) with multiply-then-reduce (M-R) in UInt192.mul and UInt128.mul when at least one input exceeds the modulus width.

Previously, when either operand didn't fit in the modulus, both were pre-reduced before multiplication (R-M-R). Instead, compute the full 256×256 product first, then apply a single modular reduction (M-R).

New infrastructure:

  • UInt192.modReduceNormalised(UInt512, shift, inv) + slow-path for UInt576
  • UInt128.modReduceNormalised(UInt512, shift, inv) + slow-path for UInt576

Both UInt192.mul and UInt128.mul still have two branches, but both use M-R:

  1. Fast path: both inputs fit in modulus width - use narrow multiply + reduce
  2. Otherwise: full mul256 → UInt512 → single reduce
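The correctness of swapping R-M-R for M-R rests on the identity (a·b) mod m = ((a mod m)·(b mod m)) mod m; only the cost profile differs. A minimal BigInteger sketch of the two strategies (the method names are illustrative, not Besu's API — with fixed-width limbs, M-R trades two pre-reductions for one wider 512-bit reduction):

```java
import java.math.BigInteger;

// Illustrative BigInteger sketch; mulModMR/mulModRMR are made-up names,
// not Besu's API.
public class MulModStrategies {
  // M-R: one full-width multiply, then a single reduction.
  static BigInteger mulModMR(BigInteger a, BigInteger b, BigInteger m) {
    return a.multiply(b).mod(m);
  }

  // R-M-R: pre-reduce both operands, multiply, then reduce again.
  static BigInteger mulModRMR(BigInteger a, BigInteger b, BigInteger m) {
    return a.mod(m).multiply(b.mod(m)).mod(m);
  }

  public static void main(String[] args) {
    BigInteger a = BigInteger.ONE.shiftLeft(255).add(BigInteger.valueOf(17)); // exceeds 192 bits
    BigInteger b = BigInteger.ONE.shiftLeft(200).add(BigInteger.valueOf(3));  // exceeds 192 bits
    BigInteger m = BigInteger.ONE.shiftLeft(191).subtract(BigInteger.ONE);    // fits UInt192
    System.out.println(mulModMR(a, b, m).equals(mulModRMR(a, b, m))); // true
  }
}
```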

Also adds JMH benchmark cases for MULMOD_256_64_128 and MULMOD_256_64_192 to cover mixed-width scenarios.

Testing

Benchmarks

JMH

./gradlew --no-daemon :ethereum:core:jmh -Pincludes=MulMod
(run on m6a.2xlarge)

Key observations from the data:

  • Worst case improved from 34 -> 44 MGas/s
  • Big wins: MULMOD_256_256_192 +23.6% ns / +30.8% MGas, MULMOD_256_256_128 +17.1% / +20.7%, MULMOD_192_192_128 +12.8% / +14.7%, MULMOD_256_192_128 +9.3% / +10.3%
  • Notable wins: MULMOD_192_128_128 +8.5%, MULMOD_256_128_128 +3.2%
  • Small regressions: MULMOD_64_64_64 -4.2%, MULMOD_64_64_32 -0.7% - maybe overhead of the fast-path check in small cases
  • MULMOD_256_256_256: essentially neutral (0.01%), which makes sense as the result fits 256 bits so the existing path already handles it well
  • FULL_RANDOM: +2.5% overall improvement
[screenshot: opt-mulmod-benchmark-comparison]

mulmod_benchmark_comparison.html

evmtool block-test

bits_127: 59.47 -> 66.55 MGas/s
bits_191: 57.13 -> 65.36 MGas/s
bits_63: 103.21 -> 103.64 MGas/s
bits_255: 62.38 -> 62.16 MGas/s

#10088 (comment)

…2 and UInt128

Replace reduce-multiply-reduce (R-M-R) with multiply-then-reduce (M-R) in UInt192.mul and UInt128.mul when at least one input exceeds the modulus width.

Previously, when either operand didn't fit in the modulus, both were pre-reduced before multiplication (R-M-R). Instead, compute the full 256×256 product first, then apply a single modular reduction (M-R).

New infrastructure:
- UInt192.modReduceNormalised(UInt512, shift, inv) + slow-path for UInt576
- UInt128.modReduceNormalised(UInt512, shift, inv) + slow-path for UInt576

Both UInt192.mul and UInt128.mul still have two branches, but both use M-R:
1. Fast path: both inputs fit in modulus width - use narrow multiply + reduce
2. Otherwise: full mul256 → UInt512 → single reduce

Benchmark results (mac):
  MULMOD_256_256_192:  150ns → 100ns  (-50ns)
  MULMOD_256_256_128:  155ns → 104ns  (-51ns)
  MULMOD_256_192_192:  114ns → 110ns  ( -4ns)
  MULMOD_256_192_128:  128ns →  86ns  (-42ns)
  MULMOD_256_128_128:  101ns →  68ns  (-33ns)
  MULMOD_256_64_192:    82ns →  64ns  (-18ns)

Also adds JMH benchmark cases for MULMOD_256_64_128 and MULMOD_256_64_192 to cover mixed-width scenarios.

Signed-off-by: Simon Dudley <simon.dudley@consensys.net>

private UInt256 modReduceNormalised(final UInt512 that, final int shift, final long inv) {
UInt576 v = that.shiftLeftWide(shift);
return modReduceNormalisedSlowPath(v, shift, inv);
Contributor:

you are executing the slowPath regardless so that just becomes the normal path? What is the difference?

Contributor Author:

If its the name that's confusing, could you suggest a better name?
Maybe modReduceNormalisedWide might work?

The advantage of keeping the "slow path" is that it is in keeping with the rest of the file: this is the path that ends up maximising the reduction steps.
But I agree it is a little odd that in this narrow context, there is no "fast path".

Contributor Author:

Added a comment about why there is no fast path: ba4c7d6

@lu-pinto commented Mar 28, 2026 (Contributor):

I would be fine with just collapsing the method into the other; there's no benefit in one calling the other if there's no performance impact. You just figured out that there's no fast path, so that's fine.

Contributor Author:

Inlined these two methods and removed the comment 540a0e6



private UInt256 modReduceNormalisedSlowPath(final UInt576 v, final int shift, final long inv) {
QR192 qr;
if (v.u8 != 0 || Long.compareUnsigned(v.u7, u2) >= 0) {
@lu-pinto commented Mar 23, 2026 (Contributor):

Are you sure you don't need to address the case when Long.compareUnsigned(v.u8, u2) >= 0? A few bugs were caused by the lack of this branch in some types.

Contributor Author:

According to Claude, v.u8 < u2 always holds (v.u8 < 2^63, while u2 >= 2^63 after normalization)

Contributor Author:

The normalization invariant guarantees v.u8 < u2 always: the shifted product's top limb is u7 >>> invShift where invShift >= 1, giving v.u8 < 2^63, while the normalized modulus top limb has MSB set, so u2 >= 2^63. The reduceStep precondition is always satisfied.
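The invariant can be demonstrated with plain long arithmetic (the limb values below are hypothetical; the Besu limb names are only echoed in comments):

```java
// Sketch of the normalization invariant argued above, using made-up limb values.
public class NormalisationInvariant {
  public static void main(String[] args) {
    long u2 = 0x000000FFFFFFFFFFL;             // hypothetical nonzero top limb of the modulus
    int shift = Long.numberOfLeadingZeros(u2); // normalization shift
    long normTop = u2 << shift;                // normalized top limb: MSB now set
    // MSB set means the unsigned value is >= 2^63
    System.out.println(Long.compareUnsigned(normTop, 1L << 63) >= 0); // true

    long u7 = 0xFFFFFFFFFFFFFFFFL;             // worst-case top limb of the 512-bit product
    long vU8 = u7 >>> 1;                       // overflow limb after a right shift of >= 1 bit
    // The overflow limb is < 2^63, hence strictly below the normalized top limb
    System.out.println(Long.compareUnsigned(vU8, normTop) < 0); // true
  }
}
```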

Contributor:

Yup, that's correct: the zero would already come in as the last limb of UInt576 in the cases I was thinking about; you implicitly do the division with the zero as the last limb.
I think we can safely remove this prepending of 0 in other cases, and just leave dividends of size UInt257, which is where the danger lies.

UInt256 x = (a.isUInt192()) ? a : m.modReduceNormalised(a, shift, inv);
UInt256 y = (b.isUInt192()) ? b : m.modReduceNormalised(b, shift, inv);
UInt448 prod = x.mul192(y);
int cmp = compareTo(prod);
@lu-pinto commented Mar 23, 2026 (Contributor):

Isn't the compare optimization worth keeping? mulmod usually has this optimization because there's no point doing the mod when the modulus exceeds the numerator. If you remove this optimization here, there's no other anywhere else in this code path.

@siladu commented Mar 24, 2026 (Contributor Author):

 int cmp = compareTo(prod);
 if (cmp == 0) return ZERO;
 if (cmp > 0) return prod;

With this PR, in this path:

  • For modulus m: 0 < m < 2^192 (it's a valid UInt192)
  • At least one input >= 2^192
  • Case: the other input is >= 1
    a * b >= 2^192 > m, strictly. So a * b == m is impossible.
  • Case: the other input is 0
    a * b = 0, and m > 0 (valid modulus), so 0 != m. Also impossible.
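The case analysis above can be sanity-checked exhaustively at random with BigInteger (pure arithmetic, no Besu types; the class name is made up):

```java
import java.math.BigInteger;
import java.util.Random;

// Randomized check of the argument above: when one operand exceeds 192 bits
// and 0 < m < 2^192, a*b can never equal m, so the cmp == 0 branch is dead
// on this path.
public class MulEqualsModulusCheck {
  public static void main(String[] args) {
    Random rnd = new Random(42);
    for (int i = 0; i < 100_000; i++) {
      BigInteger a = new BigInteger(256, rnd).setBit(192); // forces a >= 2^192
      BigInteger b = new BigInteger(256, rnd);             // may be zero
      BigInteger m = new BigInteger(192, rnd).setBit(0);   // odd, so 0 < m < 2^192
      BigInteger prod = a.multiply(b);
      // b == 0 -> prod == 0 != m;  b >= 1 -> prod >= 2^192 > m
      if (prod.equals(m)) throw new AssertionError("a*b == m should be impossible");
    }
    System.out.println("ok");
  }
}
```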

}
// reduce-multiply-reduce
// At least one exceeds 192 bits: full multiply then single reduce.
UInt512 prod = a.mul256(b);
@lu-pinto commented Mar 23, 2026 (Contributor):

Hmm, if at least one operand is only 192 bits you don't need to go full 512 bits for the product; maybe that can be optimized. I'm not saying do a reduce, but maybe do mul192 instead of mul256, since the product of a 256-bit and a 192-bit number is at most 448 bits?

@siladu commented Mar 24, 2026 (Contributor Author):

The fast "both fit in 192" path exists just above.
On this path at least one operand exceeds 192 bits, and both might be 256 bits.

I think you're talking about a third path for the mixed case where one is 192 bits and one is 256?
I started with that and it actually benched worse; I'm working on sharing those benchmarks as well. The code looked something like this:

// Mixed: one fits, one doesn't - reduce the large one, then mul192
// (current R-M-R logic, but only 1 pre-reduction instead of 2)
int shift = Long.numberOfLeadingZeros(u2);
UInt192 m = shiftLeft(shift);
long inv = reciprocal(m.u2);
UInt256 x = (a.isUInt192()) ? a : m.modReduceNormalised(a, shift, inv);
UInt256 y = (b.isUInt192()) ? b : m.modReduceNormalised(b, shift, inv);
UInt448 prod = x.mul192(y);
// … existing compareTo/modReduce logic …

}
// reduce-multiply-reduce
// At least one exceeds 128 bits: full multiply then single reduce.
UInt512 prod = a.mul256(b);
Contributor:

Same as the other comment. One operand could be 128-bit, so the max size of the product here would be 384 bits. However that type does not exist, so not sure if it's worth adding it just for this.

Contributor Author:

Agreed, not sure it's worth adding. I think we'd end up with the same slower mixed result as in #10088 (comment)?

Not sure how significant a difference a UInt384 type would make, but try it and benchmark if you think it's worth it and the mixed R-M-R benchmark I already did isn't enough evidence against this.

@siladu commented Mar 25, 2026 (Contributor Author)

@lu-pinto I re-benched to compare this PR (M-R only) with reduce-multiply-reduce for the mixed case (where one operand fits inside the modulus), i.e. a three-branch version of UInt128/192.mul.

This is the branch (the impl I rejected in favour of this PR): main...siladu:besu:opt-mulmod-wide-operands-fast-path-rmr-for-mixed

The two new benchmark cases MULMOD_256_64_128 and MULMOD_256_64_192 especially highlight the issue:

MULMOD_256_64_128: mr-only=127ns, rmr-for-mixed=146ns = -14.8% regression (the rmr-for-mixed path is worse here)
MULMOD_256_64_192: mr-only=128ns, rmr-for-mixed=137ns = -7.0% regression
MULMOD_192_128_128: mr-only=133ns, rmr-for-mixed=142ns = -7.1% regression
MULMOD_256_256_192: rmr-for-mixed improves slightly 178→174ns (+2.5%)
MULMOD_256_256_256: rmr-for-mixed improves 177→173ns (+2.2%)

The biggest improvements with R-M-R are for smaller modulus (~4.5%), which is likely the average mainnet case. So this PR trades off some average performance to improve worst cases.


2026-03-25_mulmod_fastpath_mr_versus_rmr-for-mixed.html

@siladu siladu self-assigned this Mar 26, 2026
@siladu commented Mar 27, 2026

block-test benchmarks on m6a.2xlarge

BEFORE THIS PR:

taskset -c 4-7 time ./build/install/besu/bin/evmtool --repeat 10 block-test ~/2026-02-09_fixtures_benchmark_v0.0.7/blockchain_tests/benchmark/compute/instruction/arithmetic/mod_arithmetic.json  --test-name "*MULMOD*100M*"
...
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_127-benchmark-gas-value_100M]
... Imported in 1681.43 ms (59.47 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_191-benchmark-gas-value_100M]
... Imported in 1750.43 ms (57.13 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_63-benchmark-gas-value_100M]
... Imported in 968.94 ms (103.21 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_255-benchmark-gas-value_100M]
... Imported in 1603.01 ms (62.38 MGas/s)

THIS PR

taskset -c 4-7 time ./build/install/besu/bin/evmtool --repeat 10 block-test ~/2026-02-09_fixtures_benchmark_v0.0.7/blockchain_tests/benchmark/compute/instruction/arithmetic/mod_arithmetic.json  --test-name "*MULMOD*100M*"
...
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_127-benchmark-gas-value_100M]
... Imported in 1502.63 ms (66.55 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_191-benchmark-gas-value_100M]
... Imported in 1529.92 ms (65.36 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_63-benchmark-gas-value_100M]
... Imported in 964.88 ms (103.64 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_255-benchmark-gas-value_100M]
... Imported in 1608.75 ms (62.16 MGas/s)

siladu and others added 2 commits March 27, 2026 12:27
Add 15 deterministic test vectors covering all branches of the new
UInt128 and UInt192 modReduceNormalisedSlowPath(UInt576) methods,
including vectors that trigger mulSubOverflow through the M-R path.
Also add a targeted random test (mulModRandomWideMR) with 1000 samples
specifically exercising the M-R path with 128-bit and 192-bit moduli.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
@siladu commented Mar 27, 2026

Added targeted test vectors:

  15 deterministic test vectors in mulModTestCases():

  ┌─────┬─────────┬─────────────┬──────────────┬─────────────────────────┐
  │  #  │  Path   │   Branch    │ Product bits │    What it exercises    │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 1   │ UInt128 │ 5 (else)    │ ~258         │ 3 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 2   │ UInt128 │ 4           │ ~321         │ 4 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 3   │ UInt128 │ 3           │ ~384         │ 5 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 4   │ UInt128 │ 2           │ ~448         │ 6 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 5   │ UInt128 │ 1 (v.u7≥u1) │ ~512         │ 7 reduceSteps, shift=0  │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 6   │ UInt128 │ 1 (v.u8≠0)  │ >512         │ 7 reduceSteps, shift=1  │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 7   │ UInt128 │ 2           │ ~448         │ mulSubOverflow (v2==u1) │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 8   │ UInt128 │ noisy       │ ~320         │ All-nonzero limbs       │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 9   │ UInt192 │ 4 (else)    │ ~321         │ 3 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 10  │ UInt192 │ 3           │ ~384         │ 4 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 11  │ UInt192 │ 2           │ ~448         │ 5 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 12  │ UInt192 │ 1 (v.u7≥u2) │ ~512         │ 6 reduceSteps, shift=0  │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 13  │ UInt192 │ 1 (v.u8≠0)  │ >512         │ 6 reduceSteps, shift=1  │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 14  │ UInt192 │ 2           │ ~448         │ mulSubOverflow (v3==u2) │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 15  │ UInt192 │ noisy       │ ~384         │ All-nonzero limbs       │
  └─────┴─────────┴─────────────┴──────────────┴─────────────────────────┘

  1 new random test mulModRandomWideMR() — 1000 samples specifically targeting the M-R path with 128-bit and 192-bit moduli and a guaranteed-256-bit first operand.
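The shape of such a random differential test can be sketched with BigInteger as the oracle (the class and mulModUnderTest names are made up; the real test exercises Besu's UInt128/UInt192 code):

```java
import java.math.BigInteger;
import java.util.Random;

// Sketch of a wide-operand M-R random test: the first operand is forced to a
// full 256 bits so the wide branch would be taken, the modulus alternates
// between 128 and 192 bits, and BigInteger serves as the reference oracle.
public class WideMRRandomSketch {
  static BigInteger mulModUnderTest(BigInteger a, BigInteger b, BigInteger m) {
    return a.multiply(b).mod(m); // placeholder standing in for the real implementation
  }

  public static void main(String[] args) {
    Random rnd = new Random(1);
    for (int i = 0; i < 1000; i++) {
      BigInteger a = new BigInteger(256, rnd).setBit(255);   // guaranteed 256-bit operand
      BigInteger b = new BigInteger(256, rnd);
      int modBits = (i % 2 == 0) ? 128 : 192;                // 128- and 192-bit moduli
      BigInteger m = new BigInteger(modBits, rnd).setBit(0); // nonzero (odd) modulus
      BigInteger expected = a.multiply(b).mod(m);            // BigInteger oracle
      if (!mulModUnderTest(a, b, m).equals(expected)) {
        throw new AssertionError("mismatch at sample " + i);
      }
    }
    System.out.println("1000 samples ok");
  }
}
```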

siladu added 2 commits March 27, 2026 14:38
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
@siladu siladu marked this pull request as ready for review March 29, 2026 23:28
Copilot AI review requested due to automatic review settings March 29, 2026 23:28
@lu-pinto left a comment (Contributor):

LGTM

siladu added 3 commits March 30, 2026 09:47
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
@siladu siladu enabled auto-merge (squash) March 30, 2026 04:00
@siladu siladu merged commit cf45eb7 into besu-eth:main Mar 30, 2026
46 checks passed
@siladu siladu deleted the opt-mulmod-wide-operands-fast-path branch March 30, 2026 09:43


3 participants