
Optimize MULMOD worst cases#10088

Merged

siladu merged 8 commits into besu-eth:main from siladu:opt-mulmod-wide-operands-fast-path on Mar 30, 2026

Conversation

@siladu commented Mar 23, 2026 (Contributor)

Optimize MULMOD worst cases via multiply-then-reduce (M-R) for UInt192 and UInt128

Replace reduce-multiply-reduce (R-M-R) with multiply-then-reduce (M-R) in UInt192.mul and UInt128.mul when at least one input exceeds the modulus width.

Previously, when either operand didn't fit in the modulus, both were pre-reduced before multiplication (R-M-R). Instead, compute the full 256×256 product first, then apply a single modular reduction (M-R).

New infrastructure:

  • UInt192.modReduceNormalised(UInt512, shift, inv) + slow-path for UInt576
  • UInt128.modReduceNormalised(UInt512, shift, inv) + slow-path for UInt576

Both UInt192.mul and UInt128.mul still have two branches, but both use M-R:

  1. Fast path: both inputs fit in modulus width - use narrow multiply + reduce
  2. Otherwise: full mul256 → UInt512 → single reduce
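The correctness of swapping R-M-R for M-R rests on the identity (a·b) mod m = ((a mod m)·(b mod m)) mod m; only the cost profile differs. A minimal BigInteger sketch of the two strategies (the method names are illustrative, not Besu's API — with fixed-width limbs, M-R trades two pre-reductions for one wider 512-bit reduction):

```java
import java.math.BigInteger;

// Illustrative BigInteger sketch; mulModMR/mulModRMR are made-up names,
// not Besu's API.
public class MulModStrategies {
  // M-R: one full-width multiply, then a single reduction.
  static BigInteger mulModMR(BigInteger a, BigInteger b, BigInteger m) {
    return a.multiply(b).mod(m);
  }

  // R-M-R: pre-reduce both operands, multiply, then reduce again.
  static BigInteger mulModRMR(BigInteger a, BigInteger b, BigInteger m) {
    return a.mod(m).multiply(b.mod(m)).mod(m);
  }

  public static void main(String[] args) {
    BigInteger a = BigInteger.ONE.shiftLeft(255).add(BigInteger.valueOf(17)); // exceeds 192 bits
    BigInteger b = BigInteger.ONE.shiftLeft(200).add(BigInteger.valueOf(3));  // exceeds 192 bits
    BigInteger m = BigInteger.ONE.shiftLeft(191).subtract(BigInteger.ONE);    // fits UInt192
    System.out.println(mulModMR(a, b, m).equals(mulModRMR(a, b, m))); // true
  }
}
```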

Also adds JMH benchmark cases for MULMOD_256_64_128 and MULMOD_256_64_192 to cover mixed-width scenarios.

Testing

Benchmarks

JMH

./gradlew --no-daemon :ethereum:core:jmh -Pincludes=MulMod
(run on m6a.2xlarge)

Key observations from the data:

  • Worst case improved from 34 -> 44 MGas/s
  • Big wins: MULMOD_256_256_192 +23.6% ns / +30.8% MGas, MULMOD_256_256_128 +17.1% / +20.7%, MULMOD_192_192_128 +12.8% / +14.7%, MULMOD_256_192_128 +9.3% / +10.3%
  • Notable wins: MULMOD_192_128_128 +8.5%, MULMOD_256_128_128 +3.2%
  • Small regressions: MULMOD_64_64_64 -4.2%, MULMOD_64_64_32 -0.7% - maybe overhead of the fast-path check in small cases
  • MULMOD_256_256_256: essentially neutral (0.01%), which makes sense as the result fits 256 bits so the existing path already handles it well
  • FULL_RANDOM: +2.5% overall improvement
[screenshot: opt-mulmod-benchmark-comparison]

mulmod_benchmark_comparison.html

evmtool block-test

bits_127: 59.47 -> 66.55 MGas/s
bits_191: 57.13 -> 65.36 MGas/s
bits_63: 103.21 -> 103.64 MGas/s
bits_255: 62.38 -> 62.16 MGas/s

#10088 (comment)

…2 and UInt128

Replace reduce-multiply-reduce (R-M-R) with multiply-then-reduce (M-R) in UInt192.mul and UInt128.mul when at least one input exceeds the modulus width.

Previously, when either operand didn't fit in the modulus, both were pre-reduced before multiplication (R-M-R). Instead, compute the full 256×256 product first, then apply a single modular reduction (M-R).

New infrastructure:
- UInt192.modReduceNormalised(UInt512, shift, inv) + slow-path for UInt576
- UInt128.modReduceNormalised(UInt512, shift, inv) + slow-path for UInt576

Both UInt192.mul and UInt128.mul still have two branches, but both use M-R:
1. Fast path: both inputs fit in modulus width - use narrow multiply + reduce
2. Otherwise: full mul256 → UInt512 → single reduce

Benchmark results (mac):
  MULMOD_256_256_192:  150ns → 100ns  (-50ns)
  MULMOD_256_256_128:  155ns → 104ns  (-51ns)
  MULMOD_256_192_192:  114ns → 110ns  ( -4ns)
  MULMOD_256_192_128:  128ns →  86ns  (-42ns)
  MULMOD_256_128_128:  101ns →  68ns  (-33ns)
  MULMOD_256_64_192:    82ns →  64ns  (-18ns)

Also adds JMH benchmark cases for MULMOD_256_64_128 and MULMOD_256_64_192 to cover mixed-width scenarios.

Signed-off-by: Simon Dudley <simon.dudley@consensys.net>

private UInt256 modReduceNormalised(final UInt512 that, final int shift, final long inv) {
UInt576 v = that.shiftLeftWide(shift);
return modReduceNormalisedSlowPath(v, shift, inv);
Contributor:

you are executing the slowPath regardless so that just becomes the normal path? What is the difference?

Contributor Author:

If its the name that's confusing, could you suggest a better name?
Maybe modReduceNormalisedWide might work?

The advantage of keeping the "slow path" is that it is in keeping with the rest of the file: this is the path that ends up maximising the reduction steps.
But I agree it is a little odd that in this narrow context, there is no "fast path".

Contributor Author:

Added a comment about why there is no fast path: ba4c7d6

@lu-pinto commented Mar 28, 2026 (Contributor):

I would be fine with just collapsing the method into the other; there's no benefit in one calling the other if there's no performance impact. You just figured out that there's no fast path, so that's fine.

Contributor Author:

Inlined these two methods and removed the comment 540a0e6



private UInt256 modReduceNormalisedSlowPath(final UInt576 v, final int shift, final long inv) {
QR192 qr;
if (v.u8 != 0 || Long.compareUnsigned(v.u7, u2) >= 0) {
@lu-pinto commented Mar 23, 2026 (Contributor):

Are you sure you don't need to address the case when Long.compareUnsigned(v.u8, u2) >= 0? A few bugs were caused by the lack of this branch in some types.

Contributor Author:

According to Claude, v.u8 < u2 always holds (v.u8 < 2^63, while u2 >= 2^63 after normalization)

Contributor Author:

The normalization invariant guarantees v.u8 < u2 always: the shifted product's top limb is u7 >>> invShift where invShift >= 1, giving v.u8 < 2^63, while the normalized modulus top limb has MSB set, so u2 >= 2^63. The reduceStep precondition is always satisfied.
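The invariant can be demonstrated with plain long arithmetic (the limb values below are hypothetical; the Besu limb names are only echoed in comments):

```java
// Sketch of the normalization invariant argued above, using made-up limb values.
public class NormalisationInvariant {
  public static void main(String[] args) {
    long u2 = 0x000000FFFFFFFFFFL;             // hypothetical nonzero top limb of the modulus
    int shift = Long.numberOfLeadingZeros(u2); // normalization shift
    long normTop = u2 << shift;                // normalized top limb: MSB now set
    // MSB set means the unsigned value is >= 2^63
    System.out.println(Long.compareUnsigned(normTop, 1L << 63) >= 0); // true

    long u7 = 0xFFFFFFFFFFFFFFFFL;             // worst-case top limb of the 512-bit product
    long vU8 = u7 >>> 1;                       // overflow limb after a right shift of >= 1 bit
    // The overflow limb is < 2^63, hence strictly below the normalized top limb
    System.out.println(Long.compareUnsigned(vU8, normTop) < 0); // true
  }
}
```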

Contributor:

Yup, that's correct: the zero would already come in as the last limb of UInt576 in the cases I was thinking about; you implicitly do the division with the zero as the last limb.
I think we can safely remove this prepending of 0 in other cases, and just leave dividends of size UInt257, which is where the danger lies.

UInt256 x = (a.isUInt192()) ? a : m.modReduceNormalised(a, shift, inv);
UInt256 y = (b.isUInt192()) ? b : m.modReduceNormalised(b, shift, inv);
UInt448 prod = x.mul192(y);
int cmp = compareTo(prod);
@lu-pinto commented Mar 23, 2026 (Contributor):

Isn't the compare optimization worth keeping? mulmod usually has this optimization because there's no point doing the mod when the modulus exceeds the numerator. If you remove this optimization here, there's no other anywhere else in this code path.

@siladu commented Mar 24, 2026 (Contributor Author):

 int cmp = compareTo(prod);
 if (cmp == 0) return ZERO;
 if (cmp > 0) return prod;

With this PR, in this path:

  • For modulus m: 0 < m < 2^192 (it's a valid UInt192)
  • At least one input >= 2^192
  • Case: the other input is >= 1
    a * b >= 2^192 > m, strictly. So a * b == m is impossible.
  • Case: the other input is 0
    a * b = 0, and m > 0 (valid modulus), so 0 != m. Also impossible.
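The case analysis above can be sanity-checked exhaustively at random with BigInteger (pure arithmetic, no Besu types; the class name is made up):

```java
import java.math.BigInteger;
import java.util.Random;

// Randomized check of the argument above: when one operand exceeds 192 bits
// and 0 < m < 2^192, a*b can never equal m, so the cmp == 0 branch is dead
// on this path.
public class MulEqualsModulusCheck {
  public static void main(String[] args) {
    Random rnd = new Random(42);
    for (int i = 0; i < 100_000; i++) {
      BigInteger a = new BigInteger(256, rnd).setBit(192); // forces a >= 2^192
      BigInteger b = new BigInteger(256, rnd);             // may be zero
      BigInteger m = new BigInteger(192, rnd).setBit(0);   // odd, so 0 < m < 2^192
      BigInteger prod = a.multiply(b);
      // b == 0 -> prod == 0 != m;  b >= 1 -> prod >= 2^192 > m
      if (prod.equals(m)) throw new AssertionError("a*b == m should be impossible");
    }
    System.out.println("ok");
  }
}
```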

}
// reduce-multiply-reduce
// At least one exceeds 192 bits: full multiply then single reduce.
UInt512 prod = a.mul256(b);
@lu-pinto commented Mar 23, 2026 (Contributor):

Hmm, if at least one operand is only 192 bits you don't need to go full 512 bits for the product; maybe that can be optimized. I'm not saying do a reduce, but maybe do mul192 instead of mul256, since the product of a 256-bit and a 192-bit number is at most 448 bits?

@siladu commented Mar 24, 2026 (Contributor Author):

The fast "both fit in 192" path exists just above.
On this path at least one operand exceeds 192 bits, and both might be 256 bits.

I think you're talking about a third path for the mixed case where one is 192 bits and one is 256?
I started with that and it actually benched worse; I'm working on sharing those benchmarks as well. The code looked something like this:

// Mixed: one fits, one doesn't - reduce the large one, then mul192
// (current R-M-R logic, but only 1 pre-reduction instead of 2)
int shift = Long.numberOfLeadingZeros(u2);
UInt192 m = shiftLeft(shift);
long inv = reciprocal(m.u2);
UInt256 x = (a.isUInt192()) ? a : m.modReduceNormalised(a, shift, inv);
UInt256 y = (b.isUInt192()) ? b : m.modReduceNormalised(b, shift, inv);
UInt448 prod = x.mul192(y);
// … existing compareTo/modReduce logic …

}
// reduce-multiply-reduce
// At least one exceeds 128 bits: full multiply then single reduce.
UInt512 prod = a.mul256(b);
Contributor:

Same as the other comment. One operand could be 128-bit, so the max size of the product here would be 384 bits. However that type does not exist, so not sure if it's worth adding it just for this.

Contributor Author:

Agreed, not sure it's worth adding. I think we'd end up with the same slower mixed result as in #10088 (comment)?

Not sure how significant a difference a UInt384 type would make, but try it and benchmark if you think it's worth it and the mixed R-M-R benchmark I already did isn't enough evidence against this.

@siladu commented Mar 25, 2026 (Contributor Author)

@lu-pinto I re-benched to compare this PR (M-R only) with reduce-multiply-reduce for the mixed case (where one operand fits inside the modulus), i.e. a three-branch version of UInt128/192.mul.

This is the branch (the impl I rejected in favour of this PR): main...siladu:besu:opt-mulmod-wide-operands-fast-path-rmr-for-mixed

The two new benchmark cases MULMOD_256_64_128 and MULMOD_256_64_192 especially highlight the issue:

MULMOD_256_64_128: mr-only=127ns, rmr-for-mixed=146ns = -14.8% regression (the rmr-for-mixed path is worse here)
MULMOD_256_64_192: mr-only=128ns, rmr-for-mixed=137ns = -7.0% regression
MULMOD_192_128_128: mr-only=133ns, rmr-for-mixed=142ns = -7.1% regression
MULMOD_256_256_192: rmr-for-mixed improves slightly 178→174ns (+2.5%)
MULMOD_256_256_256: rmr-for-mixed improves 177→173ns (+2.2%)

The biggest improvements with R-M-R are for smaller modulus (~4.5%), which is likely the average mainnet case. So this PR trades off some average performance to improve worst cases.


2026-03-25_mulmod_fastpath_mr_versus_rmr-for-mixed.html

@siladu siladu self-assigned this Mar 26, 2026
@siladu commented Mar 27, 2026

block-test benchmarks on m6a.2xlarge

BEFORE THIS PR:

taskset -c 4-7 time ./build/install/besu/bin/evmtool --repeat 10 block-test ~/2026-02-09_fixtures_benchmark_v0.0.7/blockchain_tests/benchmark/compute/instruction/arithmetic/mod_arithmetic.json  --test-name "*MULMOD*100M*"
...
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_127-benchmark-gas-value_100M]
... Imported in 1681.43 ms (59.47 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_191-benchmark-gas-value_100M]
... Imported in 1750.43 ms (57.13 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_63-benchmark-gas-value_100M]
... Imported in 968.94 ms (103.21 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_255-benchmark-gas-value_100M]
... Imported in 1603.01 ms (62.38 MGas/s)

THIS PR

taskset -c 4-7 time ./build/install/besu/bin/evmtool --repeat 10 block-test ~/2026-02-09_fixtures_benchmark_v0.0.7/blockchain_tests/benchmark/compute/instruction/arithmetic/mod_arithmetic.json  --test-name "*MULMOD*100M*"
...
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_127-benchmark-gas-value_100M]
... Imported in 1502.63 ms (66.55 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_191-benchmark-gas-value_100M]
... Imported in 1529.92 ms (65.36 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_63-benchmark-gas-value_100M]
... Imported in 964.88 ms (103.64 MGas/s)
... test_mod_arithmetic[fork_Osaka-blockchain_test-opcode_MULMOD-mod_bits_255-benchmark-gas-value_100M]
... Imported in 1608.75 ms (62.16 MGas/s)

siladu and others added 2 commits March 27, 2026 12:27
Add 15 deterministic test vectors covering all branches of the new
UInt128 and UInt192 modReduceNormalisedSlowPath(UInt576) methods,
including vectors that trigger mulSubOverflow through the M-R path.
Also add a targeted random test (mulModRandomWideMR) with 1000 samples
specifically exercising the M-R path with 128-bit and 192-bit moduli.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
@siladu commented Mar 27, 2026

Added targeted test vectors:

  15 deterministic test vectors in mulModTestCases():

  ┌─────┬─────────┬─────────────┬──────────────┬─────────────────────────┐
  │  #  │  Path   │   Branch    │ Product bits │    What it exercises    │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 1   │ UInt128 │ 5 (else)    │ ~258         │ 3 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 2   │ UInt128 │ 4           │ ~321         │ 4 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 3   │ UInt128 │ 3           │ ~384         │ 5 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 4   │ UInt128 │ 2           │ ~448         │ 6 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 5   │ UInt128 │ 1 (v.u7≥u1) │ ~512         │ 7 reduceSteps, shift=0  │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 6   │ UInt128 │ 1 (v.u8≠0)  │ >512         │ 7 reduceSteps, shift=1  │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 7   │ UInt128 │ 2           │ ~448         │ mulSubOverflow (v2==u1) │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 8   │ UInt128 │ noisy       │ ~320         │ All-nonzero limbs       │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 9   │ UInt192 │ 4 (else)    │ ~321         │ 3 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 10  │ UInt192 │ 3           │ ~384         │ 4 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 11  │ UInt192 │ 2           │ ~448         │ 5 reduceSteps           │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 12  │ UInt192 │ 1 (v.u7≥u2) │ ~512         │ 6 reduceSteps, shift=0  │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 13  │ UInt192 │ 1 (v.u8≠0)  │ >512         │ 6 reduceSteps, shift=1  │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 14  │ UInt192 │ 2           │ ~448         │ mulSubOverflow (v3==u2) │
  ├─────┼─────────┼─────────────┼──────────────┼─────────────────────────┤
  │ 15  │ UInt192 │ noisy       │ ~384         │ All-nonzero limbs       │
  └─────┴─────────┴─────────────┴──────────────┴─────────────────────────┘

  1 new random test mulModRandomWideMR() — 1000 samples specifically targeting the M-R path with 128-bit and 192-bit moduli and a guaranteed-256-bit first operand.
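The shape of such a random differential test can be sketched with BigInteger as the oracle (the class and mulModUnderTest names are made up; the real test exercises Besu's UInt128/UInt192 code):

```java
import java.math.BigInteger;
import java.util.Random;

// Sketch of a wide-operand M-R random test: the first operand is forced to a
// full 256 bits so the wide branch would be taken, the modulus alternates
// between 128 and 192 bits, and BigInteger serves as the reference oracle.
public class WideMRRandomSketch {
  static BigInteger mulModUnderTest(BigInteger a, BigInteger b, BigInteger m) {
    return a.multiply(b).mod(m); // placeholder standing in for the real implementation
  }

  public static void main(String[] args) {
    Random rnd = new Random(1);
    for (int i = 0; i < 1000; i++) {
      BigInteger a = new BigInteger(256, rnd).setBit(255);   // guaranteed 256-bit operand
      BigInteger b = new BigInteger(256, rnd);
      int modBits = (i % 2 == 0) ? 128 : 192;                // 128- and 192-bit moduli
      BigInteger m = new BigInteger(modBits, rnd).setBit(0); // nonzero (odd) modulus
      BigInteger expected = a.multiply(b).mod(m);            // BigInteger oracle
      if (!mulModUnderTest(a, b, m).equals(expected)) {
        throw new AssertionError("mismatch at sample " + i);
      }
    }
    System.out.println("1000 samples ok");
  }
}
```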

siladu added 2 commits March 27, 2026 14:38
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
@siladu siladu marked this pull request as ready for review March 29, 2026 23:28
Copilot AI review requested due to automatic review settings March 29, 2026 23:28
@lu-pinto left a comment (Contributor):

LGTM

siladu added 3 commits March 30, 2026 09:47
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
@siladu siladu enabled auto-merge (squash) March 30, 2026 04:00
@siladu siladu merged commit cf45eb7 into besu-eth:main Mar 30, 2026
46 checks passed
@siladu siladu deleted the opt-mulmod-wide-operands-fast-path branch March 30, 2026 09:43


3 participants