Conversation
Optimize MULMOD worst cases via multiply-then-reduce (M-R) for UInt192 and UInt128

Replace reduce-multiply-reduce (R-M-R) with multiply-then-reduce (M-R) in UInt192.mul and UInt128.mul when at least one input exceeds the modulus width.

Previously, when either operand didn't fit in the modulus, both were pre-reduced before multiplication (R-M-R). Instead, compute the full 256×256 product first, then apply a single modular reduction (M-R).

New infrastructure:
- UInt192.modReduceNormalised(UInt512, shift, inv) + slow path for UInt576
- UInt128.modReduceNormalised(UInt512, shift, inv) + slow path for UInt576

Both UInt192.mul and UInt128.mul still have two branches, but both use M-R:
1. Fast path: both inputs fit in modulus width, so use narrow multiply + reduce
2. Otherwise: full mul256 → UInt512 → single reduce

Benchmark results (mac):

    MULMOD_256_256_192: 150ns → 100ns (-50ns)
    MULMOD_256_256_128: 155ns → 104ns (-51ns)
    MULMOD_256_192_192: 114ns → 110ns ( -4ns)
    MULMOD_256_192_128: 128ns →  86ns (-42ns)
    MULMOD_256_128_128: 101ns →  68ns (-33ns)
    MULMOD_256_64_192:   82ns →  64ns (-18ns)

Also adds JMH benchmark cases for MULMOD_256_64_128 and MULMOD_256_64_192 to cover mixed-width scenarios.

Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
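The R-M-R vs M-R equivalence described above can be modelled with BigInteger (a hypothetical sketch for illustration only, not Besu's limb-based implementation; the class and method names here are invented):

```java
import java.math.BigInteger;

public class MulModPaths {
    // Old path: reduce each operand first, then multiply, then reduce (R-M-R).
    static BigInteger rmr(BigInteger a, BigInteger b, BigInteger m) {
        return a.mod(m).multiply(b.mod(m)).mod(m);
    }

    // New path: full-width multiply first, then a single reduction (M-R).
    static BigInteger mr(BigInteger a, BigInteger b, BigInteger m) {
        return a.multiply(b).mod(m);
    }

    public static void main(String[] args) {
        BigInteger m = BigInteger.ONE.shiftLeft(192).subtract(BigInteger.valueOf(159)); // 192-bit modulus
        BigInteger a = BigInteger.ONE.shiftLeft(255).add(BigInteger.valueOf(7));        // 256-bit operand
        BigInteger b = BigInteger.ONE.shiftLeft(200).add(BigInteger.valueOf(11));       // exceeds modulus width
        // Both strategies agree; M-R just trades two pre-reductions for one post-reduction.
        System.out.println(rmr(a, b, m).equals(mr(a, b, m))); // true
    }
}
```

The performance difference comes entirely from the number of reduction passes, not from the result.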
```java
private UInt256 modReduceNormalised(final UInt512 that, final int shift, final long inv) {
  UInt576 v = that.shiftLeftWide(shift);
  return modReduceNormalisedSlowPath(v, shift, inv);
}
```
You are executing the slow path regardless, so that just becomes the normal path? What is the difference?
If it's the name that's confusing, could you suggest a better name? Maybe modReduceNormalisedWide might work?
The advantage of keeping the "slow path" is that it is in keeping with the rest of the file: this is the path that ends up maximising the reduction steps.
But I agree it is a little odd that in this narrow context, there is no "fast path".
Added a comment about why there is no fast path: ba4c7d6
I would be fine with just collapsing the method into the other; there's no benefit in one calling the other if there's no performance impact. You just figured out that there's no fast path, so that's fine.
Inlined these two methods and removed the comment: 540a0e6
```java
private UInt256 modReduceNormalisedSlowPath(final UInt576 v, final int shift, final long inv) {
  QR192 qr;
  if (v.u8 != 0 || Long.compareUnsigned(v.u7, u2) >= 0) {
```
Are you sure you don't need to address the case when Long.compareUnsigned(v.u8, u2) >= 0? A few bugs were caused by the lack of this branch in some types.
According to Claude, v.u8 < u2 always holds (v.u8 < 2^63, u2 >= 2^63 after normalization).
The normalization invariant guarantees v.u8 < u2 always: the shifted product's top limb is u7 >>> invShift where invShift >= 1, giving v.u8 < 2^63, while the normalized modulus top limb has MSB set, so u2 >= 2^63. The reduceStep precondition is always satisfied.
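The invariant can be spot-checked numerically with a plain-long model (a hypothetical sketch mirroring the description above, not the actual Besu code; `invariantHolds` is an invented helper, and the sketch covers the shift range 1..63 where invShift >= 1 applies):

```java
public class NormalisationInvariant {
    // After normalisation the modulus top limb u2 has its MSB set (u2 >= 2^63 unsigned),
    // while the shifted product's top limb is u7 >>> invShift with invShift = 64 - shift >= 1,
    // so it is always < 2^63 and therefore unsigned-less-than u2.
    static boolean invariantHolds(long u7, long u2, int shift) {
        long vU8 = u7 >>> (64 - shift); // top limb of the UInt576 after shiftLeftWide(shift)
        return Long.compareUnsigned(vU8, u2) < 0;
    }

    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < 1000; i++) {
            long u2 = rnd.nextLong() | 0x8000000000000000L; // normalised modulus top limb: MSB set
            long u7 = rnd.nextLong();                        // top limb of the 512-bit product
            int shift = 1 + rnd.nextInt(63);                 // normalisation shift in 1..63
            if (!invariantHolds(u7, u2, shift)) throw new AssertionError("invariant violated");
        }
        System.out.println("v.u8 < u2 held for all samples");
    }
}
```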
Yup, true, that's correct: the zero would already come in as the last limb of UInt576 in the cases I was thinking about; you do the division with the zero as the last limb implicitly.
I think we can safely remove this prepended 0 in other cases, and just leave dividends of size UInt257, which is where the danger lies.
```java
UInt256 x = (a.isUInt192()) ? a : m.modReduceNormalised(a, shift, inv);
UInt256 y = (b.isUInt192()) ? b : m.modReduceNormalised(b, shift, inv);
UInt448 prod = x.mul192(y);
int cmp = compareTo(prod);
```
Isn't the compare optimization worth keeping? Modulus implementations usually have this optimization because there's no point doing the mod when the modulus exceeds the numerator. If you remove this optimization here, there's no other one anywhere else in this code path.
```java
int cmp = compareTo(prod);
if (cmp == 0) return ZERO;
if (cmp > 0) return prod;
```
With this PR, in this path:
- For modulus m: 0 < m < 2^192 (it's a valid UInt192)
- At least one input is >= 2^192
- Case: the other input is >= 1. Then a * b >= 2^192 > m, strictly, so a * b == m is impossible.
- Case: the other input is 0. Then a * b = 0, and m > 0 (valid modulus), so 0 != m. Also impossible.
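The boundary cases in that argument can be checked with a BigInteger model (a hypothetical sketch, not the Besu code; `productNeverEqualsModulus` is an invented helper name):

```java
import java.math.BigInteger;

public class CompareBranchDead {
    static final BigInteger TWO_192 = BigInteger.ONE.shiftLeft(192);

    // On the wide path: m < 2^192 and at least one operand >= 2^192.
    // The product is either 0 (other operand zero) or >= 2^192 > m, strictly,
    // so the cmp == 0 and cmp > 0 branches can never be taken.
    static boolean productNeverEqualsModulus(BigInteger a, BigInteger b, BigInteger m) {
        return !a.multiply(b).equals(m);
    }

    public static void main(String[] args) {
        BigInteger m = TWO_192.subtract(BigInteger.ONE); // largest valid UInt192 modulus
        BigInteger a = TWO_192;                          // smallest operand exceeding 192 bits
        System.out.println(productNeverEqualsModulus(a, BigInteger.ONE, m));  // true
        System.out.println(productNeverEqualsModulus(a, BigInteger.ZERO, m)); // true
    }
}
```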
```java
}
// reduce-multiply-reduce
// At least one exceeds 192 bits: full multiply then single reduce.
UInt512 prod = a.mul256(b);
```
Hmm, if there's at least one operand that fits in 192 bits, you don't need to go full 512 bits for the product; maybe that can be optimized. I'm not saying do the reduce, but maybe do mul192 instead of mul256, as the product of a 256-bit with a 192-bit number is 448 bits maximum?
The fast "both fit in 192" path exists just above.
On this path, at least one operand exceeds 192 bits; they both might be 256 bits here.
I think you're talking about a third path for the mixed case, where one is 192 and one is 256?
I started with that and it actually benched worse; I'm working on sharing those benchmarks as well. The code looked something like this:

```java
// Mixed: one fits, one doesn't - reduce the large one, then mul192
// (current R-M-R logic, but only 1 pre-reduction instead of 2)
int shift = Long.numberOfLeadingZeros(u2);
UInt192 m = shiftLeft(shift);
long inv = reciprocal(m.u2);
UInt256 x = (a.isUInt192()) ? a : m.modReduceNormalised(a, shift, inv);
UInt256 y = (b.isUInt192()) ? b : m.modReduceNormalised(b, shift, inv);
UInt448 prod = x.mul192(y);
// … existing compareTo/modReduce logic …
```
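For reference, a BigInteger model of that rejected mixed path (a hypothetical sketch with invented names; `mod` stands in for the limb-level modReduceNormalised/mul192 operations) shows it computes the same result as plain multiply-then-reduce, so the choice between the two is purely about performance:

```java
import java.math.BigInteger;

public class MixedPathModel {
    static final BigInteger TWO_192 = BigInteger.ONE.shiftLeft(192);

    // Mixed R-M-R: pre-reduce only the operand that exceeds 192 bits,
    // then multiply and do the final reduction.
    static BigInteger mixedRmr(BigInteger a, BigInteger b, BigInteger m) {
        BigInteger x = a.compareTo(TWO_192) < 0 ? a : a.mod(m); // reduce only the wide operand
        BigInteger y = b.compareTo(TWO_192) < 0 ? b : b.mod(m);
        return x.multiply(y).mod(m); // both factors now fit in 192 bits
    }

    public static void main(String[] args) {
        BigInteger m = TWO_192.subtract(BigInteger.valueOf(93));
        BigInteger a = BigInteger.ONE.shiftLeft(255).add(BigInteger.valueOf(3)); // 256-bit operand
        BigInteger b = TWO_192.subtract(BigInteger.valueOf(5));                  // fits in 192 bits
        // Same answer as the M-R path taken in this PR.
        System.out.println(mixedRmr(a, b, m).equals(a.multiply(b).mod(m))); // true
    }
}
```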
```java
}
// reduce-multiply-reduce
// At least one exceeds 128 bits: full multiply then single reduce.
UInt512 prod = a.mul256(b);
```
Same as the other comment. One operand could be 128-bit, so the max size of the product here would be 384 bits. However, that type does not exist, so I'm not sure it's worth adding it just for this.
Agreed, not sure it's worth adding. I think we'd end up with the same slower mixed result as in #10088 (comment)?
Not sure how significant a difference a UInt384 type is going to make, but try it and benchmark if you think it's worth it and the mixed R-M-R benchmark I already did isn't enough evidence against this.
@lu-pinto I rebenched to compare this PR (multiply-then-reduce only) with reduce-multiply-reduce for the mixed case (where one operand fits inside the modulus), i.e. a three-branch version of UInt128/192.mul.

This is the branch (the impl I rejected in favour of this PR): main...siladu:besu:opt-mulmod-wide-operands-fast-path-rmr-for-mixed

The two new benchmark cases MULMOD_256_64_128 and MULMOD_256_64_192 especially highlight the issue: the biggest improvements with R-M-R are for smaller moduli (~4.5%), which is likely the average mainnet case. So this PR trades off some average performance to improve the worst cases.
block-test benchmarks on m6a.2xlarge, BEFORE THIS PR vs. THIS PR:
Add 15 deterministic test vectors covering all branches of the new UInt128 and UInt192 modReduceNormalisedSlowPath(UInt576) methods, including vectors that trigger mulSubOverflow through the M-R path.

Also add a targeted random test (mulModRandomWideMR) with 1000 samples specifically exercising the M-R path with 128-bit and 192-bit moduli.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
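A cross-check in the spirit of the randomised test described above (a hypothetical sketch: the real test exercises Besu's UInt128/UInt192 types, while this model uses BigInteger as its own oracle and only checks the basic reduction contract):

```java
import java.math.BigInteger;
import java.util.Random;

public class MulModRandomCheck {
    // Reference mulmod: (a * b) mod m via BigInteger, the oracle the real tests compare against.
    static BigInteger mulmod(BigInteger a, BigInteger b, BigInteger m) {
        return a.multiply(b).mod(m);
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        for (int i = 0; i < 1000; i++) {
            BigInteger a = new BigInteger(256, rnd);           // wide operand, up to 256 bits
            BigInteger b = new BigInteger(256, rnd);
            BigInteger m = new BigInteger(192, rnd).setBit(0); // nonzero modulus of at most 192 bits
            BigInteger r = mulmod(a, b, m);
            // A modular reduction result is always in [0, m).
            if (r.signum() < 0 || r.compareTo(m) >= 0) {
                throw new AssertionError("bad reduction at sample " + i);
            }
        }
        System.out.println("1000 samples OK");
    }
}
```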
Add targeted test vectors
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Optimize MULMOD worst cases via multiply-then-reduce (M-R) for UInt192 and UInt128
Testing
Benchmarks
JMH
./gradlew --no-daemon :ethereum:core:jmh -Pincludes=MulMod
on m6a.2xlarge. Key observations from the data:
mulmod_benchmark_comparison.html
evmtool block-test
#10088 (comment)