Improve shift opcodes performance (SAR, SHR and SHL)#9796
Improve shift opcodes performance (SAR, SHR and SHL)#9796ahamlat merged 34 commits intobesu-eth:mainfrom
Conversation
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
| aPool = new Bytes[SAMPLE_SIZE]; // shift amount (pushed second, popped first) | ||
| bPool = new Bytes[SAMPLE_SIZE]; // value (pushed first, popped second) | ||
|
|
||
| final ThreadLocalRandom random = ThreadLocalRandom.current(); |
There was a problem hiding this comment.
why do we keep using ThreadLocalRandom ?
There was a problem hiding this comment.
It is more thread safe, it doesn't help a lot in this case but it is in general a best practice.
...rc/jmh/java/org/hyperledger/besu/ethereum/vm/operations/AbstractShiftOperationBenchmark.java
Outdated
Show resolved
Hide resolved
|
In the benchmarks I noticed that only the RANDOM case has randomized data. Regarding other cases, are you sure there's no data cache going on with the benchmarks? Can't we randomize the number being shifted somehow? |
| final Bytes shiftAmount = frame.popStackItem(); | ||
| final Bytes value = frame.popStackItem(); | ||
| byte[] valueBytes = value.toArrayUnsafe(); | ||
| if (Arrays.equals(valueBytes, ALL_ONES_BYTES)) { |
There was a problem hiding this comment.
Is this worth it for just this single case? This evaluation might be fast but the drawback is that you have to evaluate it for every single input case.
There was a problem hiding this comment.
Yes, this is a good fast path that can happen. The benchmarks showed that it doesn't affect a lot other cases.
There was a problem hiding this comment.
Also why do you prefer all ones to all zeros?
Please measure worst case performance without this evaluation.
There was a problem hiding this comment.
I would say -1 (all ones), is a more common pattern in EVM, even if 0 is used a lot as well.
There was a problem hiding this comment.
But I can double check the benchmarks on the overhead, and remove it if the overhead is noticeable. I can remove it, the first implementation didn't have this fast path.
|
|
||
| final int shiftBytes = shift >>> 3; // /8 | ||
| final int shiftBits = shift & 7; // %8 | ||
| final int fill = negative ? 0xFF : 0x00; |
There was a problem hiding this comment.
if all you need the negative flag for is to figure out the fill, you can just do int fill = in[0] >> 31
There was a problem hiding this comment.
It doesn't work as it is for 1 byte, maybe something that can work in this case
int fill = ((in[0] >> 7) & 0xFF);
Good point, my initial idea was to have stable results that don't change a lot to be able to compare with different versions. |
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
| public static boolean isShiftOverflow(final byte[] shiftBytes) { | ||
| for (int i = 0; i < shiftBytes.length - 1; i++) { | ||
| if (shiftBytes[i] != 0) return true; | ||
| } |
There was a problem hiding this comment.
can vectorize this zero search
There was a problem hiding this comment.
I tested two implementations that have SIMD support with Arrays.mismatch and Arrays.equals.
/** SIMD candidate: Arrays.equals against a zero buffer. */
private static boolean isShiftOverflow_arraysEquals(final byte[] shiftBytes) {
final int len = shiftBytes.length - 1;
if (len <= 0) return false;
return !Arrays.equals(shiftBytes, 0, len, ZEROS, 0, len);
}
/** SIMD candidate: Arrays.mismatch against a zero buffer. */
private static boolean isShiftOverflow_arraysMismatch(final byte[] shiftBytes) {
final int len = shiftBytes.length - 1;
if (len <= 0) return false;
return Arrays.mismatch(shiftBytes, 0, len, ZEROS, 0, len) >= 0;
}
The results are summarised ant show that loop is the best general-purpose implementation.
Relative to loop (Existing Implementation)
| Case | loop (ns) | arraysEquals (ns) | Δ vs loop | arraysMismatch (ns) | Δ vs loop | Best |
|---|---|---|---|---|---|---|
| ONE_BYTE_NO_OVERFLOW | 3.437 | 3.438 | -0.03% | 3.436 | +0.03% | ≈ identical (noise) |
| TWO_BYTE_NO_OVERFLOW | 4.577 | 5.058 | -10.5% | 5.147 | -12.4% | loop |
| TWO_BYTE_OVERFLOW | 3.806 | 5.110 | -34.3% | 5.125 | -34.6% | loop |
| FULL_32_NO_OVERFLOW | 12.585 | 7.599 | +39.6% | 7.733 | +38.5% | arraysEquals |
| FULL_32_OVERFLOW_EARLY | 3.824 | 3.975 | -3.9% | 4.038 | -5.6% | loop |
| LARGE_OVERFLOW | 3.802 | 5.125 | -34.8% | 5.125 | -34.8% | loop |
| RANDOM | 4.042 | 5.589 | -38.3% | 5.589 | -38.3% | loop |
There was a problem hiding this comment.
As it is improving the worst case, I changed the implementation to use Arrays.equals
| } | ||
|
|
||
| // Only iterate bytes that receive shifted data from the input | ||
| for (int i = 31; i >= shiftBytes; i--) { |
There was a problem hiding this comment.
have not looked into it but can't you move 4 bytes at a time with an int and only take at most 8 iterations instead of 31?
There was a problem hiding this comment.
I don't think it is worth it, we process here at maximum 32 bytes, and most of cases only few bytes from I see from mainnet data.
| if (shiftBits == 0) { | ||
| out[i] = (byte) hi; | ||
| } else { | ||
| final int lo = (src - 1 >= 0) ? (in[src - 1] & 0xFF) : fill; |
There was a problem hiding this comment.
| final int lo = (src - 1 >= 0) ? (in[src - 1] & 0xFF) : fill; | |
| final int lo = (src < 0) ? fill : (in[src - 1] & 0xFF); |
There was a problem hiding this comment.
nit[naming convention]: I also think hi and lo names mean high and low right? So I think if you are doing lo = in[src - 1] it makes more sense to call this value hi instead as the array is in BigEndian
There was a problem hiding this comment.
final int lo = (src - 1 >= 0) ? (in[src - 1] & 0xFF) : fill;
is equivalent to
final int lo = (src >= 1) ? (in[src - 1] & 0xFF) : fill;
Using it the other way would be
final int lo = (src < 1) ? fill : (in[src - 1] & 0xFF);
There was a problem hiding this comment.
In this case, hi and lo are named based on how they are used to build out[i], not on big-endian order.
hi is the byte that is shifted right.
lo is the neighboring byte that provides the carry-in bits.
If we want clearer names, curr/prev or right/left would be better than swapping hi and lo.
I will rename them with curr/prev and use
final int lo = (src < 1) ? fill : (in[src - 1] & 0xFF);
evm/src/main/java/org/hyperledger/besu/evm/operation/SarOperationOptimized.java
Outdated
Show resolved
Hide resolved
…ionOptimized.java Co-authored-by: Luis Pinto <luis.pinto@consensys.net> Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
siladu
left a comment
There was a problem hiding this comment.
Still going but posting comments so far
evm/src/main/java/org/hyperledger/besu/evm/operation/SarOperationOptimized.java
Outdated
Show resolved
Hide resolved
evm/src/main/java/org/hyperledger/besu/evm/operation/Shift256Operations.java
Outdated
Show resolved
Hide resolved
evm/src/main/java/org/hyperledger/besu/evm/operation/ShlOperationOptimized.java
Outdated
Show resolved
Hide resolved
evm/src/main/java/org/hyperledger/besu/evm/operation/ShrOperationOptimized.java
Outdated
Show resolved
Hide resolved
| var gcProfiler = _strCmdArg('gcProfiler') | ||
| if (gcProfiler != null && gcProfiler.toBoolean()) { | ||
| profilers.add('gc') | ||
| } |
CHANGELOG.md
Outdated
| - Add ability to pass a custom tracer to block simulation [#9708](https://github.com/hyperledger/besu/pull/9708) | ||
| - Add support for `4byteTracer` in `debug_trace*` methods to collect function selectors from internal calls via PR [#9642](https://github.com/hyperledger/besu/pull/9642). Thanks to [@JukLee0ira](https://github.com/JukLee0ira). | ||
| - Update assertj to v3.27.7 [#9710](https://github.com/hyperledger/besu/pull/9710) | ||
| - Improve SAR, SHR and SHL opcodes performance [#9796](https://github.com/hyperledger/besu/pull/9796) |
There was a problem hiding this comment.
nit: can add to performance section below
| excludes = _strListCmdArg('excludes', []) | ||
| var asyncProfiler = _strCmdArg('asyncProfiler') | ||
| var asyncProfilerOptions = _strCmdArg('asyncProfilerOptions', 'output=flamegraph') | ||
| zip64.set(true) |
There was a problem hiding this comment.
I fixed this on main too so watch out for conflicts/duplication
|
|
||
| // shift >= 256, push All 0s | ||
| if (isShiftOverflow(shiftBytes)) { | ||
| frame.pushStackItem(ZERO_32); |
There was a problem hiding this comment.
Why is it OK to push ZERO_32 instead of Bytes.EMPTY in this case?
There was a problem hiding this comment.
That's a good question. For me, as shift is >=256, we need to shift all bits and as it is unsigned, it should be 256 0s. I see that the existing version just pushes Bytes.EMPTY and not Bytes32.EMPTY. ZERO_32 is just a reference to Bytes32.EMPTY.
If this doesn't have an impact on SHL, SHR and SAR performance, I can use Bytes.EMPTY.
There was a problem hiding this comment.
For me it wasn't a performance question, rather a correctness question. I was surprised that what looks like a different value on the stack (zero versus empty) doesn't cause consensus issues.
The spec uses U256(0) for this case so I think yours is more correct
https://ethereum.github.io/execution-specs/src/ethereum/forks/osaka/vm/instructions/bitwise.py.html#ethereum.forks.osaka.vm.instructions.bitwise.bitwise_shl:0
There was a problem hiding this comment.
ZERO_32 is just a reference to Bytes32.EMPTY.
I guess you mean Bytes32.ZERO?
After investigating more, it seems the previous Bytes.EMPTY implementation assumes consumers will always leftPad after popping off the stack (resulting in all zeros), which was presumably an optimisation to avoid allocating the 32 byte array or speed up other cases in the consumer perhaps...but if in your case it's a reference to a constant anyway, shouldn't have much impact?
This was the PR that introduced Bytes.EMTPY and a comment that hints at the reason (although class is different) #5331/#discussion_r1166131010
The benchmark results for #5331 are pretty flat for SHL, SHR and SAR.
There was a problem hiding this comment.
I guess you mean Bytes32.ZERO?
The existing implementation pushes Bytes.EMPTY for positive overflow,
Co-authored-by: Simon Dudley <simon.dudley@consensys.net> Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
…ionOptimized.java Co-authored-by: Simon Dudley <simon.dudley@consensys.net> Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
siladu
left a comment
There was a problem hiding this comment.
Great PR. Couple of minor comments but happy to 🚢
| // Bytes at index >= (32 - shiftBytes) are guaranteed zero (already from new byte[32]) | ||
| final int limit = 32 - shiftBytes; | ||
| for (int i = 0; i < limit; i++) { | ||
| final int src = i + shiftBytes; |
There was a problem hiding this comment.
nit: currByteIndex or similar might be clearer name
There was a problem hiding this comment.
The current naming is intentional: src is the source index, curr and next are byte values at that index and the one after. Happy to rename to srcIndex 8b30c59 if it makes it clearer.
| * | ||
| * <p>SAR has additional test cases for negative/positive values to test sign extension behavior. | ||
| */ | ||
| public abstract class AbstractSarOperationBenchmark extends BinaryOperationBenchmark { |
There was a problem hiding this comment.
Is it possible to shift by a negative amount? If not, what code is guarding that?
There was a problem hiding this comment.
It is not possible as it is spec'ed. The first input (shift) of SAR is unsigned.
The line below reads the shift amount as unsigned
final int shift = shiftBytes.length == 0 ? 0 : (shiftBytes[shiftBytes.length - 1] & 0xFF);
| import org.apache.tuweni.bytes.Bytes32; | ||
|
|
||
| /** | ||
| * Property-based tests comparing original SAR operation with the optimized version. |
There was a problem hiding this comment.
What's the plan when we want to remove the old SarOperation?
There was a problem hiding this comment.
Remove old sar means that the optimized version is battle tested, so I would suggest to remove the property based testing or just comparing with raw outputs.

PR description
This PR improves worst cases and average performance on SAR, SHR and SHL opcodes.
It improves also the JMH benchmarking by adding memory allocations and GC profiling by adding the flag
-PgcProfiler=true, ex.Worst-Case Performance (EEST 100M Gas Benchmark)
SHL
SHR
SAR
The PR improves also average performance without an memory allocation overhead, the results can be found here.
Fixed Issue(s)
Thanks for sending a pull request! Have you done the following?
doc-change-requiredlabel to this PR if updates are required.Locally, you can run these tests to catch failures early:
./gradlew spotlessApply./gradlew build./gradlew acceptanceTest./gradlew integrationTest./gradlew ethereum:referenceTests:referenceTests