
Improve shift opcodes performance (SAR, SHR and SHL)#9796

Merged
ahamlat merged 34 commits into besu-eth:main from ahamlat:improve-shift-opcodes
Feb 19, 2026

Conversation

@ahamlat
Contributor

@ahamlat ahamlat commented Feb 12, 2026

PR description

This PR improves worst-case and average performance of the SAR, SHR and SHL opcodes.
It also improves the JMH benchmarking by adding memory allocation and GC profiling, enabled with the flag -PgcProfiler=true, e.g.:

./gradlew :ethereum:core:jmh -Pincludes='Sar,Shl,Shr' -PgcProfiler=true --rerun-tasks

Worst-Case Performance (EEST 100M Gas Benchmark)

SHL

| Metric | Before this PR | With this PR | Improvement |
|---|---|---|---|
| Block import time | 1557.40 ms | 291.69 ms | ~5.34× faster |
| Throughput | 64.21 MGas/s | 342.83 MGas/s | +433.9% |

SHR

| Metric | Before this PR | With this PR | Improvement |
|---|---|---|---|
| Block import time | 1479.99 ms | 291.16 ms | ~5.08× faster |
| Throughput | 67.57 MGas/s | 343.46 MGas/s | +408.3% |

SAR

| Metric | Before this PR | With this PR | Improvement |
|---|---|---|---|
| Block import time | 2246.50 ms | 462.20 ms | ~4.86× faster |
| Throughput | 44.51 MGas/s | 216.36 MGas/s | +386.1% |

The PR also improves average performance without any memory allocation overhead; the results can be found here.

Fixed Issue(s)

Thanks for sending a pull request! Have you done the following?

  • Checked out our contribution guidelines?
  • Considered documentation and added the doc-change-required label to this PR if updates are required.
  • Considered the changelog and included an update if required.
  • For database changes (e.g. KeyValueSegmentIdentifier) considered compatibility and performed forwards and backwards compatibility tests

Locally, you can run these tests to catch failures early:

  • spotless: ./gradlew spotlessApply
  • unit tests: ./gradlew build
  • acceptance tests: ./gradlew acceptanceTest
  • integration tests: ./gradlew integrationTest
  • reference tests: ./gradlew ethereum:referenceTests:referenceTests
  • hive tests: Engine or other RPCs modified?

Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
@macfarla macfarla moved this to In Review in Performance Feb 13, 2026
aPool = new Bytes[SAMPLE_SIZE]; // shift amount (pushed second, popped first)
bPool = new Bytes[SAMPLE_SIZE]; // value (pushed first, popped second)

final ThreadLocalRandom random = ThreadLocalRandom.current();
Contributor

why do we keep using ThreadLocalRandom ?

Contributor Author

It is more thread-safe. It doesn't help much in this case, but it is a general best practice.
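
For illustration, a minimal sketch of how operand pools like the ones in the snippet above might be filled with ThreadLocalRandom. It uses plain byte arrays instead of Tuweni Bytes, and the names OperandPools, fillPools and the SAMPLE_SIZE value are assumptions made for this example, not the PR's benchmark code.

```java
import java.util.concurrent.ThreadLocalRandom;

class OperandPools {
  static final int SAMPLE_SIZE = 1024; // hypothetical pool size

  /** Returns {shiftPool, valuePool}: one-byte shift amounts and random 32-byte values. */
  static byte[][][] fillPools() {
    final ThreadLocalRandom random = ThreadLocalRandom.current();
    final byte[][] shiftPool = new byte[SAMPLE_SIZE][];
    final byte[][] valuePool = new byte[SAMPLE_SIZE][];
    for (int i = 0; i < SAMPLE_SIZE; i++) {
      // Shift amount in 0..255, stored as a single byte.
      shiftPool[i] = new byte[] {(byte) random.nextInt(256)};
      // Random 32-byte value; roughly half will have the sign bit set.
      final byte[] value = new byte[32];
      random.nextBytes(value);
      valuePool[i] = value;
    }
    return new byte[][][] {shiftPool, valuePool};
  }
}
```

ThreadLocalRandom.current() returns a generator confined to the calling thread, so benchmark threads do not contend on a shared seed.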

@lu-pinto
Contributor

In the benchmarks I noticed that only the RANDOM case has randomized data. Regarding other cases, are you sure there's no data cache going on with the benchmarks? Can't we randomize the number being shifted somehow?

final Bytes shiftAmount = frame.popStackItem();
final Bytes value = frame.popStackItem();
byte[] valueBytes = value.toArrayUnsafe();
if (Arrays.equals(valueBytes, ALL_ONES_BYTES)) {
Contributor

Is this worth it for just this single case? This evaluation might be fast but the drawback is that you have to evaluate it for every single input case.

Contributor Author

Yes, this is a good fast path that can happen. The benchmarks showed that it doesn't affect other cases much.

Contributor

Also why do you prefer all ones to all zeros?
Please measure worst case performance without this evaluation.

Contributor Author

I would say -1 (all ones), is a more common pattern in EVM, even if 0 is used a lot as well.

Contributor Author

But I can double check the benchmarks on the overhead, and remove it if the overhead is noticeable. I can remove it, the first implementation didn't have this fast path.

Contributor Author

There is a 1/2 ns overhead from the shortcut:
[image: benchmark results]
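
As a rough sketch of the fast path under discussion: a 32-byte word of all ones is -1 in two's complement, and an arithmetic right shift of -1 by any amount is still -1, so the shift loop can be skipped. The class and method names here are assumptions for the example; only ALL_ONES_BYTES and the Arrays.equals check appear in the reviewed snippet.

```java
import java.util.Arrays;

class AllOnesFastPath {
  static final byte[] ALL_ONES_BYTES = new byte[32];

  static {
    Arrays.fill(ALL_ONES_BYTES, (byte) 0xFF);
  }

  /** True when the 32-byte word is -1, whose arithmetic right shift is always -1. */
  static boolean isAllOnes(final byte[] valueBytes) {
    return Arrays.equals(valueBytes, ALL_ONES_BYTES);
  }
}
```

The cost is one 32-byte comparison on every call, which is the overhead the thread above is measuring.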


final int shiftBytes = shift >>> 3; // /8
final int shiftBits = shift & 7; // %8
final int fill = negative ? 0xFF : 0x00;
Contributor

if all you need the negative flag for is to figure out the fill, you can just do int fill = in[0] >> 31

Contributor Author

That doesn't work as is, because in[0] is a single byte. Something like this might work in this case:

int fill = ((in[0] >> 7) & 0xFF);
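
To make the difference concrete, here is a small sketch of both expressions. In Java a byte operand is promoted to a sign-extended int before shifting, so the first form yields 0 or -1 (0xFFFFFFFF) while the masked form yields 0x00 or 0xFF. Whether that matters depends on how fill is combined with other bits later; the class and method names are hypothetical.

```java
class SignFill {
  /** Unmasked form: the byte is sign-extended to int, so the result is 0 or -1. */
  static int fillWide(final byte msb) {
    return msb >> 31;
  }

  /** Masked form: 0x00 for a non-negative byte, 0xFF for a negative one. */
  static int fillByte(final byte msb) {
    return (msb >> 7) & 0xFF;
  }
}
```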

@ahamlat
Contributor Author

ahamlat commented Feb 13, 2026

> In the benchmarks I noticed that only the RANDOM case has randomized data. Regarding other cases, are you sure there's no data cache going on with the benchmarks? Can't we randomize the number being shifted somehow?

Good point. My initial idea was to have stable results that don't vary much, so that different versions can be compared.
I will add more random values (negative and positive) with fixed shifts, to have a mixed version.
EDIT: I updated the benchmarks with this commit 78126d8.
I also generated the new benchmark numbers; they show consistent improvements in most cases: https://gist.github.com/ahamlat/afc4c359844d6bfca9b47e5285b8f2b0

Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
public static boolean isShiftOverflow(final byte[] shiftBytes) {
for (int i = 0; i < shiftBytes.length - 1; i++) {
if (shiftBytes[i] != 0) return true;
}
Contributor

can vectorize this zero search

Contributor Author

I tested two implementations that have SIMD support with Arrays.mismatch and Arrays.equals.

/** SIMD candidate: Arrays.equals against a zero buffer. */
  private static boolean isShiftOverflow_arraysEquals(final byte[] shiftBytes) {
    final int len = shiftBytes.length - 1;
    if (len <= 0) return false;
    return !Arrays.equals(shiftBytes, 0, len, ZEROS, 0, len);
  }

  /** SIMD candidate: Arrays.mismatch against a zero buffer. */
  private static boolean isShiftOverflow_arraysMismatch(final byte[] shiftBytes) {
    final int len = shiftBytes.length - 1;
    if (len <= 0) return false;
    return Arrays.mismatch(shiftBytes, 0, len, ZEROS, 0, len) >= 0;
  }

The results are summarised below and show that the loop is the best general-purpose implementation (a negative Δ means slower than the loop).

Relative to loop (Existing Implementation)

| Case | loop (ns) | arraysEquals (ns) | Δ vs loop | arraysMismatch (ns) | Δ vs loop | Best |
|---|---|---|---|---|---|---|
| ONE_BYTE_NO_OVERFLOW | 3.437 | 3.438 | -0.03% | 3.436 | +0.03% | ≈ identical (noise) |
| TWO_BYTE_NO_OVERFLOW | 4.577 | 5.058 | -10.5% | 5.147 | -12.4% | loop |
| TWO_BYTE_OVERFLOW | 3.806 | 5.110 | -34.3% | 5.125 | -34.6% | loop |
| FULL_32_NO_OVERFLOW | 12.585 | 7.599 | +39.6% | 7.733 | +38.5% | arraysEquals |
| FULL_32_OVERFLOW_EARLY | 3.824 | 3.975 | -3.9% | 4.038 | -5.6% | loop |
| LARGE_OVERFLOW | 3.802 | 5.125 | -34.8% | 5.125 | -34.8% | loop |
| RANDOM | 4.042 | 5.589 | -38.3% | 5.589 | -38.3% | loop |

Contributor Author

As it is improving the worst case, I changed the implementation to use Arrays.equals
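
For reference, a self-contained sketch of the two variants side by side, so their equivalence can be checked. Both report an overflow (shift >= 256) when any byte other than the last is non-zero. ZEROS is assumed here to be a 32-byte zero buffer, and the class name is made up for the example; the input is assumed to be at most 33 bytes long so the range comparison stays in bounds.

```java
import java.util.Arrays;

class ShiftOverflow {
  private static final byte[] ZEROS = new byte[32]; // assumed zero buffer

  /** Loop variant: any non-zero byte before the last one means shift >= 256. */
  static boolean loop(final byte[] shiftBytes) {
    for (int i = 0; i < shiftBytes.length - 1; i++) {
      if (shiftBytes[i] != 0) return true;
    }
    return false;
  }

  /** Arrays.equals variant: compares the leading bytes against the zero buffer. */
  static boolean arraysEquals(final byte[] shiftBytes) {
    final int len = shiftBytes.length - 1;
    if (len <= 0) return false;
    return !Arrays.equals(shiftBytes, 0, len, ZEROS, 0, len);
  }
}
```

The range overload of Arrays.equals is available from Java 9 onwards.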

}

// Only iterate bytes that receive shifted data from the input
for (int i = 31; i >= shiftBytes; i--) {
Contributor

@lu-pinto lu-pinto Feb 13, 2026

have not looked into it but can't you move 4 bytes at a time with an int and only take at most 8 iterations instead of 31?

Contributor Author

I don't think it is worth it: we process at most 32 bytes here, and in most cases only a few bytes, from what I see in mainnet data.

if (shiftBits == 0) {
out[i] = (byte) hi;
} else {
final int lo = (src - 1 >= 0) ? (in[src - 1] & 0xFF) : fill;
Contributor

Suggested change
- final int lo = (src - 1 >= 0) ? (in[src - 1] & 0xFF) : fill;
+ final int lo = (src < 0) ? fill : (in[src - 1] & 0xFF);

Contributor

@lu-pinto lu-pinto Feb 13, 2026

nit[naming convention]: I also think hi and lo names mean high and low right? So I think if you are doing lo = in[src - 1] it makes more sense to call this value hi instead as the array is in BigEndian

Contributor Author

final int lo = (src - 1 >= 0) ? (in[src - 1] & 0xFF) : fill;

is equivalent to

final int lo = (src >= 1) ? (in[src - 1] & 0xFF) : fill;

Using it the other way would be

final int lo = (src < 1) ? fill : (in[src - 1] & 0xFF);

Contributor Author

In this case, hi and lo are named based on how they are used to build out[i], not on big-endian order.
hi is the byte that is shifted right.
lo is the neighboring byte that provides the carry-in bits.
If we want clearer names, curr/prev or right/left would be better than swapping hi and lo.
I will rename them with curr/prev and use

final int lo = (src < 1) ? fill : (in[src - 1] & 0xFF);
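
Putting the fragments from this thread together, a byte-wise arithmetic right shift on a 32-byte big-endian word could look like the sketch below. It is assembled from the quoted snippets plus assumptions (the class name, the leading Arrays.fill, and the curr/prev naming proposed above) and is not the merged implementation. The shift is assumed to be already checked to lie in 0..255.

```java
import java.util.Arrays;

class SarSketch {
  /** Arithmetic right shift of a 32-byte big-endian word by shift bits (0..255). */
  static byte[] sar(final byte[] in, final int shift) {
    final boolean negative = (in[0] & 0x80) != 0;
    final int fill = negative ? 0xFF : 0x00;
    final int shiftBytes = shift >>> 3; // whole bytes to move
    final int shiftBits = shift & 7; // remaining bits within a byte
    final byte[] out = new byte[32];
    // Leading bytes receive no input data, only the sign fill.
    Arrays.fill(out, 0, shiftBytes, (byte) fill);
    // Only iterate bytes that receive shifted data from the input.
    for (int i = 31; i >= shiftBytes; i--) {
      final int src = i - shiftBytes;
      final int curr = in[src] & 0xFF;
      if (shiftBits == 0) {
        out[i] = (byte) curr;
      } else {
        // prev supplies the carry-in bits; before index 0 it is the sign fill.
        final int prev = (src < 1) ? fill : (in[src - 1] & 0xFF);
        out[i] = (byte) ((curr >>> shiftBits) | (prev << (8 - shiftBits)));
      }
    }
    return out;
  }
}
```

Note that the guard has to be src < 1 rather than src < 0, since src == 0 would otherwise read in[-1].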

ahamlat and others added 6 commits February 13, 2026 21:27
…ionOptimized.java

Co-authored-by: Luis Pinto <luis.pinto@consensys.net>
Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Contributor

@siladu siladu left a comment

Still going but posting comments so far

Comment on lines +670 to +673
var gcProfiler = _strCmdArg('gcProfiler')
if (gcProfiler != null && gcProfiler.toBoolean()) {
profilers.add('gc')
}
Contributor

nice addition 👍

CHANGELOG.md Outdated
- Add ability to pass a custom tracer to block simulation [#9708](https://github.com/hyperledger/besu/pull/9708)
- Add support for `4byteTracer` in `debug_trace*` methods to collect function selectors from internal calls via PR [#9642](https://github.com/hyperledger/besu/pull/9642). Thanks to [@JukLee0ira](https://github.com/JukLee0ira).
- Update assertj to v3.27.7 [#9710](https://github.com/hyperledger/besu/pull/9710)
- Improve SAR, SHR and SHL opcodes performance [#9796](https://github.com/hyperledger/besu/pull/9796)
Contributor

nit: can add to performance section below

excludes = _strListCmdArg('excludes', [])
var asyncProfiler = _strCmdArg('asyncProfiler')
var asyncProfilerOptions = _strCmdArg('asyncProfilerOptions', 'output=flamegraph')
zip64.set(true)
Contributor

I fixed this on main too so watch out for conflicts/duplication


// shift >= 256, push All 0s
if (isShiftOverflow(shiftBytes)) {
frame.pushStackItem(ZERO_32);
Contributor

Why is it OK to push ZERO_32 instead of Bytes.EMPTY in this case?

Contributor Author

That's a good question. For me, as shift is >=256, we need to shift all bits and as it is unsigned, it should be 256 0s. I see that the existing version just pushes Bytes.EMPTY and not Bytes32.EMPTY. ZERO_32 is just a reference to Bytes32.EMPTY.
If this doesn't have an impact on SHL, SHR and SAR performance, I can use Bytes.EMPTY.

Contributor

@siladu siladu Feb 17, 2026

For me it wasn't a performance question, rather a correctness question. I was surprised that what looks like a different value on the stack (zero versus empty) doesn't cause consensus issues.

The spec uses U256(0) for this case so I think yours is more correct
https://ethereum.github.io/execution-specs/src/ethereum/forks/osaka/vm/instructions/bitwise.py.html#ethereum.forks.osaka.vm.instructions.bitwise.bitwise_shl:0

Contributor

@siladu siladu Feb 18, 2026

> ZERO_32 is just a reference to Bytes32.EMPTY.

I guess you mean Bytes32.ZERO?

After investigating more, it seems the previous Bytes.EMPTY implementation assumes consumers will always leftPad after popping off the stack (resulting in all zeros), which was presumably an optimisation to avoid allocating the 32 byte array or speed up other cases in the consumer perhaps...but if in your case it's a reference to a constant anyway, shouldn't have much impact?

This was the PR that introduced Bytes.EMPTY and a comment that hints at the reason (although class is different) #5331/#discussion_r1166131010

The benchmark results for #5331 are pretty flat for SHL, SHR and SAR.
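
The left-padding argument can be illustrated without Tuweni: if every consumer left-pads a popped item to 32 bytes, an empty value and a 32-byte zero word become indistinguishable. The helper below is a hypothetical stand-in for that padding step, not the Tuweni or Besu API.

```java
class LeftPadSketch {
  /** Left-pads a big-endian value of at most 32 bytes with zeros to exactly 32 bytes. */
  static byte[] leftPad32(final byte[] value) {
    final byte[] out = new byte[32];
    System.arraycopy(value, 0, out, 32 - value.length, value.length);
    return out;
  }
}
```

Under that assumption, leftPad32 of an empty array and of a 32-byte zero array give the same 32 zero bytes, which would explain why the two stack representations do not diverge in consensus.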

Contributor Author

> I guess you mean Bytes32.ZERO?

The existing implementation pushes Bytes.EMPTY for positive overflow,

ahamlat and others added 6 commits February 17, 2026 15:01
Co-authored-by: Simon Dudley <simon.dudley@consensys.net>
Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
…ionOptimized.java

Co-authored-by: Simon Dudley <simon.dudley@consensys.net>
Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
Contributor

@siladu siladu left a comment

Great PR. Couple of minor comments but happy to 🚢

// Bytes at index >= (32 - shiftBytes) are guaranteed zero (already from new byte[32])
final int limit = 32 - shiftBytes;
for (int i = 0; i < limit; i++) {
final int src = i + shiftBytes;
Contributor

nit: currByteIndex or similar might be clearer name

Contributor Author

The current naming is intentional: src is the source index, curr and next are byte values at that index and the one after. Happy to rename to srcIndex 8b30c59 if it makes it clearer.
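
To show how src, curr and next fit together, here is a sketch of a byte-wise left shift built around the loop quoted above. The loop bounds and the src computation come from the snippet; the rest (class name, handling of the last byte) is an assumption for illustration. The shift is assumed to be already checked to lie in 0..255.

```java
class ShlSketch {
  /** Left shift of a 32-byte big-endian word by shift bits (0..255). */
  static byte[] shl(final byte[] in, final int shift) {
    final int shiftBytes = shift >>> 3; // whole bytes to move
    final int shiftBits = shift & 7; // remaining bits within a byte
    final byte[] out = new byte[32];
    // Bytes at index >= (32 - shiftBytes) stay zero from new byte[32].
    final int limit = 32 - shiftBytes;
    for (int i = 0; i < limit; i++) {
      final int src = i + shiftBytes; // source index
      final int curr = in[src] & 0xFF; // byte value at the source index
      if (shiftBits == 0) {
        out[i] = (byte) curr;
      } else {
        // next is the byte after src; past the end it contributes zero bits.
        final int next = (src + 1 < 32) ? (in[src + 1] & 0xFF) : 0;
        out[i] = (byte) ((curr << shiftBits) | (next >>> (8 - shiftBits)));
      }
    }
    return out;
  }
}
```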

*
* <p>SAR has additional test cases for negative/positive values to test sign extension behavior.
*/
public abstract class AbstractSarOperationBenchmark extends BinaryOperationBenchmark {
Contributor

Is it possible to shift by a negative amount? If not, what code is guarding that?

Contributor Author

It is not possible as it is spec'ed. The first input (shift) of SAR is unsigned.
The line below reads the shift amount as unsigned

    final int shift = shiftBytes.length == 0 ? 0 : (shiftBytes[shiftBytes.length - 1] & 0xFF);
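
A small sketch of why the mask matters: Java bytes are signed, so without & 0xFF a low byte of 0xFF would be read as -1 instead of 255. The class and method names are hypothetical; the expression is the one quoted above, and it assumes the overflow check on the higher bytes has already passed.

```java
class ShiftAmount {
  /** Reads the last byte as an unsigned value in 0..255; an empty array means 0. */
  static int lowByteUnsigned(final byte[] shiftBytes) {
    return shiftBytes.length == 0 ? 0 : (shiftBytes[shiftBytes.length - 1] & 0xFF);
  }
}
```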

import org.apache.tuweni.bytes.Bytes32;

/**
* Property-based tests comparing original SAR operation with the optimized version.
Contributor

What's the plan when we want to remove the old SarOperation?

Contributor Author

Removing the old SarOperation would mean the optimized version is battle-tested, so at that point I would suggest removing the property-based tests, or just comparing against raw expected outputs.
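
One possible shape for such a comparison once the old class is gone is a differential check against an independent reference, for example one based on BigInteger. The sketch below is an assumption about how that could be wired up, with hypothetical names throughout; it is not the PR's property-based test.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.Random;
import java.util.function.BiFunction;

class SarDifferential {
  /** Reference SAR on a 32-byte big-endian word, computed with BigInteger. */
  static byte[] reference(final byte[] in, final int shift) {
    // BigInteger(byte[]) is signed two's complement; shiftRight is arithmetic.
    final BigInteger shifted = new BigInteger(in).shiftRight(shift);
    final byte[] raw = shifted.toByteArray(); // at most 32 bytes for a 32-byte input
    final byte[] out = new byte[32];
    Arrays.fill(out, (byte) (shifted.signum() < 0 ? 0xFF : 0x00)); // sign-extend
    System.arraycopy(raw, 0, out, 32 - raw.length, raw.length);
    return out;
  }

  /** Returns the first trial index where the two implementations disagree, or -1. */
  static int firstMismatch(
      final BiFunction<byte[], Integer, byte[]> a,
      final BiFunction<byte[], Integer, byte[]> b,
      final long seed,
      final int trials) {
    final Random random = new Random(seed); // seeded so failures are reproducible
    for (int t = 0; t < trials; t++) {
      final byte[] value = new byte[32];
      random.nextBytes(value);
      final int shift = random.nextInt(256);
      if (!Arrays.equals(a.apply(value, shift), b.apply(value, shift))) return t;
    }
    return -1;
  }
}
```

A candidate implementation would be passed as one argument and SarDifferential::reference as the other.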

Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
@ahamlat ahamlat enabled auto-merge (squash) February 19, 2026 11:40
@ahamlat ahamlat disabled auto-merge February 19, 2026 11:55
@ahamlat ahamlat merged commit 1876cd7 into besu-eth:main Feb 19, 2026
46 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in Performance Feb 19, 2026
@siladu siladu mentioned this pull request Feb 25, 2026
10 tasks
5 participants