Conversation
CI summary: ✅ Passed: 5581, 🚧 Skipped
Wow, that's a huge speedup. Regarding Mac M1s and M2s, the absence of speedup is probably explained by the extremely high memory bandwidth of the Macs; I'd have to check the numbers as I don't know them off the top of my head. Regarding PlonK, I'm curious about the cases where it's needed: by picking decimation-in-time or decimation-in-frequency you can choose whether you start or end in the canonical or bit-reversed domain. This repo has 3 variants out of 4 (no bit-reversed to bit-reversed): https://github.com/kwantam/fffft
Yep, for the M1 that was my conclusion too; plus the cache sizes and cache lines are significantly bigger. For PlonK, most of the time, yes, we avoid bit reversal altogether, but in one or two spots we need to convert a polynomial back to canonical (regular) form. I didn't bench the impact yet (probably no more than 5% of total PlonK prover time), but it did stand out in profiling traces.
yelhousni
left a comment
LGTM 👍
I still don't get the arm64 peculiarities and the optimal vs. practical choices for the tile size, but I get the overall logic.
```go
func bitReverseCobra(v []fr.Element) {
	switch len(v) {
```
are these empirical results?
In a sense; these methods are just generated with constant-sized arrays and offsets for specific sizes.
Below a certain threshold the naive version performs well, and above that threshold (2**27) I just didn't bother generating the methods, since I don't think it's very common to bit-reverse vectors of 270M+ elements (but maybe...)
Description
The naive bit-reversal permutation implementation does not scale very well due to its memory access patterns and cache-associativity issues (since we access elements at strides that are powers of 2).
For the PlonK prover, which needs to perform a couple of these permutations on potentially large domains (> 2**25), this is very noticeable.
This PR introduces a "COBRA" bit-reverse permutation, derived from:
- Larry Carter and Kang Su Gatlin, 1998: https://csaws.cs.technion.ac.il/~itai/Courses/Cache/bit.pdf
- "Practically efficient methods for performing bit-reversed permutation in C++11 on the x86-64 architecture", Knauth, Adas, Whitfield, Wang, Ickler, Conrad, Serang, 2017: https://arxiv.org/pdf/1708.01873.pdf
- the Nim implementation in Constantine by Mamy Ratsimbazafy (@mratsim): https://github.com/mratsim/constantine/blob/d51699248db04e29c7b1ad97e0bafa1499db00b5/constantine/math/polynomials/fft.nim#L205
See code for details and the benchmark section of this PR for numbers; but in practice, this is efficient for permutations over slices of more than ~2M field elements, on x86 architectures.
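To make the idea concrete, here is a hedged, simplified sketch of the COBRA scheme from the Carter–Gatlin paper: split the log2(n)-bit index into a high part `a`, a middle part `b`, and a low part `c` (each of the outer parts `q` bits wide), and route each `2^q × 2^q` tile through a small cache-resident buffer so that both the reads and the writes are mostly sequential. This is an out-of-place variant on `uint64` for clarity; the PR's implementation is in-place, operates on `fr.Element`, and uses generated constant tile sizes, and the names and the `q` parameter below are illustrative.

```go
package main

import (
	"fmt"
	"math/bits"
)

// revBits reverses the low n bits of x.
func revBits(x, n uint64) uint64 {
	return bits.Reverse64(x) >> (64 - n)
}

// bitReverseCobraOutOfPlace writes dst[revBits(i, logN)] = src[i] for all i,
// with len(src) == len(dst) == 2^logN. q is the tile exponent; requires
// logN >= 2*q so the index splits as (a: q bits | b: logN-2q bits | c: q bits).
func bitReverseCobraOutOfPlace(dst, src []uint64, logN, q uint64) {
	bLen := logN - 2*q
	tile := make([]uint64, 1<<(2*q)) // small buffer meant to stay in cache
	for b := uint64(0); b < 1<<bLen; b++ {
		bRev := revBits(b, bLen)
		// gather: copy one 2^q x 2^q tile into the buffer,
		// reversing the high q bits (a) on the way in
		for a := uint64(0); a < 1<<q; a++ {
			aRev := revBits(a, q)
			for c := uint64(0); c < 1<<q; c++ {
				tile[aRev<<q|c] = src[a<<(logN-q)|b<<q|c]
			}
		}
		// scatter: write the buffer out, reversing the low q bits (c),
		// so dst is filled at index (rev(c) | rev(b) | rev(a))
		for c := uint64(0); c < 1<<q; c++ {
			cRev := revBits(c, q)
			for aRev := uint64(0); aRev < 1<<q; aRev++ {
				dst[cRev<<(logN-q)|bRev<<q|aRev] = tile[aRev<<q|c]
			}
		}
	}
}

func main() {
	logN, q := uint64(10), uint64(3)
	n := uint64(1) << logN
	src := make([]uint64, n)
	for i := range src {
		src[i] = uint64(i)
	}
	dst := make([]uint64, n)
	bitReverseCobraOutOfPlace(dst, src, logN, q)
	// verify against the definition of the bit-reversal permutation
	ok := true
	for i := uint64(0); i < n; i++ {
		if dst[revBits(i, logN)] != src[i] {
			ok = false
		}
	}
	fmt.Println("correct:", ok) // prints "correct: true"
}
```

The point of the buffer is that the strided accesses are confined to `tile`, which fits in L1/L2, while `src` and `dst` are touched in runs of 2^q consecutive elements; picking the tile size (the "optimal vs. practical" choice mentioned in review) trades buffer footprint against run length.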
Type of change
How has this been tested?
`bitreverse_test.go` in each curve-specific package.

How has this been benchmarked?

On the `arm64` target we still use the "naive" method. Results on `hpc6a` and a consumer-grade `amd` chip are good; up to 70% faster for large sizes.

On the `hpc6a`, for sizes going up to 2^28: (benchmark chart omitted)

Similarly on my amd-based desktop: (benchmark chart omitted)

On the M1 macbook: (benchmark chart omitted)
Checklist:
`golangci-lint` does not output errors locally