make sort32 fast #327
/// Two‑pass radix sort (base 2¹⁶) of 32‑bit float bit‑patterns,
/// descending order (largest keys first). Mirrors the JS `sort32Splats`.
#[inline(always)]
unsafe fn prefix_sum_exclusive(buckets: &mut [u32]) -> u32 {
Is there a specific reason this is marked unsafe? It compiles just fine without.
I ran many experiments with SIMD, which didn't make it noticeably faster, so I removed it for simplicity's sake but forgot to remove the `unsafe`. Will clean it up.
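For reference, a safe version of such a helper might look like the sketch below. Only the signature appears in the diff, so the body is an assumption, as is the guess that the returned `u32` is the total count across all buckets.

```rust
/// Hypothetical safe sketch of an exclusive prefix sum over bucket counts.
/// Each bucket is replaced by the sum of all counts before it.
/// Assumption: the return value is the total of all counts.
fn prefix_sum_exclusive(buckets: &mut [u32]) -> u32 {
    let mut running = 0u32;
    for bucket in buckets.iter_mut() {
        let count = *bucket;
        *bucket = running; // start offset of this bucket in the output
        running += count;
    }
    running
}
```

Nothing here needs raw pointers or intrinsics, so plain iteration compiles without `unsafe`.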
Awesome work, gave it a try and can confirm that it improves sorting performance. In my limited testing I saw ~20% reduction in sorting time (~25% faster).
Without this change the performance gain seems to be roughly the same, or at least I didn't observe any significant difference. The majority of the benefit seems to come from making it branchless.
@mrxz I squeezed out a bit more performance (~<=1ms) by removing more branches from the hot loops. What you noticed seems about right: the gain will differ from one wasm engine to another and from one architecture to another (especially with cache sizes), so it's hard to give a solid number, but it'll still be a bump in performance.
Try to improve the performance of sort32; on average it's 30-40% faster.
Things that changed:
- pass 2 no longer re-reads `keys[]`; `scratch` stores a packed u64 of `(inverted_key << 32 | original_index)`. Pass 2 reads the high 16 bits directly from scratch with `kv >> 48`, making it a sequential scan
- histogram and scatter are now branchless to help LLVM vectorize the loop
- manually unrolled histogram and both scatter passes to 8-wide
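
The packed-scratch idea described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: the names `sort32_desc` and `exclusive_prefix_sum` are hypothetical, the 8-wide unrolling is omitted, and it assumes keys are non-negative floats (so their bit patterns order like unsigned integers) and that there are fewer than 2³² of them.

```rust
/// Hypothetical helper: turns bucket counts into start offsets.
fn exclusive_prefix_sum(buckets: &mut [u32]) {
    let mut running = 0u32;
    for bucket in buckets.iter_mut() {
        let count = *bucket;
        *bucket = running;
        running += count;
    }
}

/// Sketch: returns indices into `keys`, largest key first.
/// Assumes non-negative, non-NaN keys and keys.len() < 2^32.
pub fn sort32_desc(keys: &[f32]) -> Vec<u32> {
    const RADIX: usize = 1 << 16;
    let mut lo = vec![0u32; RADIX];
    let mut hi = vec![0u32; RADIX];

    // One histogram pass for both digits; no data-dependent branches.
    // Inverting the bits makes an ascending sort yield descending keys.
    for k in keys {
        let inv = !k.to_bits();
        lo[(inv & 0xFFFF) as usize] += 1;
        hi[(inv >> 16) as usize] += 1;
    }
    exclusive_prefix_sum(&mut lo);
    exclusive_prefix_sum(&mut hi);

    // Pass 1: scatter by the low 16 bits, storing key and index packed
    // together so pass 2 never has to go back to `keys`.
    let mut scratch = vec![0u64; keys.len()];
    for (i, k) in keys.iter().enumerate() {
        let inv = !k.to_bits();
        let slot = &mut lo[(inv & 0xFFFF) as usize];
        scratch[*slot as usize] = ((inv as u64) << 32) | i as u64;
        *slot += 1;
    }

    // Pass 2: sequential scan of `scratch`; the high digit is `kv >> 48`
    // and the original index is the low 32 bits.
    let mut order = vec![0u32; keys.len()];
    for &kv in &scratch {
        let slot = &mut hi[(kv >> 48) as usize];
        order[*slot as usize] = kv as u32;
        *slot += 1;
    }
    order
}
```

Because both passes are stable, equal keys keep their original relative order. Packing the key next to the index doubles the scratch size compared with storing indices alone, which is the trade-off for replacing a random read of `keys[]` in pass 2 with a linear scan.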