
make sort32 fast #327

Open
39ali wants to merge 2 commits into sparkjsdev:main from 39ali:sort32-fast

Conversation


@39ali 39ali commented Apr 29, 2026

This tries to improve the performance of sort32; on average it's 30-40% faster.

Things that changed:

  • pass 2 no longer re-reads keys[]; scratch stores a packed u64 of (inverted_key << 32 | original_index). Pass 2 reads the high 16 bits directly from scratch with kv >> 48, making it a sequential scan

  • the histogram and scatter loops are now branchless, which helps LLVM vectorize them

  • manually unrolled the histogram and both scatter passes to 8-wide
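The packed-scratch scheme from the first bullet can be sketched as a standalone function. This is an illustrative version only, not the PR's actual sort.rs code: the names `sort32_descending`, `radix_key`, and `exclusive_prefix_sum` are hypothetical, and the real implementation additionally makes the loops branchless and unrolls them 8-wide.

```rust
// Hypothetical sketch of a two-pass radix-2^16 sort of f32 bit patterns,
// descending, returning the permutation of original indices.
fn sort32_descending(keys: &[f32]) -> Vec<u32> {
    const RADIX: usize = 1 << 16;
    let n = keys.len();

    // Monotone f32 -> u32 mapping, then bitwise inversion so that an
    // ASCENDING radix sort yields DESCENDING float order.
    let radix_key = |f: f32| -> u32 {
        let b = f.to_bits();
        let mono = if b >> 31 == 1 { !b } else { b | 0x8000_0000 };
        !mono
    };

    // Pack (inverted_key << 32 | original_index) once; both passes then
    // only touch packed u64s and never re-read `keys`.
    let mut scratch: Vec<u64> = keys
        .iter()
        .enumerate()
        .map(|(i, &f)| ((radix_key(f) as u64) << 32) | i as u64)
        .collect();
    let mut tmp = vec![0u64; n];

    // Pass 1: stable counting sort on the low 16 key bits (bits 32..48).
    let mut hist = vec![0u32; RADIX];
    for &kv in &scratch {
        hist[((kv >> 32) & 0xFFFF) as usize] += 1;
    }
    exclusive_prefix_sum(&mut hist);
    for &kv in &scratch {
        let d = ((kv >> 32) & 0xFFFF) as usize;
        tmp[hist[d] as usize] = kv;
        hist[d] += 1;
    }

    // Pass 2: the high 16 key bits come straight out of the packed value
    // with kv >> 48, so this is a sequential scan of `tmp`.
    let mut hist = vec![0u32; RADIX];
    for &kv in &tmp {
        hist[(kv >> 48) as usize] += 1;
    }
    exclusive_prefix_sum(&mut hist);
    for &kv in &tmp {
        let d = (kv >> 48) as usize;
        scratch[hist[d] as usize] = kv;
        hist[d] += 1;
    }

    // The low 32 bits of each packed value are the original index.
    scratch.iter().map(|&kv| (kv & 0xFFFF_FFFF) as u32).collect()
}

// Turns per-digit counts into starting offsets for the scatter.
fn exclusive_prefix_sum(buckets: &mut [u32]) {
    let mut sum = 0u32;
    for b in buckets.iter_mut() {
        let count = *b;
        *b = sum;
        sum += count;
    }
}
```

Correctness hinges on each counting-sort pass being stable: pass 2 sorts on the high 16 bits while preserving the low-16-bit order established by pass 1.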

Comment thread on rust/spark-worker-rs/src/sort.rs (Outdated)
/// Two‑pass radix sort (base 2¹⁶) of 32‑bit float bit‑patterns,
/// descending order (largest keys first). Mirrors the JS `sort32Splats`.
#[inline(always)]
unsafe fn prefix_sum_exclusive(buckets: &mut [u32]) -> u32 {
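For context, nothing in a function of this shape requires `unsafe`. A safe sketch (illustrative, not the PR's exact body) that replaces each bucket count with its exclusive running sum and returns the total could look like:

```rust
/// Safe sketch of an exclusive prefix sum over radix bucket counts.
/// Each bucket ends up holding the starting offset for its digit;
/// the return value is the total element count.
fn prefix_sum_exclusive(buckets: &mut [u32]) -> u32 {
    let mut running = 0u32;
    for b in buckets.iter_mut() {
        let count = *b;
        *b = running; // exclusive: offset before adding this bucket
        running += count;
    }
    running
}
```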
Collaborator

Is there a specific reason this is marked unsafe? It compiles just fine without.

Author

I ran many experiments with SIMD, and none of them made it even marginally faster, so I removed it for simplicity's sake but forgot to remove the unsafe. Will clean it up.

Collaborator

mrxz commented Apr 29, 2026

Awesome work. I gave it a try and can confirm that it improves sorting performance: in my limited testing I saw a ~20% reduction in sorting time (~25% faster).

manually unrolled histogram and both scatter passes to 8-wide

Without this change the performance gain seems to be roughly the same, or at least I didn't observe any significant difference. The majority of the benefit seems to come from making it branchless.

Author

39ali commented Apr 29, 2026

@mrxz I squeezed out a bit more performance (~<=1ms) by removing more branches from the hot loops. What you noticed seems about right: the gain will differ from one Wasm engine to another and from one architecture to another (especially cache sizes), so it's hard to give a solid number, but it will still be a bump in performance.

