Open
added 3 commits on April 3, 2026 at 10:14
…g config

Add a separate `atomic_indexing` config field that controls which indexing strategy is used for atomic operations (e.g., `hl.atomic_add`), independent of the `indexing` field used for loads/stores. When set to `"tensor_descriptor"`, atomic writes use TMA-based `desc.atomic_add()`, which bypasses L1 cache and reduces contention, yielding 1.2-1.8x speedup on Blackwell GPUs for split-K patterns.

Key changes:

- Add `codegen_atomic_add()` to all `IndexingStrategy` subclasses
- `TensorDescriptorIndexingStrategy` falls back to pointer when: the return value is consumed (desc returns void), `sem` is non-relaxed (unsupported by the descriptor API), or the tensor doesn't meet descriptor requirements
- `BlockPtrIndexingStrategy` always falls back to pointer (no atomic support)
- Separate `atomic_op_index` counter and `get_atomic_indexing_strategy()` in `DeviceFunction` to keep atomic config independent from load/store config
- Register `atomic_indexing` as an autotunable in `ConfigSpec`
- Split `_count_device_loads_and_stores` back to loads/stores only
- Add `_count_device_atomics` and `_register_atomic_tunables` as separate functions
- Add `valid_atomic_indexing_types()` that only allows pointer and tensor_descriptor (no block_ptr, since Triton has no block_ptr atomics)
- Add tests for per-op `atomic_indexing` list and block_ptr fallback
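The fallback conditions listed above can be sketched as a single predicate. This is an illustrative sketch only; the function name `should_fall_back_to_pointer` and its parameters are assumptions, not Helion's actual internals.

```python
# Ops that the tensor-descriptor atomic path can express; xchg and cas
# have no descriptor form and must use the pointer path.
DESC_SUPPORTED_OPS = {"add", "and", "max", "min", "or", "xor"}

def should_fall_back_to_pointer(op, sem, value_is_consumed, tensor_ok):
    """Return True when a tensor_descriptor atomic must fall back to pointer.

    Hypothetical helper mirroring the checks described in the commit message.
    """
    if op not in DESC_SUPPORTED_OPS:
        return True  # unsupported reduction op (e.g. xchg, cas)
    if sem != "relaxed":
        return True  # descriptor API only supports relaxed semantics
    if value_is_consumed:
        return True  # desc.atomic_add() returns void, so the result is unusable
    if not tensor_ok:
        return True  # tensor fails descriptor requirements (layout, dtype, etc.)
    return False
```

Each check is independent, so any one failing condition silently routes the op back through the pointer strategy.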
`block_ptr` is never a valid `atomic_indexing` choice, so this fallback method is unreachable.
Refactor the atomic codegen path so all atomic operations (add, and, or, xor, max, min, xchg) route through a generic `codegen_atomic(op, ...)` method on `IndexingStrategy`. This enables tensor_descriptor-based TMA atomics for all supported reduction ops (add, and, max, min, or, xor), with automatic fallback to pointer for unsupported ops (xchg, cas), return-value-consuming calls, and non-relaxed memory semantics.
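The routing described in this refactor can be sketched as a small class hierarchy. The class and method bodies below are illustrative assumptions (Helion's real strategies emit Triton AST, not strings); only the op set and fallback conditions come from the PR.

```python
class PointerIndexingStrategy:
    """Baseline strategy: every atomic maps onto a tl.atomic_* call."""

    def codegen_atomic(self, op, ptr, value, sem="relaxed", value_used=False):
        # Hypothetical string-based codegen for illustration.
        return f"tl.atomic_{op}({ptr}, {value}, sem='{sem}')"

class TensorDescriptorIndexingStrategy(PointerIndexingStrategy):
    """TMA-based strategy: uses desc.atomic_* where the descriptor API allows."""

    SUPPORTED = {"add", "and", "max", "min", "or", "xor"}

    def codegen_atomic(self, op, ptr, value, sem="relaxed", value_used=False):
        # Fall back to the pointer path for unsupported ops (xchg, cas),
        # non-relaxed semantics, or calls whose return value is consumed
        # (the descriptor atomics return void).
        if op not in self.SUPPORTED or sem != "relaxed" or value_used:
            return super().codegen_atomic(op, ptr, value, sem, value_used)
        return f"desc.atomic_{op}({ptr}, {value})"
```

A single generic `codegen_atomic(op, ...)` keeps the fallback logic in one place instead of duplicating it across seven per-op methods.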
9e75d29 to f770145
For kernels like `matmul_split_k`, we can get perf improvements from a tensor_descriptor-based `atomic_add`.
Creates a new `atomic_indexing` config field that controls which indexing strategy is used for `hl.atomic_add`, independent of the `indexing` field used for loads/stores (to not break any existing configs). By default we have `atomic_indexing = pointer`. When set to `"tensor_descriptor"`, Helion outputs `desc.atomic_add()` instead of `tl.atomic_add(...)`.

Some decisions I'd like to flag:

- `atomic_indexing` is separate from `indexing`. This is to not break existing configs.
- Added `codegen_atomic_add` to both Pointer and TensorDescriptor, and inherited the `is_supported()` checks for TensorDescriptor.
- `atomic_indexing = "tensor_descriptor"` will silently fall back to `"pointer"` if (1) `sem != "relaxed"`, (2) another op uses the output directly (since `desc.atomic_add()` returns void), or (3) the atomic op is `xchg` or `cas`. We could instead raise an error.
- Supported ops are `{add, and, max, min, or, xor}`. We have a generic `codegen_atomic(op, ...)` rather than separate `codegen_atomic_add`, `codegen_atomic_and`, etc. methods.

We see perf gains for
`matmul_split_k`, comparing two configs with `atomic_indexing = pointer` vs `atomic_indexing = tensor_descriptor`: https://gist.github.com/ethche/e36ced0f446c5836d17cb8020d57b02b
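To make the new field concrete, here is a hypothetical sketch of how a config might combine the existing per-op `indexing` list with the new `atomic_indexing` field, and of the validator the PR adds. The dict shape is an assumption for illustration; only `valid_atomic_indexing_types()` and the block_ptr exclusion come from the PR.

```python
def valid_atomic_indexing_types():
    # Only pointer and tensor_descriptor are allowed for atomics:
    # block_ptr is excluded because Triton has no block_ptr atomics.
    return ("pointer", "tensor_descriptor")

# Hypothetical config: loads/stores keep their existing strategies,
# while the kernel's single atomic_add switches to TMA descriptors.
config = {
    "indexing": ["pointer", "block_ptr"],
    "atomic_indexing": ["tensor_descriptor"],
}

# A config validator can reject block_ptr for atomics up front,
# rather than relying on an unreachable fallback at codegen time.
assert all(t in valid_atomic_indexing_types() for t in config["atomic_indexing"])
```

Validating at config time is what makes the `BlockPtrIndexingStrategy` atomic fallback unreachable, as noted in the review comment above.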