Open
added 3 commits on April 3, 2026 at 10:14
…g config

Add a separate `atomic_indexing` config field that controls which indexing strategy is used for atomic operations (e.g., `hl.atomic_add`), independent of the `indexing` field used for loads/stores. When set to `"tensor_descriptor"`, atomic writes use TMA-based `desc.atomic_add()`, which bypasses L1 cache and reduces contention, yielding 1.2-1.8x speedup on Blackwell GPUs for split-K patterns.

Key changes:

- Add `codegen_atomic_add()` to all `IndexingStrategy` subclasses
- `TensorDescriptorIndexingStrategy` falls back to pointer when: the return value is consumed (desc returns void), `sem` is non-relaxed (unsupported by the descriptor API), or the tensor doesn't meet descriptor requirements
- `BlockPtrIndexingStrategy` always falls back to pointer (no atomic support)
- Separate `atomic_op_index` counter and `get_atomic_indexing_strategy()` in `DeviceFunction` to keep atomic config independent from load/store config
- Register `atomic_indexing` as an autotunable in `ConfigSpec`
- Split `_count_device_loads_and_stores` back to loads/stores only
- Add `_count_device_atomics` and `_register_atomic_tunables` as separate functions
- Add `valid_atomic_indexing_types()` that only allows pointer and tensor_descriptor (no block_ptr, since Triton has no block_ptr atomics)
- Add tests for per-op `atomic_indexing` list and block_ptr fallback
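The fallback conditions listed above can be sketched as a single predicate. This is an illustrative sketch only; the function name `should_fall_back_to_pointer` and its parameters are assumptions, not Helion's actual internals.

```python
# Ops that the tensor-descriptor atomic path can express; xchg and cas
# have no descriptor form and must use the pointer path.
DESC_SUPPORTED_OPS = {"add", "and", "max", "min", "or", "xor"}

def should_fall_back_to_pointer(op, sem, value_is_consumed, tensor_ok):
    """Return True when a tensor_descriptor atomic must fall back to pointer.

    Hypothetical helper mirroring the checks described in the commit message.
    """
    if op not in DESC_SUPPORTED_OPS:
        return True  # unsupported reduction op (e.g. xchg, cas)
    if sem != "relaxed":
        return True  # descriptor API only supports relaxed semantics
    if value_is_consumed:
        return True  # desc.atomic_add() returns void, so the result is unusable
    if not tensor_ok:
        return True  # tensor fails descriptor requirements (layout, dtype, etc.)
    return False
```

Each check is independent, so any one failing condition silently routes the op back through the pointer strategy.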
`block_ptr` is never a valid `atomic_indexing` choice, so this fallback method is unreachable.
Refactor the atomic codegen path so all atomic operations (add, and, or, xor, max, min, xchg) route through a generic `codegen_atomic(op, ...)` method on `IndexingStrategy`. This enables tensor_descriptor-based TMA atomics for all supported reduction ops (add, and, max, min, or, xor), with automatic fallback to pointer for unsupported ops (xchg, cas), return-value-consuming calls, and non-relaxed memory semantics.
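The routing described in this refactor can be sketched as a small class hierarchy. The class and method bodies below are illustrative assumptions (Helion's real strategies emit Triton AST, not strings); only the op set and fallback conditions come from the PR.

```python
class PointerIndexingStrategy:
    """Baseline strategy: every atomic maps onto a tl.atomic_* call."""

    def codegen_atomic(self, op, ptr, value, sem="relaxed", value_used=False):
        # Hypothetical string-based codegen for illustration.
        return f"tl.atomic_{op}({ptr}, {value}, sem='{sem}')"

class TensorDescriptorIndexingStrategy(PointerIndexingStrategy):
    """TMA-based strategy: uses desc.atomic_* where the descriptor API allows."""

    SUPPORTED = {"add", "and", "max", "min", "or", "xor"}

    def codegen_atomic(self, op, ptr, value, sem="relaxed", value_used=False):
        # Fall back to the pointer path for unsupported ops (xchg, cas),
        # non-relaxed semantics, or calls whose return value is consumed
        # (the descriptor atomics return void).
        if op not in self.SUPPORTED or sem != "relaxed" or value_used:
            return super().codegen_atomic(op, ptr, value, sem, value_used)
        return f"desc.atomic_{op}({ptr}, {value})"
```

A single generic `codegen_atomic(op, ...)` keeps the fallback logic in one place instead of duplicating it across seven per-op methods.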
9e75d29 to f770145
For kernels like `matmul_split_k`, we can get perf improvements from a tensor_descriptor-based `atomic_add`.
Creates a new `atomic_indexing` config field that controls which indexing strategy is used for `hl.atomic_add`, independent of the `indexing` field used for loads/stores (to not break any existing configs). By default we have `atomic_indexing = pointer`. When set to `"tensor_descriptor"`, Helion outputs `desc.atomic_add()` instead of `tl.atomic_add(...)`.

Some decisions I'd like to flag:

- `atomic_indexing` is separate from `indexing`. This is to not break existing configs.
- Added `codegen_atomic_add` to both Pointer and TensorDescriptor, and inherited the `is_supported()` checks for TensorDescriptor.
- `atomic_indexing = "tensor_descriptor"` will silently fall back to `"pointer"` if (1) `sem != "relaxed"`, (2) another op uses the output directly (since `desc.atomic_add()` returns void), or (3) the atomic op is `xchg` or `cas`. We could instead raise an error.
- Supported ops are `{add, and, max, min, or, xor}`. We have a generic `codegen_atomic(op, ...)` rather than separate `codegen_atomic_add`, `codegen_atomic_and`, etc. methods.

We see perf gains for
`matmul_split_k`, comparing two configs with `atomic_indexing = pointer` vs `atomic_indexing = tensor_descriptor`: https://gist.github.com/ethche/e36ced0f446c5836d17cb8020d57b02b
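To make the new field concrete, here is a hypothetical sketch of how a config might combine the existing per-op `indexing` list with the new `atomic_indexing` field, and of the validator the PR adds. The dict shape is an assumption for illustration; only `valid_atomic_indexing_types()` and the block_ptr exclusion come from the PR.

```python
def valid_atomic_indexing_types():
    # Only pointer and tensor_descriptor are allowed for atomics:
    # block_ptr is excluded because Triton has no block_ptr atomics.
    return ("pointer", "tensor_descriptor")

# Hypothetical config: loads/stores keep their existing strategies,
# while the kernel's single atomic_add switches to TMA descriptors.
config = {
    "indexing": ["pointer", "block_ptr"],
    "atomic_indexing": ["tensor_descriptor"],
}

# A config validator can reject block_ptr for atomics up front,
# rather than relying on an unreachable fallback at codegen time.
assert all(t in valid_atomic_indexing_types() for t in config["atomic_indexing"])
```

Validating at config time is what makes the `BlockPtrIndexingStrategy` atomic fallback unreachable, as noted in the review comment above.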