
Enable tensor_descriptor based atomic ops#1953

Open
ethche wants to merge 4 commits into main from atomic-add-indexing-strategy

Conversation

Contributor

@ethche ethche commented Apr 6, 2026

For kernels like matmul_split_k, we can get perf improvements from tensor_descriptor-based atomic_add.

This PR creates a new atomic_indexing config field that controls which indexing strategy is used for hl.atomic_add, independent of the indexing field used for loads/stores (so existing configs are not broken). By default, atomic_indexing = "pointer". When set to "tensor_descriptor", Helion emits desc.atomic_add(...) instead of tl.atomic_add(...).
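As a rough sketch of how the new field sits alongside the existing one (the dict shape and value spellings here are illustrative assumptions, not the actual Helion Config API):

```python
# Hypothetical sketch of the config field described above; the dict shape
# is an assumption for illustration, not Helion's actual Config API.
config = {
    "indexing": "pointer",                   # strategy for loads/stores (unchanged)
    "atomic_indexing": "tensor_descriptor",  # new: strategy for hl.atomic_add
}

# Per the PR, omitting the field keeps atomics on the pointer path.
legacy_config = {"indexing": "pointer"}
atomic_strategy = legacy_config.get("atomic_indexing", "pointer")
print(atomic_strategy)  # -> pointer
```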

Some decisions I'd like to flag:

  • atomic_indexing is separate from indexing. This is to not break existing configs.
  • We re-use IndexingStrategy, and add a codegen_atomic_add to both Pointer and TensorDescriptor, and inherit the is_supported() checks for TensorDescriptor.
  • atomic_indexing = "tensor_descriptor" will silently fall back to "pointer" if any of the following hold: (1) sem != "relaxed", (2) another op consumes the output directly (since desc.atomic_add() returns void), or (3) the atomic op is xchg or cas. We could instead raise an error.
  • We have support for all reduction ops supported by TMA atomics: {add, and, max, min, or, xor}. We have a generic codegen_atomic(op,...) rather than separate codegen_atomic_add, codegen_atomic_and, etc. methods.
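The fallback rules in the bullets above can be summarized as a small predicate. This is an illustrative sketch of the decision logic only; the function and parameter names are hypothetical, not the PR's actual code:

```python
# Illustrative sketch of the tensor_descriptor -> pointer fallback rules
# described in the PR. Names (choose_atomic_strategy, result_is_consumed)
# are hypothetical; only the rules themselves come from the PR text.
TMA_ATOMIC_OPS = {"add", "and", "max", "min", "or", "xor"}

def choose_atomic_strategy(op: str, sem: str, result_is_consumed: bool,
                           configured: str = "tensor_descriptor") -> str:
    if configured != "tensor_descriptor":
        return configured
    if sem != "relaxed":           # descriptor atomics only support relaxed sem
        return "pointer"
    if result_is_consumed:         # desc.atomic_add() returns void
        return "pointer"
    if op not in TMA_ATOMIC_OPS:   # e.g. xchg and cas have no TMA path
        return "pointer"
    return "tensor_descriptor"

print(choose_atomic_strategy("add", "relaxed", False))   # tensor_descriptor
print(choose_atomic_strategy("xchg", "relaxed", False))  # pointer
print(choose_atomic_strategy("add", "acq_rel", False))   # pointer
```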

We see perf gains for matmul_split_k, comparing two configs with atomic_indexing = "pointer" vs atomic_indexing = "tensor_descriptor":
https://gist.github.com/ethche/e36ced0f446c5836d17cb8020d57b02b

| Size | Pointer TFLOPS | Pointer ms | Pointer err | TensorDesc TFLOPS | TensorDesc ms | TensorDesc err | Speedup |
|------|----------------|------------|-------------|-------------------|---------------|----------------|---------|
| 2048 | 136.0          | 0.126      | 3.20e-04    | 159.1             | 0.108         | 3.20e-04       | 1.17x   |
| 4096 | 152.0          | 0.904      | 5.95e-04    | 180.1             | 0.763         | 5.95e-04       | 1.19x   |
| 8192 | 112.8          | 9.745      | 1.94e-03    | 136.2             | 8.070         | 1.94e-03       | 1.21x   |
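The Speedup column is just the latency ratio of the two configs; a quick recomputation from the ms values above (results may differ in the last digit from the table, since the displayed ms values are themselves rounded):

```python
# Recompute the Speedup column from the Pointer ms and TensorDesc ms
# columns of the benchmark table above.
rows = [
    (2048, 0.126, 0.108),
    (4096, 0.904, 0.763),
    (8192, 9.745, 8.070),
]
for size, pointer_ms, desc_ms in rows:
    print(f"{size}: {pointer_ms / desc_ms:.2f}x")
```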

Ethan Che added 3 commits April 3, 2026 10:14
…g config

Add a separate `atomic_indexing` config field that controls which indexing
strategy is used for atomic operations (e.g., hl.atomic_add), independent
of the `indexing` field used for loads/stores. When set to
"tensor_descriptor", atomic writes use TMA-based desc.atomic_add() which
bypasses L1 cache and reduces contention — yielding 1.2-1.8x speedup on
Blackwell GPUs for split-K patterns.

Key changes:
- Add codegen_atomic_add() to all IndexingStrategy subclasses
- TensorDescriptorIndexingStrategy falls back to pointer when: the return
  value is consumed (desc returns void), sem is non-relaxed (unsupported
  by descriptor API), or the tensor doesn't meet descriptor requirements
- BlockPtrIndexingStrategy always falls back to pointer (no atomic support)
- Separate atomic_op_index counter and get_atomic_indexing_strategy() in
  DeviceFunction to keep atomic config independent from load/store config
- Register atomic_indexing as an autotunable in ConfigSpec
- Split _count_device_loads_and_stores back to loads/stores only
- Add _count_device_atomics and _register_atomic_tunables as separate functions
- Add valid_atomic_indexing_types() that only allows pointer and
  tensor_descriptor (no block_ptr since Triton has no block_ptr atomics)
- Add tests for per-op atomic_indexing list and block_ptr fallback
block_ptr is never a valid atomic_indexing choice, so this fallback
method is unreachable.
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 6, 2026
Refactor the atomic codegen path so all atomic operations (add, and, or,
xor, max, min, xchg) route through a generic codegen_atomic(op, ...)
method on IndexingStrategy. This enables tensor_descriptor-based TMA
atomics for all supported reduction ops (add, and, max, min, or, xor),
with automatic fallback to pointer for unsupported ops (xchg, cas),
return-value-consuming calls, and non-relaxed memory semantics.
@ethche ethche force-pushed the atomic-add-indexing-strategy branch from 9e75d29 to f770145 on April 6, 2026 03:42