Epilogue subtiling: store indexing fix, example, and tuple output support in run_example #1907
Conversation
Force-pushed from 4fbdeac to bafff2a
@ethche I'm having trouble getting the autotuner to find the subtiling config. If I run it as is, I often see a slowdown. If I seed quick autotuning with some epilogue subtiling configs and compare against plain quick autotuning without epilogue subtiling, I see pretty substantial speedups. The speedups are probably a bit inflated here, since the default quick autotuning baseline lands on a bad config, but I consistently see 30%+ speedups with epilogue subtiling for these kernels.
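The seeding comparison described above can be illustrated with a toy random-search autotuner. This is a hypothetical sketch on a synthetic cost surface, not Helion's actual autotuner or API: `cost`, `quick_autotune`, and the config representation are all made up for illustration.

```python
# Toy illustration of seeding quick autotuning with known-good starting
# configs vs. a cold start. The cost model and search are synthetic and
# unrelated to Helion's real autotuner.
import random


def cost(config: int) -> float:
    # Synthetic cost surface with a narrow optimum at config == 7,
    # standing in for "the epilogue-subtiling config is hard to find".
    return abs(config - 7) + 1.0


def quick_autotune(seeds: list[int], budget: int = 5, rng_seed: int = 0) -> int:
    # Evaluate the seed configs plus a small random sample, keep the best.
    rng = random.Random(rng_seed)
    candidates = list(seeds) + [rng.randrange(100) for _ in range(budget)]
    return min(candidates, key=cost)


cold = quick_autotune(seeds=[])        # cold start: may land on a bad config
seeded = quick_autotune(seeds=[7, 8])  # seeded with known-good configs
print(cost(seeded) <= cost(cold))  # -> True
```

With a small search budget, the seeded run is never worse than the cold start, which mirrors the benchmarking setup described in the comment.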
Thank you for letting me know. I'll look into this. Two usual issues are:
Force-pushed from bafff2a to 70a7cb0
Why do we need an example file here? Shouldn't the autotuner find it automatically?
The examples are just to show users that epilogue subtiling is supported and to demonstrate cases where it can provide benefits.
helion/_compiler/device_ir.py (outdated)

```python
    EnumFragment(choices=env.config_spec.valid_indexing_types()),
    length=total_count,
)
env.config_spec.store_indexing_start = total_load_count
```
Are we assuming all loads come before all stores? Will this always be the case? Maybe we should split loads and stores in the config, or find a cleaner way to represent this?
You're right. I've added tracking for the indices of stores so that we're not assuming all loads come before all stores. This is a smaller change than separating loads and stores in the config.
I do think separate load and store lists would be a bit clearer to users, but that change is BC-breaking. Happy to take that route if you think it is better in the long term.
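The difference between the two approaches can be sketched as follows. This is a hypothetical illustration, not Helion's real internals: `ConfigSpec`, `add_load`, `add_store`, and `store_indices` are made-up names standing in for a single flat indexing list plus a record of which slots belong to stores.

```python
# Hypothetical sketch of tracking store indices within one shared
# load/store indexing list, instead of assuming all loads come first.
from dataclasses import dataclass, field


@dataclass
class ConfigSpec:
    # One flat list of indexing choices covering both loads and stores,
    # in program order (loads and stores may interleave).
    indexing: list[str] = field(default_factory=list)
    # Slots in `indexing` that belong to stores, recorded as each op is
    # visited, so no "all loads before all stores" assumption is needed.
    store_indices: list[int] = field(default_factory=list)

    def add_load(self) -> int:
        self.indexing.append("pointer")
        return len(self.indexing) - 1

    def add_store(self) -> int:
        idx = len(self.indexing)
        self.indexing.append("pointer")
        self.store_indices.append(idx)
        return idx


spec = ConfigSpec()
spec.add_load()   # load  -> slot 0
spec.add_store()  # store -> slot 1
spec.add_load()   # load  -> slot 2 (a load *after* a store)
spec.add_store()  # store -> slot 3
print(spec.store_indices)  # -> [1, 3]
```

A single `store_indexing_start` offset would mislabel slot 2 here, whereas the explicit index list stays correct when loads and stores interleave; splitting the config into separate load and store lists would also work but changes the config format.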
Force-pushed from 70a7cb0 to 37769ed
Force-pushed from 1902fa6 to 9325329
Force-pushed from 9325329 to eec168a
No description provided.