[DO NOT REVIEW] [Pallas] Add TPU nightly benchmark workflow and runner#1913

Draft
norx1991 wants to merge 14 commits into main from yifeixu/tpu-nightly-benchmark
Conversation

Contributor

@norx1991 norx1991 commented Apr 1, 2026

Caution

The push trigger in benchmark_tpu_nightly.yml is temporary for CI testing — remove before merging.

Summary

Add a nightly CI workflow that runs Helion examples with autotuning on TPU, publishing results to the pytorch benchmark hub. This mirrors the GPU benchmark infrastructure (benchmark_nightly.yml + benchmarks/run.py) but uses a standalone runner, since TritonBench has hard dependencies on triton and CUDA at the import, install, and runtime levels.

New files

  • benchmarks/run_tpu.py — TPU benchmark runner

    • CLI: --kernel/--op (comma-separated), --output (JSON), --list-kernels
    • Multi-shape benchmarking: each kernel tested at multiple input sizes (like TritonBench does for GPU)
    • Per-shape accuracy check + timing vs torch baseline with speedup display
    • JSON output in pytorch benchmark hub format
    • 11 reliable kernels: exp, add, softmax_two_pass, welford, attention, bmm, geglu, grpo_loss, jagged_hstu_attn, low_mem_dropout, swiglu
  • .github/workflows/benchmark_tpu.yml — Reusable benchmark workflow

    • Runner: linux.google.tpuv7x.1
    • Setup: PyTorch CPU nightly, JAX/Pallas, builds torch_tpu from pinned commit
    • Two-pass pattern: autotune → sleep 1min → HELION_ASSERT_CACHE_HIT=1 verify + record
    • Upload to pytorch benchmark hub
  • .github/workflows/benchmark_tpu_nightly.yml — Nightly trigger

    • Cron: daily at 2 AM PST (10 AM UTC)
    • workflow_dispatch with kernels input for manual runs
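The per-shape accuracy check and speedup display in run_tpu.py can be sketched roughly as follows. This is a minimal illustration only — the function names, timing loop, and record fields are assumptions, not the actual runner code:

```python
import time

def benchmark_kernel(kernel_fn, baseline_fn, inputs, iters=100):
    """Illustrative sketch: for each input, check the kernel's result
    against the baseline, then time both and report a speedup."""

    def timeit(fn, x):
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
        return (time.perf_counter() - start) / iters

    results = []
    for x in inputs:
        accurate = kernel_fn(x) == baseline_fn(x)
        t_kernel = timeit(kernel_fn, x)
        t_base = timeit(baseline_fn, x)
        results.append({
            "input": x,
            "accurate": accurate,
            # Guard against a zero measurement on trivial functions.
            "speedup": t_base / t_kernel if t_kernel > 0 else float("inf"),
        })
    return results
```

In the real runner the per-shape records would additionally be serialized into the pytorch benchmark hub JSON format via --output.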

Autotuning effort: quick vs full

The workflow uses HELION_AUTOTUNE_EFFORT=quick instead of the default full effort. With full effort (20 generations, 5 copies, FROM_RANDOM initial population), 5 out of 11 kernels exceed the 1200s per-kernel timeout on CI:

  • Timeout with full effort: welford, attention, geglu, grpo_loss, swiglu
  • Pass with full effort: exp, add, softmax_two_pass, bmm, low_mem_dropout, jagged_hstu_attn

With quick effort, all 11 kernels complete within the timeout. The primary goal of this nightly benchmark is to verify that Helion examples compile, run, and produce correct results on TPU — not to find optimal configs. Quick effort is sufficient for this purpose and can be upgraded once autotuning performance improves.
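How an effort preset combines with the HELION_AUTOTUNE_MAX_GENERATIONS clamp can be sketched as below. The preset numbers (100/20 for full, 30/5 for quick) are taken from the commit notes in this PR; the function and dict names here are hypothetical, not Helion's actual API:

```python
import os

# Hypothetical presets; population/generation counts per the commit notes.
EFFORT_PRESETS = {
    "full": {"initial_population": 100, "generations": 20},
    "quick": {"initial_population": 30, "generations": 5},
}

def autotune_settings():
    """Resolve autotuning search parameters from the environment
    (illustrative sketch, not Helion's real configuration code)."""
    effort = os.environ.get("HELION_AUTOTUNE_EFFORT", "full")
    settings = dict(EFFORT_PRESETS[effort])
    # An explicit max-generations cap further limits the search,
    # as done in CI with HELION_AUTOTUNE_MAX_GENERATIONS=2.
    max_gen = os.environ.get("HELION_AUTOTUNE_MAX_GENERATIONS")
    if max_gen is not None:
        settings["generations"] = min(settings["generations"], int(max_gen))
    return settings
```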

@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 1, 2026
@norx1991 norx1991 changed the title from "Add TPU nightly benchmark workflow and runner" to "[DO NOT REVIEW] [Pallas] Add TPU nightly benchmark workflow and runner" on Apr 1, 2026
norx1991 added 10 commits April 3, 2026 22:55
Add a nightly CI workflow that runs Helion examples with autotuning on TPU,
with results published to pytorch benchmark hub.
- gnupg -> gpg (package available on runner)
- Use correct secret name (torchtpu-read-key) and repo (google-pytorch/torch_tpu)
- Update jax/jaxlib to 0.9.2 matching test.yml
- Add 600s per-kernel timeout using multiprocessing to handle stuck
  autotuning (native C++ calls can't be interrupted by Python signals)
- Set HELION_AUTOTUNE_EFFORT=quick in CI for faster autotuning
  (30 initial population, 5 generations vs 100/20 for full)
- Timeout configurable via HELION_BENCHMARK_KERNEL_TIMEOUT env var
The pytorch/test-infra gather-* actions require pip and nvidia-ml-py,
which don't work in a uv venv on TPU runners. Remove the upload job
and gather-* steps; keep only the artifact upload for now.
… generations

- Add --num-shapes CLI flag to control how many shapes per kernel (default: all)
- Restore full shape lists but use --num-shapes 1 in CI to avoid multiplied autotuning time
- Increase per-kernel timeout from 600s to 1200s (quick autotuning on v7 takes ~10min)
- Set HELION_AUTOTUNE_MAX_GENERATIONS=2 to further limit autotuning time
- Don't fail the job on partial kernel failures (report results for what passed)
… timeout

The benchmark runner was using multiprocessing.Process (fork) for per-kernel
timeouts. On Linux, forking after TPU/JAX initialization causes deadlocks
because JAX's internal threads and locks don't survive fork correctly. This
caused every kernel to hang for the full timeout (1200s) on CI.

Replace with signal.SIGALRM which runs everything in one process, avoiding
the fork-after-init issue entirely.
@norx1991 norx1991 force-pushed the yifeixu/tpu-nightly-benchmark branch from 7c00634 to 5edc93a on April 4, 2026 06:14
norx1991 added 4 commits April 4, 2026 12:47
New kernels: attention, bmm, geglu, grpo_loss, jagged_hstu_attn,
low_mem_dropout, swiglu. Total: 11 kernels (up from 4).
All kernels fail with "Default config failed while computing baseline"
but the actual exception is hidden at INFO level. DEBUG will show
the generated code and underlying error.
- Fix wrong kernel function name: jagged_hstu_attn -> _helion_jagged_attention_kernel
- Add HELION_AUTOTUNE_EFFORT=quick to CI workflow — full effort times out
  for 5/11 kernels (welford, attention, geglu, grpo_loss, swiglu)