[DO NOT REVIEW] [Pallas] Add TPU nightly benchmark workflow and runner#1913

Draft
norx1991 wants to merge 14 commits into main from yifeixu/tpu-nightly-benchmark
Conversation

Contributor

@norx1991 norx1991 commented Apr 1, 2026

Caution

The push trigger in benchmark_tpu_nightly.yml is temporary for CI testing — remove before merging.

Summary

Add a nightly CI workflow that runs Helion examples with autotuning on TPU, publishing results to the pytorch benchmark hub. This mirrors the GPU benchmark infrastructure (benchmark_nightly.yml + benchmarks/run.py) but uses a standalone runner, since TritonBench has hard dependencies on triton and CUDA at the import, install, and runtime levels.

New files

  • benchmarks/run_tpu.py — TPU benchmark runner

    • CLI: --kernel/--op (comma-separated), --output (JSON), --list-kernels
    • Multi-shape benchmarking: each kernel tested at multiple input sizes (like TritonBench does for GPU)
    • Per-shape accuracy check + timing vs torch baseline with speedup display
    • JSON output in pytorch benchmark hub format
    • 11 reliable kernels: exp, add, softmax_two_pass, welford, attention, bmm, geglu, grpo_loss, jagged_hstu_attn, low_mem_dropout, swiglu
  • .github/workflows/benchmark_tpu.yml — Reusable benchmark workflow

    • Runner: linux.google.tpuv7x.1
    • Setup: PyTorch CPU nightly, JAX/Pallas, builds torch_tpu from pinned commit
    • Two-pass pattern: autotune → sleep 1min → HELION_ASSERT_CACHE_HIT=1 verify + record
    • Upload to pytorch benchmark hub
  • .github/workflows/benchmark_tpu_nightly.yml — Nightly trigger

    • Cron: daily at 2 AM PST (10 AM UTC)
    • workflow_dispatch with kernels input for manual runs
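The per-shape accuracy check and speedup display in run_tpu.py can be sketched roughly as follows. This is a minimal illustration only — the function names, timing loop, and record fields are assumptions, not the actual runner code:

```python
import time

def benchmark_kernel(kernel_fn, baseline_fn, inputs, iters=100):
    """Illustrative sketch: for each input, check the kernel's result
    against the baseline, then time both and report a speedup."""

    def timeit(fn, x):
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
        return (time.perf_counter() - start) / iters

    results = []
    for x in inputs:
        accurate = kernel_fn(x) == baseline_fn(x)
        t_kernel = timeit(kernel_fn, x)
        t_base = timeit(baseline_fn, x)
        results.append({
            "input": x,
            "accurate": accurate,
            # Guard against a zero measurement on trivial functions.
            "speedup": t_base / t_kernel if t_kernel > 0 else float("inf"),
        })
    return results
```

In the real runner the per-shape records would additionally be serialized into the pytorch benchmark hub JSON format via --output.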

Autotuning effort: quick vs full

The workflow uses HELION_AUTOTUNE_EFFORT=quick instead of the default full effort. With full effort (20 generations, 5 copies, FROM_RANDOM initial population), 5 out of 11 kernels exceed the 1200s per-kernel timeout on CI:

  • Timeout with full effort: welford, attention, geglu, grpo_loss, swiglu
  • Pass with full effort: exp, add, softmax_two_pass, bmm, low_mem_dropout, jagged_hstu_attn

With quick effort, all 11 kernels complete within the timeout. The primary goal of this nightly benchmark is to verify that Helion examples compile, run, and produce correct results on TPU — not to find optimal configs. Quick effort is sufficient for this purpose and can be upgraded once autotuning performance improves.
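How an effort preset combines with the HELION_AUTOTUNE_MAX_GENERATIONS clamp can be sketched as below. The preset numbers (100/20 for full, 30/5 for quick) are taken from the commit notes in this PR; the function and dict names here are hypothetical, not Helion's actual API:

```python
import os

# Hypothetical presets; population/generation counts per the commit notes.
EFFORT_PRESETS = {
    "full": {"initial_population": 100, "generations": 20},
    "quick": {"initial_population": 30, "generations": 5},
}

def autotune_settings():
    """Resolve autotuning search parameters from the environment
    (illustrative sketch, not Helion's real configuration code)."""
    effort = os.environ.get("HELION_AUTOTUNE_EFFORT", "full")
    settings = dict(EFFORT_PRESETS[effort])
    # An explicit max-generations cap further limits the search,
    # as done in CI with HELION_AUTOTUNE_MAX_GENERATIONS=2.
    max_gen = os.environ.get("HELION_AUTOTUNE_MAX_GENERATIONS")
    if max_gen is not None:
        settings["generations"] = min(settings["generations"], int(max_gen))
    return settings
```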

@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 1, 2026
@norx1991 norx1991 changed the title from "Add TPU nightly benchmark workflow and runner" to "[DO NOT REVIEW] [Pallas] Add TPU nightly benchmark workflow and runner" on Apr 1, 2026
norx1991 added 10 commits April 3, 2026 22:55
Add a nightly CI workflow that runs Helion examples with autotuning on TPU,
with results published to pytorch benchmark hub.
- gnupg -> gpg (package available on runner)
- Use correct secret name (torchtpu-read-key) and repo (google-pytorch/torch_tpu)
- Update jax/jaxlib to 0.9.2 matching test.yml
- Add 600s per-kernel timeout using multiprocessing to handle stuck
  autotuning (native C++ calls can't be interrupted by Python signals)
- Set HELION_AUTOTUNE_EFFORT=quick in CI for faster autotuning
  (30 initial population, 5 generations vs 100/20 for full)
- Timeout configurable via HELION_BENCHMARK_KERNEL_TIMEOUT env var
The pytorch/test-infra gather-* actions require pip and nvidia-ml-py,
which don't work in a uv venv on TPU runners. Remove the upload job
and gather-* steps; keep only the artifact upload for now.
… generations

- Add --num-shapes CLI flag to control how many shapes per kernel (default: all)
- Restore full shape lists but use --num-shapes 1 in CI to avoid multiplied autotuning time
- Increase per-kernel timeout from 600s to 1200s (quick autotuning on v7 takes ~10min)
- Set HELION_AUTOTUNE_MAX_GENERATIONS=2 to further limit autotuning time
- Don't fail the job on partial kernel failures (report results for what passed)
… timeout

The benchmark runner was using multiprocessing.Process (fork) for per-kernel
timeouts. On Linux, forking after TPU/JAX initialization causes deadlocks
because JAX's internal threads and locks don't survive fork correctly. This
caused every kernel to hang for the full timeout (1200s) on CI.

Replace with signal.SIGALRM which runs everything in one process, avoiding
the fork-after-init issue entirely.
@norx1991 norx1991 force-pushed the yifeixu/tpu-nightly-benchmark branch from 7c00634 to 5edc93a on April 4, 2026 06:14
norx1991 added 4 commits April 4, 2026 12:47
New kernels: attention, bmm, geglu, grpo_loss, jagged_hstu_attn,
low_mem_dropout, swiglu. Total: 11 kernels (up from 4).
All kernels fail with "Default config failed while computing baseline"
but the actual exception is hidden at INFO level. DEBUG will show
the generated code and underlying error.
- Fix wrong kernel function name: jagged_hstu_attn -> _helion_jagged_attention_kernel
- Add HELION_AUTOTUNE_EFFORT=quick to CI workflow — full effort times out
  for 5/11 kernels (welford, attention, geglu, grpo_loss, swiglu)