
{ai}[foss/2024a] PyTorch v2.9.1 w/ CUDA 12.6.0#24365

Merged
boegel merged 28 commits into easybuilders:develop from Flamefire:20251024183337_new_pr_PyTorch290
Mar 30, 2026

Conversation

@Flamefire Flamefire marked this pull request as draft October 24, 2025 16:33
@github-actions

github-actions Bot commented Oct 24, 2025

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

@Thyre Thyre added the 2024a label (issues & PRs related to 2024a common toolchains) Oct 25, 2025
@Flamefire Flamefire marked this pull request as ready for review December 9, 2025 11:31
@Flamefire Flamefire changed the title from {ai}[foss/2024a] PyTorch v2.9.0 w/ CUDA 12.6.0 to {ai}[foss/2024a] PyTorch v2.9.1 w/ CUDA 12.6.0 Dec 9, 2025
@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from 9849937 to 15b85aa Compare December 9, 2025 11:45
@github-actions github-actions Bot added the new label Dec 9, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (total: 27 hours 27 mins 43 secs) (1 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/50cdd13305fd9a33c6140c223aeab6cd for a full test report.

@boegel
Member

boegel commented Dec 13, 2025

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 4 out of 6 (total: 8 mins 45 secs) (6 easyconfigs in total)
node3907.accelgor.os - Linux RHEL 9.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/e06f98f956452bcb8a132f816e55927b for a full test report.

@boegel
Member

boegel commented Dec 13, 2025

I'm also seeing a crash with the triton_test.py script:

== FAILED: Installation ended unsuccessfully: Sanity check failed: sanity check command TRITON_HOME=$TMPDIR/eb-triton_home python
/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/test/triton_test.py 8.0 failed with exit code 1 (output: Traceback (most recent call last):
  File "/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/test/triton_test.py", line 13, in <module>
    src = triton.compiler.ASTSource(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/lib/python3.12/site-packages/triton/compiler/compiler.py", line 67, in __init__
    for k in self.signature.keys():
             ^^^^^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'keys'

@Flamefire
Contributor Author

That's why the checksum changed: there is a breaking change in Triton 3.5, and I updated the test script accordingly. The update is in #24793.
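For context, a minimal sketch (a stand-in, not the actual Triton source) of why the pre-3.5-style signature fails: Triton 3.5's `ASTSource.__init__` iterates over `signature.keys()`, so `signature` must now be a dict mapping argument indices to type strings (e.g. `{0: "*fp32", 1: "i32"}`) rather than the old comma-separated string form.

```python
# Hypothetical stand-in for the loop in triton/compiler/compiler.py that
# triggered the AttributeError in the traceback above.
def validate_signature(signature):
    """Mimics Triton 3.5's assumption that `signature` is a dict."""
    for k in signature.keys():  # a plain str has no .keys()
        pass
    return sorted(signature.keys())

# Old (pre-3.5) comma-separated string form now raises AttributeError:
try:
    validate_signature("*fp32,i32")
except AttributeError as e:
    print(e)  # → 'str' object has no attribute 'keys'

# New dict form, as expected by Triton 3.5:
print(validate_signature({0: "*fp32", 1: "i32"}))  # → [0, 1]
```

The updated test script in the referenced PR switches to the dict form accordingly.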

@Flamefire
Contributor Author

I had to use tlparse 0.4.0 (also a separate PR, #24882) as the older version isn't compatible with PyTorch's output; see pytorch/pytorch@92c2dae

The lowest tlparse version that works is 0.3.42.

Not sure if this causes conflicts in EB. The alternative is to drop this dependency, as it is optional.

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from 2b8bc42 to 64a4d67 Compare December 16, 2025 17:04
@Flamefire
Contributor Author

Rebased to remove EasyConfigs present in develop from this branch.

Also added 2 more patches to avoid remaining failures.

@github-actions github-actions Bot removed the new label Dec 16, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (total: 4 mins 48 secs) (5 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/6f2e5c7a020ec72dff9d2f5c8220fba5 for a full test report.

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch 2 times, most recently from 01180ef to 381c028 Compare December 17, 2025 09:09
@boegel boegel added this to the release after 5.2.0 milestone Dec 18, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (total: 24 hours 42 mins 21 secs) (5 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/a6bd885643124f5fac4864060e0e18cd for a full test report.

…es: PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_readd-support-for-nvidia-cutlass-python-package.patch
@verdurin
Member

Giving this a try on our GH200 nodes.

@verdurin
Member

verdurin commented Mar 2, 2026

As discussed on Slack, the build was still going after a week when the node crashed.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (total: 27 hours 21 mins 47 secs) (3 easyconfigs in total)
c52 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/329ba996b3b52bc432c63a52474dae9e for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 2 out of 3 (total: 45 hours 7 mins 48 secs) (3 easyconfigs in total)
i8013 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/90cec1eaa461ee3e31409057e8606117 for a full test report.

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from dcf14bd to efa1f26 Compare March 11, 2026 11:55
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (total: 27 hours 26 mins 54 secs) (3 easyconfigs in total)
i8004 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/c60221f0442df0f334bbfa21083b4885 for a full test report.

@boegel
Member

boegel commented Mar 24, 2026

@boegelbot please test @ jsc-zen3-a100
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24365 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24365 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 10082

Test results coming soon (I hope)...

Details

- notification for comment with ID 4119347250 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Mar 26, 2026

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (total: 30 hours 29 mins 36 secs) (3 easyconfigs in total)
node4308.litleo.os - Linux RHEL 9.6, x86_64, AMD EPYC 9454P 48-Core Processor (zen4), 1 x NVIDIA NVIDIA H100 NVL, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/0bcdbf5171c38da38e9523d11fb50d37 for a full test report.

@boegel
Member

boegel commented Mar 26, 2026

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (total: 33 hours 38 mins 50 secs) (3 easyconfigs in total)
node3907.accelgor.os - Linux RHEL 9.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 590.48.01, Python 3.9.21
See https://gist.github.com/boegel/547187cf1e4f7209288268edc9a50d65 for a full test report.

@boegel
Member

boegel commented Mar 26, 2026

Test report by @boegel
FAILED
Build succeeded for 2 out of 3 (total: 33 hours 57 mins 31 secs) (3 easyconfigs in total)
node3303.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/537d0e7d10586d4c92d8aea758b7a481 for a full test report.

@boegel
Member

boegel commented Mar 26, 2026

Last test report shows slightly elevated test failures on a V100 system, which is not unexpected at this point, I would say...

== 2026-03-26 05:03:04,054 build_log.py:454 WARNING 63 test failures, 0 test errors (out of 266247):
Failed tests (suites/files):
	distributed/optim/test_zero_redundancy_optimizer (2 failed, 10 passed, 30 skipped, 0 errors)
	distributed/test_store (1 failed, 40 passed, 12 skipped, 0 errors)
	dynamo/test_compiler_bisector (1 failed, 5 passed, 1 skipped, 0 errors)
	higher_order_ops/test_invoke_subgraph (2 failed, 65 passed, 1 skipped, 0 errors)
	inductor/test_aot_inductor (9 failed, 403 passed, 455 skipped, 0 errors)
	inductor/test_compile_subprocess (3 failed, 791 passed, 58 skipped, 0 errors)
	inductor/test_cuda_repro (2 failed, 68 passed, 8 skipped, 0 errors)
	inductor/test_custom_lowering (1 failed, 4 passed, 0 skipped, 0 errors)
	inductor/test_decompose_mem_bound_mm (4 failed, 31 passed, 2 skipped, 0 errors)
	inductor/test_inplace_padding (1 failed, 7 passed, 1 skipped, 0 errors)
	inductor/test_loop_ordering (2 failed, 45 passed, 2 skipped, 0 errors)
	inductor/test_max_autotune (10 failed, 112 passed, 60 skipped, 0 errors)
	inductor/test_memory (1 failed, 7 passed, 0 skipped, 0 errors)
	inductor/test_multi_kernel (2 failed, 17 passed, 0 skipped, 0 errors)
	inductor/test_online_softmax (17 failed, 14 passed, 0 skipped, 0 errors)
	inductor/test_pad_mm (1 failed, 17 passed, 1 skipped, 0 errors)
	inductor/test_scatter_optimization (1 failed, 7 passed, 0 skipped, 0 errors)
	inductor/test_select_algorithm (1 failed, 20 passed, 0 skipped, 0 errors)
	inductor/test_torchinductor_codegen_dynamic_shapes (1 failed, 1259 passed, 436 skipped, 0 errors)
	inductor/test_torchinductor_dynamic_shapes (1 failed, 1585 passed, 173 skipped, 0 errors)

I propose we don't make any further changes to this PR, but get it merged as is as soon as the test report from the bot comes back.

@Flamefire Let me know if you're interested in taking a look at the full log, I've saved that locally as joltik-easybuild-PyTorch-2.9.1-20260324.191022.vWiuw.log...

@Flamefire
Contributor Author

I propose we don't make any further changes to this PR, but get it merged as is as soon as the test report from the bot comes back.

@Flamefire Let me know if you're interested in taking a look at the full log, I've saved that locally as joltik-easybuild-PyTorch-2.9.1-20260324.191022.vWiuw.log...

So basically deem Volta systems as "mostly supported"?
If you want I can take a look at the worst offenders at least. But as it doesn't matter for us anymore I'm ok with leaving that as-is

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (total: 43 hours 12 mins 40 secs) (3 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 590.48.01, Python 3.9.25
See https://gist.github.com/boegelbot/5853d680e35ae825f683cdd7ed05b718 for a full test report.

@boegel
Member

boegel commented Mar 30, 2026

I propose we don't make any further changes to this PR, but get it merged as is as soon as the test report from the bot comes back.
@Flamefire Let me know if you're interested in taking a look at the full log, I've saved that locally as joltik-easybuild-PyTorch-2.9.1-20260324.191022.vWiuw.log...

So basically deem Volta systems as "mostly supported"? If you want I can take a look at the worst offenders at least. But as it doesn't matter for us anymore I'm ok with leaving that as-is

Or at least be a bit more lenient with those old GPUs, since it's not a surprise that more tests are failing?

We could even bake that into the PyTorch easyblock, for example by doubling max failing tests on a V100 system for recent PyTorch versions?
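A hedged sketch of what that leniency could look like. This is not actual easyblock code; the function name, the threshold logic, and the way the compute capability and version are passed in are all made up for illustration, and the real logic would live in the PyTorch easyblock:

```python
# Hypothetical helper: double the tolerated number of failing tests on
# Volta (compute capability 7.0) for recent PyTorch versions, since
# elevated failure counts on old GPUs are expected there.
def max_failed_tests(base_limit, cuda_compute_capability, pytorch_version):
    major, minor = (int(x) for x in pytorch_version.split(".")[:2])
    is_volta = cuda_compute_capability.startswith("7.0")
    if is_volta and (major, minor) >= (2, 9):
        return base_limit * 2  # be more lenient on old GPUs
    return base_limit

print(max_failed_tests(50, "7.0", "2.9.1"))  # → 100
print(max_failed_tests(50, "9.0", "2.9.1"))  # → 50 (H100: unchanged)
```

The actual threshold and version cutoff would be a judgment call for the easyblock maintainers.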

Member

@boegel boegel left a comment


lgtm

@boegel
Member

boegel commented Mar 30, 2026

Going in, thanks @Flamefire!

@boegel boegel merged commit 8c60e9a into easybuilders:develop Mar 30, 2026
6 checks passed
@Flamefire Flamefire deleted the 20251024183337_new_pr_PyTorch290 branch March 30, 2026 14:01

Labels

2024a (issues & PRs related to 2024a common toolchains), update
