
{ai}[foss/2024a] PyTorch v2.9.1 w/ CUDA 12.6.0#24365

Merged
boegel merged 28 commits into easybuilders:develop from Flamefire:20251024183337_new_pr_PyTorch290
Mar 30, 2026

Conversation

@Flamefire Flamefire marked this pull request as draft October 24, 2025 16:33
@github-actions

github-actions Bot commented Oct 24, 2025

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

@Thyre Thyre added the 2024a label (issues & PRs related to 2024a common toolchains) Oct 25, 2025
@Flamefire Flamefire marked this pull request as ready for review December 9, 2025 11:31
@Flamefire Flamefire changed the title from {ai}[foss/2024a] PyTorch v2.9.0 w/ CUDA 12.6.0 to {ai}[foss/2024a] PyTorch v2.9.1 w/ CUDA 12.6.0 Dec 9, 2025
@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from 9849937 to 15b85aa Compare December 9, 2025 11:45
@github-actions github-actions Bot added the new label Dec 9, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (total: 27 hours 27 mins 43 secs) (1 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/50cdd13305fd9a33c6140c223aeab6cd for a full test report.

@boegel
Member

boegel commented Dec 13, 2025

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 4 out of 6 (total: 8 mins 45 secs) (6 easyconfigs in total)
node3907.accelgor.os - Linux RHEL 9.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/e06f98f956452bcb8a132f816e55927b for a full test report.

@boegel
Member

boegel commented Dec 13, 2025

I'm also seeing a crash with the triton_test.py script:

== FAILED: Installation ended unsuccessfully: Sanity check failed: sanity check command TRITON_HOME=$TMPDIR/eb-triton_home python
/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/test/triton_test.py 8.0 failed with exit code 1 (output: Traceback (most recent call last):
  File "/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/test/triton_test.py", line 13, in <module>
    src = triton.compiler.ASTSource(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/lib/python3.12/site-packages/triton/compiler/compiler.py", line 67, in __init__
    for k in self.signature.keys():
             ^^^^^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'keys'

@Flamefire
Contributor Author

That's why the checksum changed: there is a breaking change in Triton 3.5, and I updated the test script accordingly. The update is in #24793.
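For context, a minimal sketch (a stand-in, not the actual Triton source) of why the pre-3.5-style signature fails: Triton 3.5's `ASTSource.__init__` iterates over `signature.keys()`, so `signature` must now be a dict mapping argument indices to type strings (e.g. `{0: "*fp32", 1: "i32"}`) rather than the old comma-separated string form.

```python
# Hypothetical stand-in for the loop in triton/compiler/compiler.py that
# triggered the AttributeError in the traceback above.
def validate_signature(signature):
    """Mimics Triton 3.5's assumption that `signature` is a dict."""
    for k in signature.keys():  # a plain str has no .keys()
        pass
    return sorted(signature.keys())

# Old (pre-3.5) comma-separated string form now raises AttributeError:
try:
    validate_signature("*fp32,i32")
except AttributeError as e:
    print(e)  # → 'str' object has no attribute 'keys'

# New dict form, as expected by Triton 3.5:
print(validate_signature({0: "*fp32", 1: "i32"}))  # → [0, 1]
```

The updated test script in the referenced PR switches to the dict form accordingly.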

@Flamefire
Contributor Author

I had to use tlparse 0.4.0 (also a separate PR, #24882) as the older version isn't compatible with PyTorch's output; see pytorch/pytorch@92c2dae

The lowest tlparse version that works is 0.3.42.

Not sure if this causes conflicts in EB. The alternative is to drop this dependency, as it is optional.

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from 2b8bc42 to 64a4d67 Compare December 16, 2025 17:04
@Flamefire
Contributor Author

Rebased to remove EasyConfigs present in develop from this branch.

Also added 2 more patches to avoid remaining failures.

@github-actions github-actions Bot removed the new label Dec 16, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (total: 4 mins 48 secs) (5 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/6f2e5c7a020ec72dff9d2f5c8220fba5 for a full test report.

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch 2 times, most recently from 01180ef to 381c028 Compare December 17, 2025 09:09
@boegel boegel added this to the release after 5.2.0 milestone Dec 18, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (total: 24 hours 42 mins 21 secs) (5 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/a6bd885643124f5fac4864060e0e18cd for a full test report.

…es: PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_readd-support-for-nvidia-cutlass-python-package.patch
@verdurin
Member

Giving this a try on our GH200 nodes.

@verdurin
Member

verdurin commented Mar 2, 2026

As discussed on Slack, the build was still going after a week when the node crashed.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (total: 27 hours 21 mins 47 secs) (3 easyconfigs in total)
c52 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/329ba996b3b52bc432c63a52474dae9e for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 2 out of 3 (total: 45 hours 7 mins 48 secs) (3 easyconfigs in total)
i8013 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/90cec1eaa461ee3e31409057e8606117 for a full test report.

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from dcf14bd to efa1f26 Compare March 11, 2026 11:55
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (total: 27 hours 26 mins 54 secs) (3 easyconfigs in total)
i8004 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/c60221f0442df0f334bbfa21083b4885 for a full test report.

@boegel
Member

boegel commented Mar 24, 2026

@boegelbot please test @ jsc-zen3-a100
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24365 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24365 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 10082

Test results coming soon (I hope)...

Details

- notification for comment with ID 4119347250 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Mar 26, 2026

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (total: 30 hours 29 mins 36 secs) (3 easyconfigs in total)
node4308.litleo.os - Linux RHEL 9.6, x86_64, AMD EPYC 9454P 48-Core Processor (zen4), 1 x NVIDIA NVIDIA H100 NVL, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/0bcdbf5171c38da38e9523d11fb50d37 for a full test report.

@boegel
Member

boegel commented Mar 26, 2026

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (total: 33 hours 38 mins 50 secs) (3 easyconfigs in total)
node3907.accelgor.os - Linux RHEL 9.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 590.48.01, Python 3.9.21
See https://gist.github.com/boegel/547187cf1e4f7209288268edc9a50d65 for a full test report.

@boegel
Member

boegel commented Mar 26, 2026

Test report by @boegel
FAILED
Build succeeded for 2 out of 3 (total: 33 hours 57 mins 31 secs) (3 easyconfigs in total)
node3303.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/537d0e7d10586d4c92d8aea758b7a481 for a full test report.

@boegel
Member

boegel commented Mar 26, 2026

Last test report shows slightly elevated test failures on a V100 system, which is not unexpected at this point, I would say...

== 2026-03-26 05:03:04,054 build_log.py:454 WARNING 63 test failures, 0 test errors (out of 266247):
Failed tests (suites/files):
	distributed/optim/test_zero_redundancy_optimizer (2 failed, 10 passed, 30 skipped, 0 errors)
	distributed/test_store (1 failed, 40 passed, 12 skipped, 0 errors)
	dynamo/test_compiler_bisector (1 failed, 5 passed, 1 skipped, 0 errors)
	higher_order_ops/test_invoke_subgraph (2 failed, 65 passed, 1 skipped, 0 errors)
	inductor/test_aot_inductor (9 failed, 403 passed, 455 skipped, 0 errors)
	inductor/test_compile_subprocess (3 failed, 791 passed, 58 skipped, 0 errors)
	inductor/test_cuda_repro (2 failed, 68 passed, 8 skipped, 0 errors)
	inductor/test_custom_lowering (1 failed, 4 passed, 0 skipped, 0 errors)
	inductor/test_decompose_mem_bound_mm (4 failed, 31 passed, 2 skipped, 0 errors)
	inductor/test_inplace_padding (1 failed, 7 passed, 1 skipped, 0 errors)
	inductor/test_loop_ordering (2 failed, 45 passed, 2 skipped, 0 errors)
	inductor/test_max_autotune (10 failed, 112 passed, 60 skipped, 0 errors)
	inductor/test_memory (1 failed, 7 passed, 0 skipped, 0 errors)
	inductor/test_multi_kernel (2 failed, 17 passed, 0 skipped, 0 errors)
	inductor/test_online_softmax (17 failed, 14 passed, 0 skipped, 0 errors)
	inductor/test_pad_mm (1 failed, 17 passed, 1 skipped, 0 errors)
	inductor/test_scatter_optimization (1 failed, 7 passed, 0 skipped, 0 errors)
	inductor/test_select_algorithm (1 failed, 20 passed, 0 skipped, 0 errors)
	inductor/test_torchinductor_codegen_dynamic_shapes (1 failed, 1259 passed, 436 skipped, 0 errors)
	inductor/test_torchinductor_dynamic_shapes (1 failed, 1585 passed, 173 skipped, 0 errors)

I propose we don't make any further changes to this PR, but get it merged as is as soon as the test report from the bot comes back.

@Flamefire Let me know if you're interested in taking a look at the full log, I've saved that locally as joltik-easybuild-PyTorch-2.9.1-20260324.191022.vWiuw.log...

@Flamefire
Contributor Author

I propose we don't make any further changes to this PR, but get it merged as is as soon as the test report from the bot comes back.

@Flamefire Let me know if you're interested in taking a look at the full log, I've saved that locally as joltik-easybuild-PyTorch-2.9.1-20260324.191022.vWiuw.log...

So basically deem Volta systems as "mostly supported"?
If you want I can take a look at the worst offenders at least. But as it doesn't matter for us anymore I'm ok with leaving that as-is

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (total: 43 hours 12 mins 40 secs) (3 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 590.48.01, Python 3.9.25
See https://gist.github.com/boegelbot/5853d680e35ae825f683cdd7ed05b718 for a full test report.

@boegel
Member

boegel commented Mar 30, 2026

I propose we don't make any further changes to this PR, but get it merged as is as soon as the test report from the bot comes back.
@Flamefire Let me know if you're interested in taking a look at the full log, I've saved that locally as joltik-easybuild-PyTorch-2.9.1-20260324.191022.vWiuw.log...

So basically deem Volta systems as "mostly supported"? If you want I can take a look at the worst offenders at least. But as it doesn't matter for us anymore I'm ok with leaving that as-is

Or at least be a bit more lenient with those old GPUs, since it's not a surprise that more tests are failing?

We could even bake that into the PyTorch easyblock, for example by doubling max failing tests on a V100 system for recent PyTorch versions?
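A hedged sketch of what that leniency could look like. This is not actual easyblock code; the function name, the threshold logic, and the way the compute capability and version are passed in are all made up for illustration, and the real logic would live in the PyTorch easyblock:

```python
# Hypothetical helper: double the tolerated number of failing tests on
# Volta (compute capability 7.0) for recent PyTorch versions, since
# elevated failure counts on old GPUs are expected there.
def max_failed_tests(base_limit, cuda_compute_capability, pytorch_version):
    major, minor = (int(x) for x in pytorch_version.split(".")[:2])
    is_volta = cuda_compute_capability.startswith("7.0")
    if is_volta and (major, minor) >= (2, 9):
        return base_limit * 2  # be more lenient on old GPUs
    return base_limit

print(max_failed_tests(50, "7.0", "2.9.1"))  # → 100
print(max_failed_tests(50, "9.0", "2.9.1"))  # → 50 (H100: unchanged)
```

The actual threshold and version cutoff would be a judgment call for the easyblock maintainers.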

Member

@boegel boegel left a comment


lgtm

@boegel
Member

boegel commented Mar 30, 2026

Going in, thanks @Flamefire!

@boegel boegel merged commit 8c60e9a into easybuilders:develop Mar 30, 2026
6 checks passed
@Flamefire Flamefire deleted the 20251024183337_new_pr_PyTorch290 branch March 30, 2026 14:01

Labels

2024a (issues & PRs related to 2024a common toolchains), update
