{ai}[foss/2024a] PyTorch v2.9.1 w/ CUDA 12.6.0 #24365
boegel merged 28 commits into easybuilders:develop
Conversation
Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use
Force-pushed from 9849937 to 15b85aa
Test report by @Flamefire

Test report by @boegel

I'm also seeing a crash with the
That's why the checksum changed: there is a breaking change in Triton 3.5, and I updated the test script accordingly. It is in #24793.
I had to use tlparse 0.4.0 (also a separate PR in #24882) as the older one isn't compatible with the PyTorch output, see pytorch/pytorch@92c2dae.
Not sure if this causes conflicts in EB. The alternative is to drop this dependency, as it is optional.
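As a rough illustration, the version bump described above would look something like this in the easyconfig. This is a hypothetical excerpt, not the actual file: only the tlparse bump to 0.4.0 comes from the discussion; the surrounding entries are assumptions.

```python
# Hypothetical excerpt from the PyTorch 2.9.1 easyconfig (illustrative only).
# Only the tlparse bump to 0.4.0 is taken from the discussion above; tlparse
# is an optional dependency, so it could also simply be dropped if it causes
# conflicts elsewhere in EasyBuild.
dependencies = [
    ('Python', '3.12.3'),   # assumed existing dependency, version illustrative
    ('tlparse', '0.4.0'),   # older tlparse can't parse current PyTorch output
]
```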
Force-pushed from 2b8bc42 to 64a4d67
Rebased to remove easyconfigs already present in develop from this branch. Also added two more patches to avoid the remaining failures.
Test report by @Flamefire
Force-pushed from 01180ef to 381c028
Test report by @Flamefire
…es: PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_readd-support-for-nvidia-cutlass-python-package.patch
Force-pushed from 9e1d5c8 to 867d1a2
Giving this a try on our GH200 nodes.

As discussed on Slack, the build was still going after a week when the node crashed.

Test report by @Flamefire

Test report by @Flamefire
Force-pushed from dcf14bd to efa1f26
Test report by @Flamefire

@boegelbot please test @ jsc-zen3-a100
@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de, PR test command '
Test results coming soon (I hope)...
Details: notification for comment with ID 4119347250 processed. Message to humans: this is just bookkeeping information for me,
Test report by @boegel

Test report by @boegel

Test report by @boegel
Last test report shows slightly elevated test failures on a V100 system, which is not unexpected at this point, I would say... I propose we don't make any further changes to this PR, but get it merged as is as soon as the test report from the bot comes back. @Flamefire Let me know if you're interested in taking a look at the full log, I've saved that locally as

So basically deem Volta systems as "mostly supported"?
Test report by @boegelbot |
Or at least be a bit more lenient with those old GPUs, since it's not a surprise that more tests are failing? We could even bake that into the PyTorch easyblock, for example by doubling the maximum number of allowed failing tests on a V100 system for recent PyTorch versions?
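The leniency idea floated above could be sketched roughly like this. This is purely hypothetical, not the actual EasyBuild PyTorch easyblock: the function name, the 2x factor, and the pre-Ampere cutoff are all assumptions made for illustration.

```python
def adjusted_max_failed_tests(max_failed_tests, cuda_compute_capabilities):
    """Allow twice as many failing tests when targeting pre-Ampere GPUs.

    Hypothetical helper illustrating the proposal above; not part of the
    real PyTorch easyblock. Compute capabilities are strings such as
    '7.0' (V100) or '8.0' (A100).
    """
    # Be lenient if any targeted GPU is older than compute capability 8.0
    # (the 8.0 cutoff and the doubling factor are assumptions).
    if any(float(cc) < 8.0 for cc in cuda_compute_capabilities):
        return 2 * max_failed_tests
    return max_failed_tests

print(adjusted_max_failed_tests(10, ['7.0']))         # V100: doubled -> 20
print(adjusted_max_failed_tests(10, ['8.0', '9.0']))  # A100/H100: unchanged -> 10
```

One design question this leaves open is whether the relaxed limit should apply per GPU generation or only when every targeted compute capability is old; the sketch above relaxes it as soon as any old GPU is in the mix.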
Going in, thanks @Flamefire!
(created using eb --new-pr)
Requires: inductor/test_flex* tests #25492