fix possible error/crash in NCCL on x86 due to cpuid#19231
fix possible error/crash in NCCL on x86 due to cpuid#19231smoors merged 3 commits intoeasybuilders:developfrom
Conversation
|
Test report by @Flamefire |
|
@boegelbot: please test @ generoso |
|
@smoors: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... Details- notification for comment with ID 1812577674 processed Message to humans: this is just bookkeeping information for me, |
|
Test report by @boegelbot |
|
FAIL (build issue) NCCL-2.18.3-GCCcore-12.3.0-CUDA-12.1.1.eb
|
That must be a download error on your site: |
| checksums = [('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9', | ||
| 'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96')] | ||
| patches = ['NCCL-2.16.2_fix-cpuid.patch'] | ||
| checksums = [ | ||
| {'v2.18.3-1.tar.gz': '6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9'}, | ||
| {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'}, | ||
| ] |
There was a problem hiding this comment.
Oh, didn't notice that. Done 👍
There was a problem hiding this comment.
Hitting a framework bug. Shall we resolve that first or shall I revert to non-dicts?
There was a problem hiding this comment.
let's get this merged first?
checksumming is quite a mess at the moment. probably better to fix this in EB-5 so breaking changes can be made, see also easybuilders/easybuild-framework#4346
|
A college of mine tried to run GROMACS with NCCL and ran into a Segmentation fault pointing to |
|
|
@boegelbot please test @ generoso |
|
@SebastianAchilles: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... Details- notification for comment with ID 1814160413 processed Message to humans: this is just bookkeeping information for me, |
|
Test report by @SebastianAchilles |
|
Test report by @SebastianAchilles |
|
@boegelbot please test @ jsc-zen2 |
|
@SebastianAchilles: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster PR test command '
Test results coming soon (I hope)... Details- notification for comment with ID 1814215206 processed Message to humans: this is just bookkeeping information for me, |
|
Test report by @boegelbot |
|
Test report by @boegelbot |
|
Going in, thanks @Flamefire! |
(created using
eb --new-pr)I found this to be the cause for multiple failures of PyTorch on x86 (AMD EPYC) like: