{lib}[foss/2023a] TensorFlow v2.13.0 w/ CUDA 12.1.1 #19182
VRehnberg wants to merge 1 commit into easybuilders:develop
Conversation
Test report by @VRehnberg

The build error could easily be because you should have

Hmm, yes, that would be what I expected if it weren't for the fact that the non-CUDA TensorFlow seems to be working without it: https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.13.0-foss-2023a.eb Additionally, the header that is "missing" should come with JsonCpp, which is a dependency: https://github.com/open-source-parsers/jsoncpp/blob/master/include/json/json.h

@Flamefire Any thoughts on why the test installation would fail with "

@VRehnberg Any chance you have a custom easyblock in place for TensorFlow in
]
dependencies = [
    ('CUDA', '12.1.1', '', SYSTEM),
    ('cuDNN', '8.9.2.26', versionsuffix, SYSTEM),
@VRehnberg It seems like this cuDNN version is causing trouble; I'm getting:
In file included from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:37,
from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_OperationGraph.h:36,
from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Heuristics.h:31,
from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend.h:101,
from tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:56:
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h: In member function int64_t cudnn_frontend::PointWiseDesc_v8::getPortCount() const:
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h:69:16: error: enumeration value CUDNN_POINTWISE_RECIPROCAL not handled in switch [-Werror=switch]
69 | switch (mode) {
| ^
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h: In member function cudnn_frontend::Operation_v8&& cudnn_frontend::OperationBuilder_v8::build_pointwise_op():
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:413:16: error: enumeration value CUDNN_POINTWISE_RECIPROCAL not handled in switch [-Werror=switch]
413 | switch (m_operation.pointwise_mode) {
| ^
cc1plus: some warnings being treated as errors
see also tensorflow/tensorflow#60832, where they suggest downgrading to an older cuDNN (ugh...)
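For context, this class of failure is easy to reproduce outside of Bazel: GCC's `-Wswitch` warns when a `switch` over an enum has no `case` (and no `default`) for one of the enumerators, and `-Werror=switch` turns that warning into a hard error. The cuDNN frontend headers switch over the pointwise-mode enum, so a cuDNN release that adds a new enumerator (here `CUDNN_POINTWISE_RECIPROCAL`) breaks compilation of the older vendored frontend headers. A minimal sketch, with made-up enum and file names:

```shell
# Reproduce the failure mode with a tiny file (all names here are hypothetical):
cat > switch_demo.cpp <<'EOF'
// Imagine MODE_RECIPROCAL was added by a newer library release,
// while the switch below ships in an older header.
enum Mode { MODE_ADD, MODE_MUL, MODE_RECIPROCAL };

int port_count(Mode m) {
    switch (m) {  // no case for MODE_RECIPROCAL and no default
        case MODE_ADD: return 2;
        case MODE_MUL: return 2;
    }
    return -1;
}

int main() { return port_count(MODE_ADD) == 2 ? 0 : 1; }
EOF

# -Wswitch merely warns, but the TensorFlow build uses -Werror=switch, so:
g++ -Werror=switch -c switch_demo.cpp \
    || echo "hard error, exactly like the cuda_dnn.cc build failure"
```

This is why the mismatch only bites in one direction: a cuDNN *older* than the vendored frontend headers expect is fine for this check, while a *newer* one introduces unhandled enumerators.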
There are no easyconfigs yet that use a 2023a toolchain and have a cuDNN dependency, so we still have the freedom to stick to cuDNN 8.6.* here...
We also need to stick to CUDA 11.8 though, since cuDNN 8.6 is only paired with CUDA 10.2 and 11.8, it seems, see https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/
And CUDA 11.8 is a problem with GCC 12.x, hitting this when installing NCCL on top of CUDA 11.8.0 with GCCcore/12.3.0:
unsupported GNU version! gcc versions later than 11 are not supported!
So that tells me we're doomed to stick to foss/2022a for TensorFlow 2.13.0?
Meh, I'll close this one then.
Hmm, or go with another CUDA version, I suppose; that's what the CUDA version suffix is for, I guess. I can't find anything about which GCC versions are compatible with CUDA 12.3, but extrapolating from what I could find, it will probably work. CUDA 12.3 isn't listed as supported for cuDNN 8.9.6, but it could possibly work.
> unsupported GNU version! gcc versions later than 11 are not supported!
>
> So that tells me we're doomed to stick to foss/2022a for TensorFlow 2.13.0?
We can work around this by forcing NVCC to accept the "incompatible" compiler: https://github.com/easybuilders/easybuild-easyconfigs/pull/18853/files#diff-c0833191974a98d7eddf20cecac9d27ec670e369f43f75f3a4bafb2261b1135fR27
Of course there is a risk that the compiler really is incompatible...
> There are no easyconfigs yet that use a 2023a toolchain and have a cuDNN dependency, so we still have the freedom to stick to cuDNN 8.6.* here...
Oh, good catch, we actually did! We almost never have any custom ones, so I forgot to check, but it's removed now at least.
There seem to be some issues with compatibility between cuDNN, CUDA, and GCC (see boegel's comments). Probably easier to just go with foss-2022a for now, so I'll close this.
Test report by @VRehnberg
I was wondering what the right way forward is here, and whether this means "we cannot have a TensorFlow in 2023a". Had a small discussion with @boegel. This page of tested build configurations lists TF 2.15 + cuDNN 8.8 + CUDA 12.2 as a tested combination (and who knows, it might even work with cuDNN 8.9 too). The best way forward is probably to
That's what I usually do too, and it's why there's no PyTorch 2.x with CUDA yet: there are still remaining issues with the CPU version.
(created using eb --new-pr)