{lib}[foss/2023a] TensorFlow v2.13.0 w/ CUDA 12.1.1 #19182
VRehnberg wants to merge 1 commit into easybuilders:develop
Conversation
Test report by @VRehnberg

The build error could easily be because you should have

Hmm, yes, that would be what I expected if it weren't for the fact that the non-CUDA TensorFlow seems to be working without it: https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.13.0-foss-2023a.eb Additionally, the header that is "missing" should come with JsonCpp, which is a dependency: https://github.com/open-source-parsers/jsoncpp/blob/master/include/json/json.h

@Flamefire Any thoughts on why the test installation would fail with "

@VRehnberg Any chance you have a custom easyblock in place for TensorFlow in
]
dependencies = [
    ('CUDA', '12.1.1', '', SYSTEM),
    ('cuDNN', '8.9.2.26', versionsuffix, SYSTEM),
@VRehnberg It seems like this cuDNN version is causing trouble; I'm getting:
In file included from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:37,
from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_OperationGraph.h:36,
from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Heuristics.h:31,
from bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend.h:101,
from tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:56:
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h: In member function int64_t cudnn_frontend::PointWiseDesc_v8::getPortCount() const:
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h:69:16: error: enumeration value CUDNN_POINTWISE_RECIPROCAL not handled in switch [-Werror=switch]
69 | switch (mode) {
| ^
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h: In member function cudnn_frontend::Operation_v8&& cudnn_frontend::OperationBuilder_v8::build_pointwise_op():
bazel-out/k8-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:413:16: error: enumeration value CUDNN_POINTWISE_RECIPROCAL not handled in switch [-Werror=switch]
413 | switch (m_operation.pointwise_mode) {
| ^
cc1plus: some warnings being treated as errors
see also tensorflow/tensorflow#60832, where they suggest downgrading to an older cuDNN (ugh...)
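For context, this class of failure is easy to reproduce outside of Bazel: GCC's `-Wswitch` warns when a `switch` over an enum has no `case` (and no `default`) for one of the enumerators, and `-Werror=switch` turns that warning into a hard error. The cuDNN frontend headers switch over the pointwise-mode enum, so a cuDNN release that adds a new enumerator (here `CUDNN_POINTWISE_RECIPROCAL`) breaks compilation of the older vendored frontend headers. A minimal sketch, with made-up enum and file names:

```shell
# Reproduce the failure mode with a tiny file (all names here are hypothetical):
cat > switch_demo.cpp <<'EOF'
// Imagine MODE_RECIPROCAL was added by a newer library release,
// while the switch below ships in an older header.
enum Mode { MODE_ADD, MODE_MUL, MODE_RECIPROCAL };

int port_count(Mode m) {
    switch (m) {  // no case for MODE_RECIPROCAL and no default
        case MODE_ADD: return 2;
        case MODE_MUL: return 2;
    }
    return -1;
}

int main() { return port_count(MODE_ADD) == 2 ? 0 : 1; }
EOF

# -Wswitch merely warns, but the TensorFlow build uses -Werror=switch, so:
g++ -Werror=switch -c switch_demo.cpp \
    || echo "hard error, exactly like the cuda_dnn.cc build failure"
```

This is why the mismatch only bites in one direction: a cuDNN *older* than the vendored frontend headers expect is fine for this check, while a *newer* one introduces unhandled enumerators.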
There are no easyconfigs yet that use a 2023a toolchain and have a cuDNN dependency, so we still have the freedom to stick to cuDNN 8.6.* here...
We also need to stick to CUDA 11.8 though, since cuDNN 8.6 is only paired with CUDA 10.2 and 11.8, it seems, see https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/
And CUDA 11.8 is a problem with GCC 12.x, hitting this when installing NCCL on top of CUDA 11.8.0 with GCCcore/12.3.0:
unsupported GNU version! gcc versions later than 11 are not supported!
So that tells me we're doomed to stick to foss/2022a for TensorFlow 2.13.0?
Meh, I'll close this one then.
Hmm, or go with another CUDA version, I suppose; that's what the CUDA version suffix is for, I guess. I can't find anything about which GCC versions are compatible with CUDA 12.3, but extrapolating from what I could find, it will probably work. CUDA 12.3 isn't listed as supported for cuDNN 8.9.6, but it could possibly work.
> unsupported GNU version! gcc versions later than 11 are not supported!
>
> So that tells me we're doomed to stick to foss/2022a for TensorFlow 2.13.0?
We can work around this by forcing NVCC to accept the "incompatible" compiler: https://github.com/easybuilders/easybuild-easyconfigs/pull/18853/files#diff-c0833191974a98d7eddf20cecac9d27ec670e369f43f75f3a4bafb2261b1135fR27
Of course there is a risk that the compiler really is incompatible...
> There are no easyconfigs yet that use a 2023a toolchain and have a cuDNN dependency, so we still have the freedom to stick to cuDNN 8.6.* here...
Oh, good catch, we actually did! We almost never have any custom ones, so I forgot to check, but it's removed now at least.
There seem to be some issues with compatibility between cuDNN, CUDA, and GCC (see boegel's comments). Probably easier to just go with foss-2022a for now, so I'll close this.
Test report by @VRehnberg
I was wondering what the right way forward is here, and whether this means "we cannot have a TensorFlow in 2023a". Had a small discussion with @boegel. This page of tested build configurations lists TF 2.15 + cuDNN 8.8 + CUDA 12.2 as a tested combination (and who knows, it might even work with cuDNN 8.9 too). The best way forward is probably to
That's what I usually do too, and it's why there's no PyTorch 2.x with CUDA yet: there are still remaining issues with the CPU version.
(created using eb --new-pr)