
Bug: GPU pipeline fails in Kubernetes/GKE (libcuda.so.1 / nvidia-smi missing from PATH) #2001

@Connah-rs

Description


When running the opendronemap/odm:gpu image in a Kubernetes environment (specifically Google Kubernetes Engine) using standard GPU tolerations, the ODM pipeline fails to utilize the GPU and crashes during the openmvs stage.

The pipeline reports [INFO] No nvidia-smi detected, passes --cuda-device -2 to OpenMVS, and subsequently crashes with error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory:

[2026-03-10, 09:29:09 UTC] [INFO]    Estimating depthmaps
[2026-03-10, 09:29:09 UTC] [INFO]    No nvidia-smi detected
[2026-03-10, 09:29:09 UTC] [INFO]    running "/code/SuperBuild/install/bin/OpenMVS/DensifyPointCloud" [...] -v 0 --cuda-device -2
[2026-03-10, 09:29:09 UTC] /code/SuperBuild/install/bin/OpenMVS/DensifyPointCloud: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
[2026-03-10, 09:29:09 UTC] Child returned 127

(Log truncated for brevity)

To Reproduce

  1. Deploy opendronemap/odm:gpu in a Kubernetes cluster requesting nvidia.com/gpu: 1.
  2. Run standard ODM pipeline arguments (e.g., --dsm --dtm --pc-quality high).
  3. Observe the logs during the openmvs stage.
  4. The pipeline fails with a Child returned 127 SubprocessException.

Expected Behavior

The container should detect the mounted GPU via nvidia-smi, correctly load the NVIDIA shared libraries, and execute the OpenMVS stage using --cuda-device -1 (or the appropriate GPU ID) without crashing.

Root Cause & Workaround

Unlike docker run --gpus all (which actively alters the container's environment variables at runtime to inject NVIDIA paths), Kubernetes device plugins simply mount the hardware files into /usr/local/nvidia and rely on the image's ENV instructions to make them discoverable.

Currently, the gpu.Dockerfile causes issues in Kubernetes for two reasons:

  1. The $PATH issue: /usr/local/nvidia/bin is missing from the system $PATH. When run.py invokes nvidia-smi via subprocess.run, the lookup fails, so the pipeline assumes no GPU exists and passes --cuda-device -2 to OpenMVS.
  2. The $LD_LIBRARY_PATH issue: gpu.Dockerfile sets the path via ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/code/SuperBuild/install/lib". This correctly appends the ODM paths to the CUDA base image paths at build time, but the resulting value never includes /usr/local/nvidia/lib64, where the Kubernetes device plugin mounts libcuda.so.1 and libnvidia-ml.so, so the dynamic linker cannot resolve them at runtime.
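The $PATH failure can be illustrated with a short sketch (hypothetical; run.py's actual detection code may differ, but the effect on the --cuda-device argument matches the logs above):

```python
import shutil

def cuda_device_arg(path_env: str) -> int:
    """Mimic ODM-style GPU detection: look up nvidia-smi on the given
    PATH and return the value passed to OpenMVS as --cuda-device."""
    if shutil.which("nvidia-smi", path=path_env) is None:
        print("[INFO] No nvidia-smi detected")
        return -2  # CUDA disabled; OpenMVS then fails to load libcuda.so.1
    return -1  # default GPU

# The base image PATH lacks /usr/local/nvidia/bin, so detection fails
# even though the K8s device plugin mounted nvidia-smi there.
base_path = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
print(cuda_device_arg(base_path))
```

Prepending /usr/local/nvidia/bin to the PATH string handed to the lookup is exactly what the workaround and proposed Dockerfile change accomplish.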

I successfully worked around this by manually overriding the environment variables in the Kubernetes Pod spec to explicitly include the NVIDIA mount paths:

env:
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: "compute,utility"
  - name: LD_LIBRARY_PATH
    value: "/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/code/SuperBuild/install/lib"
  - name: PATH
    value: "/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
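For context, a minimal Pod spec carrying these overrides might look like the following (the pod and container names are illustrative; image tag, resource request, and env values are as reported above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: odm-gpu  # illustrative name
spec:
  containers:
    - name: odm
      image: opendronemap/odm:gpu
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        - name: LD_LIBRARY_PATH
          value: "/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/code/SuperBuild/install/lib"
        - name: PATH
          value: "/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
```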

Proposed Solution

Could the specific NVIDIA runtime paths be explicitly prepended to the ENV definitions inside gpu.Dockerfile?
For example:

ENV PATH="/usr/local/nvidia/bin:$PATH" \
    LD_LIBRARY_PATH="/usr/local/nvidia/lib64:/usr/local/nvidia/lib:$LD_LIBRARY_PATH:/code/SuperBuild/install/lib"

This would make the image immediately compatible out-of-the-box for Kubernetes/Cloud deployments without users having to manually map environment variables.

A Note on Docker Tags

I noticed that the opendronemap/odm:gpu tag acts effectively as a "latest" tag and is automatically updated with commits to the master branch. This recently broke our automated pipelines unexpectedly (likely related to commit 44e3ff6, which appears to have changed the underlying CUDA base image and thereby the default system paths that were previously working). Would it be possible to publish versioned GPU tags (e.g., odm:3.6.0-gpu) on Docker Hub so that we can pin to stable releases in production environments?

Thank you to all the contributors for the incredible work on this project!
