New feature
When submitting to nodes with more than one GPU, Nextflow has very limited ability to split the work across separate GPUs.
Usage scenario
Let's imagine running on AWS Batch. We submit multiple GPU-enabled tasks to the Batch service. AWS places them on a single, large instance with multiple GPUs, according to its allocation strategy, which prioritises the lowest price per CPU.
In this case, every task can see and use all of the GPUs at the same time, leading to collisions and GPU memory issues.
We have some strategies to deal with this:
- Use a specific machine size that can only fit a single GPU-enabled task
- Use environment variables such as NVIDIA_VISIBLE_DEVICES to point each task at a specific GPU
- Set maxForks to 1 to ensure only a single task is executed at once
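A minimal sketch of the second strategy, in Python since the issue has no code of its own. `gpu_env` is a hypothetical helper that builds a per-task environment exposing a single GPU. `CUDA_VISIBLE_DEVICES` is honoured by CUDA applications, while `NVIDIA_VISIBLE_DEVICES` is read by the NVIDIA container runtime when a container starts, so setting it inside an already-running container does not change which devices that container can reach. The GPU index itself still has to come from somewhere, which is the missing piece this issue is about.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional


def gpu_env(gpu_index: int, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of base_env (default: os.environ) restricted to one GPU.

    Hypothetical helper for illustration; the input mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    device = str(gpu_index)
    # Read by CUDA applications: only this device is visible (renumbered as 0).
    env["CUDA_VISIBLE_DEVICES"] = device
    # Read by the NVIDIA container runtime when a container is started.
    env["NVIDIA_VISIBLE_DEVICES"] = device
    return env


if __name__ == "__main__":
    # Example: run a child process that is told to use only GPU 1.
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
        env=gpu_env(1),
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout.strip())  # prints "1"
```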
However, we lack a way of saying "for a queue with x GPUs, assign each task to one available GPU".
What I want is for each task to know which GPU it can use, and then use only that GPU.
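One way a task could discover "its" GPU without help from Nextflow is to claim one atomically on the host. The sketch below is an illustration only, not an existing Nextflow feature: `claim_gpu` and `release_gpu` are hypothetical helpers, and it assumes every task on the same node can see a shared local directory (for example a host path mounted into each container) and knows the number of GPUs on the node. Creating a file with `O_CREAT | O_EXCL` fails if the file already exists, so two tasks cannot claim the same index.

```python
import os
from typing import Optional


def claim_gpu(n_gpus: int, lock_dir: str) -> Optional[int]:
    """Claim the first free GPU index in [0, n_gpus), or return None if all are taken.

    Hypothetical helper: each GPU is represented by a lock file in lock_dir,
    which is assumed to be shared by all tasks on the same host.
    """
    os.makedirs(lock_dir, exist_ok=True)
    for gpu in range(n_gpus):
        path = os.path.join(lock_dir, f"gpu-{gpu}.lock")
        try:
            # Atomic create-if-absent: fails if another task holds this GPU.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))  # record the owner to help debug stale locks
        return gpu
    return None


def release_gpu(gpu: int, lock_dir: str) -> None:
    """Release a previously claimed GPU so another task can use it."""
    os.remove(os.path.join(lock_dir, f"gpu-{gpu}.lock"))
```

A task would call `claim_gpu` at startup, export the returned index (for example as `CUDA_VISIBLE_DEVICES`), and call `release_gpu` when it finishes. A task that crashes without releasing leaves a stale lock, which is one reason this is better handled by the scheduler or executor.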
Suggest implementation
I don't have a good fix for this. Perhaps a process array with an index could help? Perhaps it's specific to each executor? Either way, I feel Nextflow could expose a variable (such as the GPU assigned to the task) to help here.
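The index-based idea can be sketched as a round-robin mapping. This assumes a hypothetical 1-based per-task index and a known GPU count per node; `gpu_for_task` and `assignments` are illustrative names, not Nextflow API.

```python
from typing import List


def gpu_for_task(task_index: int, n_gpus: int) -> int:
    """Map a 1-based task index to a GPU index in round-robin order."""
    if task_index < 1 or n_gpus < 1:
        raise ValueError("task_index and n_gpus must both be >= 1")
    return (task_index - 1) % n_gpus


def assignments(n_tasks: int, n_gpus: int) -> List[int]:
    """GPU index chosen for each of tasks 1..n_tasks."""
    return [gpu_for_task(i, n_gpus) for i in range(1, n_tasks + 1)]


if __name__ == "__main__":
    print(assignments(6, 4))  # [0, 1, 2, 3, 0, 1]
```

This needs no coordination between tasks, but it only guarantees one task per GPU when at most `n_gpus` tasks share a node and they start roughly in index order. Tasks 1 and 5 both map to GPU 0, so if task 1 is still running when task 5 lands on the same node, they collide. A coordinated scheme avoids that at the cost of shared state, which suggests the executor may be the better place to solve this.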