feat: UTD Juno cluster config #895
Conversation
Signed-off-by: eternal-flame-AD <yume@yumechi.jp>
Force-pushed from 28d45ab to 8a08ebf
Requesting a review on this (the review-request feature isn't active for me), thanks. I'd also appreciate some guidance on how to systematically test against existing pipelines; I've only tested toy CPU and GPU jobs so far.
@nf-core-bot fix linting please
> All of the intermediate files required to run the pipeline will be stored in the `work/` directory. It is recommended to delete this directory after the pipeline has finished successfully because it can get quite large, and all of the main output files will be saved in the `results/` directory anyway.
> [!NOTE]
> You will need an account to use the HPC cluster on Ganymede in order to run the pipeline.
Personal opinion; maybe it would make sense to have a single profile for UTD systems, to make life easier for users, if the systems are aligned enough?
I am not sure about the entire situation, but we would certainly need an escape hatch; it doesn't seem like access to all clusters will be unified in the near future.
Or are you suggesting we try to detect which system we are running on and then select the queue?
There may be different opinions, and I'm not withholding approval because of this, but to me it typically makes a lot more sense to have a profile for the institution/site/department/provider rather than any number of different profiles for different clusters.
@pontus due to profile/config inheritance issues with DSL2, it was recommended to us (somewhere, I can't find it now) that it's better to have singular config files per cluster rather than 'sub profiles'.
So indeed `utd_juno`, and in the future `utd_ganymede` etc., is valid.
Note it also makes it easier to deprecate older clusters etc.
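For context, the per-cluster pattern described here is roughly how nf-core/configs wires profiles in; a minimal sketch (the `utd_ganymede` entry is illustrative, and the exact include paths are an assumption):

```groovy
// Sketch of per-cluster profiles in nfcore_custom.config: one config file
// per cluster, so retiring a cluster means removing its entry and its file.
profiles {
    utd_juno     { includeConfig "${params.custom_config_base}/conf/utd_juno.config" }
    utd_ganymede { includeConfig "${params.custom_config_base}/conf/utd_ganymede.config" }
}
```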
I'm not sure what issues with DSL2 that would have been, but there are definitely some things related to the upcoming strict syntax that make it harder (though to me, not to a degree where the added configuration-maintenance work outweighs the burden on users).
But I can say I agree this is something one can have different opinions about :)
To me, separate profiles seem to make it harder to deprecate old clusters: instead of just changing a config file and possibly the docs, there's more that needs to be pulled down.
There's a lot of lifting to pick the right Slurm options for GPU choices, but I don't see anything to set
GPU handling for nf-core pipelines isn't really standardised yet, and certainly not at the level queues are defined here, but still the
For testing, simply running through the test profiles for some popular pipelines seems sensible (nf-core templates define at least
I had a couple of discussions with Edmund too about the GPU situation, especially with regard to container environments. This cluster has only one node where you can truly use a single H100 and one A30 GPU as a whole; on the other nodes you have to submit 2 or 4 runs to one machine to make use of them. It's highly related to nextflow-io/nextflow#3909. Currently my personal workaround for Nextflow is a global semaphore program that assigns GPUs on the fly, so I don't explicitly ask for a GPU through Nextflow as of now. See https://docs.sylabs.io/guides/3.5/user-guide/gpu.html#multiple-gpus: you need a GPU ID, but I don't see a portable way to get it out of Nextflow yet.
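As a sketch of the workaround space (not what this PR actually implements): Slurm can export `CUDA_VISIBLE_DEVICES` for a job that requested a GPU via `--gres`, and Singularity forwards any `SINGULARITYENV_`-prefixed variable into the container, so the assigned device id never has to pass through Nextflow itself. The label, queue name, and gres string below are assumptions:

```groovy
// Hypothetical per-process settings: let Slurm assign the device, then
// forward its CUDA_VISIBLE_DEVICES value into the Singularity container.
process {
    withLabel: 'process_gpu' {
        queue            = 'gpu'            // assumed queue name
        clusterOptions   = '--gres=gpu:1'   // Slurm picks the device
        containerOptions = '--nv'           // bind in NVIDIA libraries
        // Single quotes keep the bash expansion literal for the job script:
        beforeScript     = 'export SINGULARITYENV_CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0}'
    }
}
```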
As for GPU support in Singularity: I'm not sure where you see that an id is required, or why you think you'd need to get that out of Nextflow. Essentially, there are two main ways your cluster can provide the devices you have available. One is to have a job-specific

The other is by all the libraries being respectful and just setting

(I don't see this as required, but given the other work with GPUs, it seems to me to make sense to actually help users.)

And I hope you're not running Singularity 3.5, that's quite old nowadays.
Thanks for the suggestions. I think I addressed everything except the discussion regarding unification; I will test some workflows this week.
> ## Heterogeneous/GPU jobs
> Juno is a heterogeneous compute cluster, which means it can accommodate pipelines that require GPUs.
> The config file has a dispatch rule that will automatically assign a queue based on the `accelerator` directive. You can always override this by specifying a queue directly in the `queue` directive.
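A dispatch rule of the kind described could look like this sketch (queue names are assumptions, not the PR's actual values):

```groovy
// Sketch: pick a Slurm queue at runtime from the accelerator directive.
// Jobs that requested an accelerator land on the GPU queue, others don't.
process {
    queue = { task.accelerator ? 'gpu-queue' : 'cpu-queue' }
}
```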
Since the automatic selection will also assign `clusterOptions` that could impede scheduling, one should override that as well?
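In other words, a user-side override would probably need to pin both directives, roughly like this (the process name and gres string are hypothetical):

```groovy
// Hypothetical override: replace both the auto-selected queue and the
// clusterOptions the dispatch rule would otherwise attach.
process {
    withName: 'MY_GPU_TASK' {
        queue          = 'h100'
        clusterOptions = '--gres=gpu:h100:1'
    }
}
```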
> The supported accelerators considered by the profile are NVIDIA H100 and A30 GPUs; you can request them like this:
Not sure how helpful this is in the context of an nf-core config.
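For reference, an accelerator request in Nextflow config looks roughly like this sketch (the label and the `type` strings the profile matches on are assumptions):

```groovy
// Sketch: request one GPU of a given type via the accelerator directive;
// the profile's dispatch rule would then map this to a queue.
process {
    withLabel: 'use_h100' {
        accelerator = [request: 1, type: 'h100']   // or type: 'a30'
    }
}
```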
Approved, with the general opinions stated before still being my opinions :) I'm not sure if the CUDA module load went away? If it was unclear: I think it shouldn't be done if it isn't a GPU job, but it should be in those cases. I also suspect it may be a requirement for Singularity to pick up the GPU libraries and bind them in correctly.
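The suggestion amounts to scoping the module load to GPU processes only, roughly (the label and module name are assumptions about this site's setup):

```groovy
// Sketch: load the CUDA module only for GPU jobs, so CPU jobs are untouched
// and Singularity can find the GPU libraries when binding with --nv.
process {
    withLabel: 'process_gpu' {
        beforeScript = 'module load cuda'
    }
}
```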
Do you still plan to update/merge this PR, @eternal-flame-AD?
Oops, I forgot about this. I will check it and update or merge it this week. Thanks for the reminder!
Thanks, good. I think let's merge this for now. Sorry, the new clusters are not moderated well, and wait times can run to multiple days because of some less considerate users.
name: New Config
about: A new cluster config
Please follow these steps before submitting your PR:

- [ ] If your PR is a work in progress, include `[WIP]` in its title
- [ ] Your PR targets the `master` branch

Steps for adding a new config profile:

- [ ] Add your custom config file to the `conf/` directory
- [ ] Add your documentation file to the `docs/` directory
- [ ] Add your custom profile to the `nfcore_custom.config` file in the top-level directory
- [ ] Add your custom profile to the `README.md` file in the top-level directory
- [ ] Add your custom profile to the `profile:` scope in `.github/workflows/main.yml`
- [ ] Add your custom profile and GitHub user name to `.github/CODEOWNERS` (`**/<custom-profile>** @<github-username>`)