
feat: UTD Juno cluster config #895

Merged
eternal-flame-AD merged 6 commits into nf-core:master from eternal-flame-AD:master on Jun 22, 2025

Conversation

@eternal-flame-AD
Contributor


name: New Config
about: A new cluster config

Please follow these steps before submitting your PR:

  • If your PR is a work in progress, include [WIP] in its title
  • Your PR targets the master branch
  • You've included links to relevant issues, if any

Steps for adding a new config profile:

  • Add your custom config file to the conf/ directory
  • Add your documentation file to the docs/ directory
  • Add your custom profile to the nfcore_custom.config file in the top-level directory
  • Add your custom profile to the README.md file in the top-level directory
  • Add your profile name to the profile: scope in .github/workflows/main.yml
  • OPTIONAL: Add your custom profile path and GitHub user name to .github/CODEOWNERS (**/<custom-profile>** @<github-username>)
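For orientation, a minimal sketch of what such a custom config typically contains (illustrative values only, not this PR's actual conf/utd_juno.config):

```groovy
// Minimal sketch of an institutional config; all values are illustrative,
// not the actual conf/utd_juno.config added by this PR.
params {
    config_profile_description = 'UT Dallas Juno cluster profile'
    config_profile_contact     = 'Your Name (@your-github-handle)'   // hypothetical contact
    config_profile_url         = 'https://example.edu/hpc'           // hypothetical URL
}

process {
    executor = 'slurm'
    queue    = 'normal'   // hypothetical partition name
}

singularity {
    enabled    = true
    autoMounts = true
}
```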

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>
@eternal-flame-AD eternal-flame-AD changed the title from [WIP] feat: UTD Juno cluster config to feat: UTD Juno cluster config on Apr 26, 2025
@eternal-flame-AD
Contributor Author

Requesting a review on this (the request-a-review feature isn't active for me), thanks.

I would also appreciate some guidance on how to systematically test against existing pipelines; I have only tested toy CPU and GPU jobs so far.

@pontus
Collaborator

pontus commented Apr 27, 2025

@nf-core-bot fix linting please

Comment thread conf/utd_juno.config Outdated
Comment thread .github/CODEOWNERS
Comment thread docs/utd_juno.md
All of the intermediate files required to run the pipeline will be stored in the `work/` directory. It is recommended to delete this directory after the pipeline has finished successfully because it can get quite large, and all of the main output files will be saved in the `results/` directory anyway.

> [!NOTE]
> You will need an account to use the Juno HPC cluster in order to run the pipeline.
Collaborator

Personal opinion: maybe it would make sense to have a single profile for UTD systems, to make life easier for users, if the systems are aligned enough?

Contributor Author

I am not sure about the entire situation, but we would certainly need an escape hatch; it doesn't seem like access to all the clusters will be unified in the near future.

Or are you suggesting we try to detect which system we are running on, and then select the queue?

Collaborator

There may be different opinions, and I'm not withholding approval because of this, but to me it typically makes a lot more sense to have a profile for the institution/site/department/provider rather than any number of different profiles for different clusters.

Member

@pontus due to profile/config inheritance issues with DSL2, it was recommended to us (somewhere, I can't find it now) that it's better to have singular config files per cluster rather than 'sub profiles'.

So indeed utd_juno, and in the future utd_ganymede etc., is valid.

Note it also makes it easier to deprecate older clusters etc.

Collaborator

I'm not sure what issues with DSL2 those would have been, but there are definitely some things related to the upcoming strict syntax that make it harder (though to me, not to a degree where the added work of configuration maintenance outweighs the burden on users).

But I can say I agree this is something one can have different opinions about :)

To me, it seems having separate profiles makes it harder to deprecate old clusters: rather than just changing a config file and possibly the docs, there's more that needs to be pulled down.

Comment thread conf/utd_juno.config Outdated
@pontus
Collaborator

pontus commented Apr 27, 2025

There's a lot of lifting to pick the right Slurm options for GPU choices, but I don't see anything setting singularity.runOptions properly to provide those GPUs inside the container.
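For reference, a minimal sketch (not the PR's actual config) of what that could look like; Singularity's `--nv` flag binds the host's NVIDIA driver libraries and device files into the container:

```groovy
// Minimal sketch, assuming Singularity with NVIDIA GPUs: `--nv` binds the
// host NVIDIA driver libraries and /dev/nvidia* devices into the container.
singularity {
    enabled    = true
    runOptions = '--nv'
}
```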

@pontus
Collaborator

pontus commented Apr 27, 2025

GPU handling for nf-core pipelines isn't really standardised yet, and certainly not at the level queues are defined here, but the process_gpu label is already used by many modules, so it might be nice to try to provide a GPU for those.
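As a hedged illustration of what such routing could look like (the queue name and Slurm options are assumptions, not Juno's actual values):

```groovy
// Hedged sketch: route processes carrying the nf-core process_gpu label to
// a GPU partition. 'gpu' and the gres string are hypothetical values.
process {
    withLabel: 'process_gpu' {
        queue          = 'gpu'             // hypothetical GPU partition
        clusterOptions = '--gres=gpu:1'    // ask Slurm for one GPU
    }
}
```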

@pontus
Collaborator

pontus commented Apr 27, 2025

For testing, simply running through the test profiles for some popular pipelines seems sensible (the nf-core template defines at least test and test_full profiles that can be expected to be available).
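For example (an illustrative invocation; any nf-core pipeline follows the same pattern), something like `nextflow run nf-core/rnaseq -profile test,utd_juno --outdir <outdir>` would exercise the new config against a pipeline's bundled test data.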

@eternal-flame-AD
Contributor Author

eternal-flame-AD commented Apr 27, 2025

I had a couple of discussions with Edmund too about the GPU situation, especially with regard to container environments. This cluster has only one node that truly has a single H100 and one A30 GPU you can use as a whole; on the other nodes you have to submit 2 or 4 runs to one machine to make use of them. It's highly related to nextflow-io/nextflow#3909.

Currently my personal workaround for Nextflow is a global semaphore program that assigns GPUs on the fly, so I don't explicitly ask for a GPU through Nextflow as of now.

See https://docs.sylabs.io/guides/3.5/user-guide/gpu.html#multiple-gpus: you need a GPU ID, but I don't see a portable way to get one out of Nextflow yet.

@pontus
Collaborator

pontus commented Apr 28, 2025

As for GPU support in Singularity: I'm not sure where you see that an ID is required, or why you think you'd need to get that out of Nextflow.

Essentially, there are two main ways your cluster can provide the devices you have available. One is to have a job-specific /dev mount that only contains the devices for the cards the job has been granted (this works right off).

The other is for all the libraries to be respectful and just set CUDA_VISIBLE_DEVICES. I think that should work as well, since Singularity should inherit it into the container; if not, you can prefix it with SINGULARITYENV_ to have Singularity set it explicitly inside. The same goes for NVIDIA_VISIBLE_DEVICES, and I suppose you can cross-use them if needed. Anything that needs to be done there should be possible with a beforeScript.
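A minimal sketch of that beforeScript approach, assuming Slurm has already set CUDA_VISIBLE_DEVICES for the job (nothing here is from the PR itself):

```groovy
// Hedged sketch: forward the GPU list Slurm granted the job into the
// container. Singularity sets any SINGULARITYENV_-prefixed variable
// (minus the prefix) inside the container environment.
process {
    withLabel: 'process_gpu' {
        beforeScript = 'export SINGULARITYENV_CUDA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES"'
    }
}
```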

(I don't see this as required, but given the other work with GPUs, it seems to me to make sense to actually help users.)

And I hope you're not running Singularity 3.5; that's quite old nowadays.

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>
Signed-off-by: eternal-flame-AD <yume@yumechi.jp>
@eternal-flame-AD
Contributor Author

Thanks for the suggestions. I think I addressed everything except the discussion regarding unification; I will test some workflows out this week.

Comment thread docs/utd_juno.md
## Heterogeneous/GPU jobs

Juno is a heterogeneous compute cluster, which means it can accommodate pipelines that require GPUs.
The config file has a dispatch rule that will automatically assign a queue based on the accelerator directive. You can always override this by specifying a queue directly in the `queue` directive.
Collaborator

Since the automatic selection will also assign clusterOptions that will impede scheduling, one should override that as well?
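A hedged sketch of what such a dispatch rule plus matching clusterOptions could look like (queue names and options are assumptions, not the PR's actual values), which also shows why overriding `queue` alone may not be enough:

```groovy
// Hedged sketch of accelerator-based dispatch; 'gpu'/'normal' and the
// gres string are illustrative. Overriding only `queue` would leave the
// automatically assigned clusterOptions in place, so both may need overriding.
process {
    queue          = { task.accelerator ? 'gpu' : 'normal' }
    clusterOptions = { task.accelerator ? "--gres=gpu:${task.accelerator.request}" : '' }
}
```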

Comment thread docs/utd_juno.md

The supported accelerators considered by the profile are NVIDIA H100 and A30 GPUs; you can request them like this:
Collaborator

Not sure how helpful this is in the context of an nf-core config.
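The code example that followed this sentence in the docs is not reproduced in the thread; for illustration only, a request via Nextflow's accelerator directive generally looks like this (the type string is hypothetical):

```groovy
// Hypothetical illustration, not the PR's actual docs example: request one
// GPU of a specific type with the accelerator directive.
process NEEDS_GPU {
    accelerator 1, type: 'nvidia-h100'   // type string is platform-specific

    script:
    """
    nvidia-smi
    """
}
```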

@pontus
Collaborator

pontus commented Apr 30, 2025

Approved, with general opinions stated before still being my opinions :)

I'm not sure if the cuda module load went away? If it was unclear: I think it shouldn't be done if it isn't a GPU job, but it should be done in those cases. I also suspect it may be required for Singularity to pick up the GPU libraries to bind them in correctly.
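A minimal sketch of that conditional load, assuming an environment-modules setup (the module name is illustrative, not the PR's actual config):

```groovy
// Hedged sketch: load the CUDA module only for GPU-labelled processes,
// leaving CPU-only jobs untouched. 'cuda' is an illustrative module name.
process {
    withLabel: 'process_gpu' {
        beforeScript = 'module load cuda'
    }
}
```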

@jfy133
Member

jfy133 commented Jun 18, 2025

Do you still plan to update/merge this PR @eternal-flame-AD ?

@eternal-flame-AD
Contributor Author

Oops, I forgot about this. I will check it and update or merge it this week. Thanks for the reminder!

@eternal-flame-AD eternal-flame-AD self-assigned this Jun 18, 2025
@eternal-flame-AD eternal-flame-AD merged commit 72503da into nf-core:master Jun 22, 2025
151 checks passed
@eternal-flame-AD
Contributor Author

Thanks, good. I think we should merge this for now.

Sorry for the delay; the new clusters are not well moderated, and wait times can be multiple days because of some less considerate users.
