
Add Purdue RCAC institutional profiles (Bell, Gautschi, Negishi)#1085

Merged
pontus merged 11 commits into nf-core:master from aseetharam:purdue-rcac-profiles
Apr 15, 2026

Conversation

@aseetharam
Contributor

@aseetharam aseetharam commented Apr 13, 2026


name: New Config
about: A new cluster config

Please follow these steps before submitting your PR:

  • If your PR is a work in progress, include [WIP] in its title
  • Your PR targets the master branch
  • You've included links to relevant issues, if any

Steps for adding a new config profile:

  • Add your custom config file to the conf/ directory
  • Add your documentation file to the docs/ directory
  • Add your custom profile to the nfcore_custom.config file in the top-level directory
  • Add your profile name to the profile: scope in .github/workflows/main.yml
  • Add your custom profile path and GitHub user name to .github/CODEOWNERS

Summary

Adds three institutional profiles for Purdue University Rosen Center for Advanced Computing (RCAC) HPC clusters:

  • purdue_bell — Bell (AMD EPYC 7662 Rome, 128c/256GB CPU)
  • purdue_gautschi — Gautschi (AMD EPYC 9654 Genoa, 192c/384GB CPU + NVIDIA L40/H100 GPU)
  • purdue_negishi — Negishi (AMD EPYC 7763 Milan, 128c/256GB CPU)

Design notes

  • Separate profiles per cluster; shared structure, cluster-specific partition and resource mappings.
  • Container runtime: Apptainer (system default on all three; /usr/bin/singularity is a symlink).
  • Required user param: --cluster_account (hard-fails if unset; RCAC mandates --account on all jobs). Added to validation.ignoreParams to suppress schema warnings.
  • Opt-in --use_standby true routes eligible jobs through the 4 h standby QoS.
  • Gautschi exposes process_gpu label routing to smallgpu (L40, default) or ai (H100, via --gpu_partition=ai).
  • Bell and Negishi intentionally do not expose GPU labels: their GPU partitions are AMD ROCm hardware, incompatible with CUDA-only nf-core GPU modules.
  • Shared iGenomes mirror at /depot/itap/datasets/igenomes.
  • Container cache and work dir use $RCAC_SCRATCH with $SCRATCH and $HOME fallbacks (works in CI without RCAC env).
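As a rough illustration of how these design notes fit together, a minimal, hypothetical sketch of the profile pattern (error messages and structure are illustrative, not the merged config):

```nextflow
// Hypothetical sketch of the pattern described above; not the merged config.
params {
    cluster_account = null   // required: RCAC mandates --account on all jobs
    use_standby     = false  // opt-in routing through the 4 h standby QoS
}

process {
    executor = 'slurm'
    clusterOptions = {
        // Hard-fail with a clean message if no account is given
        if (!params.cluster_account) {
            System.err.println("ERROR: this profile requires --cluster_account <slurm_account>.")
            System.exit(1)
        }
        def opts = "--account=${params.cluster_account}"
        if (params.use_standby && task.time <= 4.h) {
            opts += ' --qos=standby'
        }
        opts
    }
}
```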

Testing

Each profile validated on its target cluster with:

```bash
nextflow run nf-core/demo -profile test,purdue_<cluster> --cluster_account <acct> --outdir $RCAC_SCRATCH/...
```

Live runs on Bell, Gautschi, and Negishi produced sacct records with the expected Partition, Account, and QOS values. Gautschi additionally validated with --use_standby true to confirm QoS propagation.

Not included

  • purdue_gilbreth (GPU-only cluster; CPU-heavy nf-core steps would waste GPU nodes). Can be added later if a GPU-pipeline use case emerges.
  • Anvil (ACCESS resource); separate effort.

Contact

Arun Seetharam, @aseetharam, aseethar@purdue.edu

@aseetharam
Contributor Author

The three Purdue profile tests (purdue_bell, purdue_gautschi, purdue_negishi) all pass. The 15 failing checks are for other institutions' profiles (alliance_canada, bi, incliva) and repo-wide lint/config jobs. These don't appear related to my changes and look like they may be pre-existing failures on master. Could a maintainer confirm whether I need to address anything here, or if these are upstream issues?


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.



Comment thread docs/purdue_negishi.md Outdated
Comment thread conf/purdue_bell.config Outdated
Comment thread docs/purdue_bell.md Outdated
Comment thread docs/purdue_gautschi.md Outdated
@aseetharam
Contributor Author

@copilot apply changes based on the comments in this thread

3 similar comments

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.



Comment thread conf/purdue_negishi.config Outdated
Comment thread conf/purdue_bell.config Outdated
Comment thread conf/purdue_negishi.config Outdated
Comment thread conf/purdue_gautschi.config Outdated
Comment thread conf/purdue_bell.config Outdated
Comment thread conf/purdue_gautschi.config Outdated
Comment thread conf/purdue_gautschi.config Outdated
Member

@jfy133 jfy133 left a comment


Comments on the first config apply to all subsequent configs :)

Comment thread .github/CODEOWNERS Outdated
Comment thread conf/purdue_bell.config Outdated
Comment thread conf/purdue_bell.config Outdated
Comment thread conf/purdue_bell.config Outdated
Comment thread conf/purdue_bell.config Outdated
Comment thread conf/purdue_bell.config Outdated
Comment thread conf/purdue_bell.config Outdated
@aseetharam
Contributor Author

Thanks for the thorough review. I've pushed a commit addressing all points:

  • CODEOWNERS: switched to the **/purdue_** wildcard, matching the pattern of **/crg**, **/iris**, etc.
  • Removed nextflowVersion pin from all three configs; letting pipelines decide.
  • Added use_standby (and gpu_partition on Gautschi) to validation.ignoreParams.
  • Updated error message to "profile requires..." wording.
  • Switched from throw new IllegalArgumentException to System.err.println + System.exit(1) across all validation closures so users get a clean error instead of a Java stack trace. Kept this inside the clusterOptions closures because top-level validation blocks conflicted with Nextflow 26 strict-config syntax earlier in this PR.
  • Biggest change per your comment on process_high_memory: removed all withLabel resource overrides and replaced with a dynamic top-level queue = { task.memory > 256.GB ? 'highmem' : 'cpu' } closure. The old process_long override (which stripped --qos=standby) is also gone; the standby flag is now gated by task.memory <= 256.GB && task.time <= 4.h inside clusterOptions, so it's inactive for long or high-memory jobs automatically. This means pipelines fully own their resource requests now; the profile just picks the right partition and applies the right account/QoS flags.
  • Removed trace/report/timeline/dag blocks; relying on pipeline defaults.
  • Kept withLabel: process_gpu on Gautschi only (GPU routing is label-based, not memory-based).

Docs updated to document the dynamic routing and the Slurm >= 65 / >= 49 core floors on highmem. Re-reviewed the >=48 vs >=49 convention per your Gautschi-specific suggestion and used the latter.
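The dynamic routing described above could be sketched roughly like this (partition names and thresholds are taken from this thread; everything else is illustrative, not the merged config):

```nextflow
process {
    // Pipelines own their resource requests; the profile only routes them.
    queue = { task.memory > 256.GB ? 'highmem' : 'cpu' }
    clusterOptions = {
        def opts = "--account=${params.cluster_account}"
        // Standby is gated so long or high-memory jobs never request it
        if (params.use_standby && task.memory <= 256.GB && task.time <= 4.h) {
            opts += ' --qos=standby'
        }
        opts
    }
}
```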

Thanks also @pontus for the cleaner error pattern.

@aseetharam aseetharam requested review from jfy133 and pontus April 14, 2026 15:53
```nextflow
System.err.println("ERROR: purdue_gautschi params.gpu_partition must be 'smallgpu' or 'ai' (got '${params.gpu_partition}').")
System.exit(1)
}
params.gpu_partition
```
Member


How many GPU partitions do you have? Not a blocker, but I want to check whether there is a way to have Nextflow automatically pick a partition based on other task attributes (e.g. task.memory for a largegpu partition).

Contributor Author


Two GPU partitions on Gautschi: smallgpu (2x L40, 24 h) and ai (8x H100, 14 d). I deliberately kept this as a user-selectable param rather than dynamic routing because access to GPU partitions on Gautschi is allocation-bound. Users are entitled to the partition tied to their lab's purchase. A user with a smallgpu allocation auto-routed to ai (or vice versa) would just hit a Slurm submission rejection, no fallback. Letting them set --gpu_partition matches the access model and avoids surprises. Happy to add a comment in the config explaining this.
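A sketch of the label-based routing this access model implies (the `--gres` string and overall structure are assumptions for illustration, not the merged config):

```nextflow
process {
    withLabel: process_gpu {
        // User-selected partition: 'smallgpu' (2x L40) or 'ai' (8x H100).
        // Deliberately no automatic fallback: partition access is
        // allocation-bound, so a wrong guess would just be rejected by Slurm.
        queue = { params.gpu_partition }
        clusterOptions = { "--account=${params.cluster_account} --gres=gpu:1" }
    }
}
```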

Comment thread docs/purdue_bell.md Outdated
Comment thread docs/purdue_gautschi.md Outdated
Comment thread docs/purdue_negishi.md Outdated
@jfy133
Member

jfy133 commented Apr 15, 2026

@pontus any last thoughts? If not, please merge if you are happy with this now :) (given we still have other configs blocking)

@pontus
Collaborator

pontus commented Apr 15, 2026

Just to check before merging: there are checks for some environment variables (RCAC_SCRATCH, SCRATCH), and if either of those is set, the Apptainer cache dir is set to that. If those are job-dependent, it seems there will be no persistent cache (which seems bad); if they are global, it seems there will be conflicts with ownership etc. If they are set to some place that's user-unique and repeatable, it seems good, though.

aseetharam and others added 3 commits April 15, 2026 07:05
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
@aseetharam
Contributor Author

@pontus Good catch. $RCAC_SCRATCH on Bell/Gautschi/Negishi is set centrally by RCAC and resolves to /scratch/<cluster>/<username> for every user, persistent across sessions and inherited by Slurm jobs. So it's user-unique and stable, satisfying both your concerns. RCAC also provides /usr/local/bin/findscratch which returns the same path (e.g. /scratch/gautschi/aseethar), so the convention is documented and stable on their end.
The fallback to $SCRATCH is defensive, in case someone runs from an environment that overrides RCAC_SCRATCH (unusual). The final fallback to $HOME is a last resort and not recommended (RCAC home quotas are tight), but it prevents Nextflow from failing outright if both env vars are unset.
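The fallback chain could be expressed along these lines (a hedged sketch; the cache path and option names here are illustrative, and the merged config may differ):

```nextflow
// Resolve a user-unique, persistent scratch base:
// RCAC_SCRATCH (set centrally by RCAC) -> SCRATCH -> HOME (last resort)
def scratchBase = System.getenv('RCAC_SCRATCH')
    ?: System.getenv('SCRATCH')
    ?: System.getenv('HOME')

apptainer {
    enabled    = true
    autoMounts = true
    cacheDir   = "${scratchBase}/.apptainer/nf-core/cache"
}
```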

@pontus pontus merged commit 53b74a0 into nf-core:master Apr 15, 2026
148 of 161 checks passed
@aseetharam aseetharam deleted the purdue-rcac-profiles branch April 15, 2026 12:38