allow integer sample names #1421

Merged
pinin4fjords merged 8 commits into nf-core:dev from idot:allow_int_samplenames
May 11, 2026
Conversation

@idot
Contributor

@idot idot commented Oct 18, 2024

Integers have not been accepted as sample names since 3.16.0. Validation of a samplesheet with an integer sample name fails with:
Validation of file failed:
-> Entry 1: Error for field 'sample' (298098): Sample name must be provided and cannot contain spaces
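
A minimal Python sketch of why a digit-only sample name can trip this check, assuming the `sample` field was previously typed as string only and that digit-only cells are parsed as integers before validation. `matches_type` and `parse_cell` are hypothetical helpers written for illustration; they are not part of nf-schema or the pipeline.

```python
def matches_type(value, allowed):
    """Simplified stand-in for a JSON Schema "type" check (assumption)."""
    checks = {
        "string": lambda v: isinstance(v, str),
        # bool is a subclass of int in Python, so exclude it explicitly
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    }
    return any(checks[name](value) for name in allowed)


def parse_cell(text):
    """Mimic a parser that turns digit-only CSV cells into integers (assumption)."""
    return int(text) if text.isascii() and text.isdigit() else text


sample = parse_cell("298098")  # becomes the integer 298098

old_ok = matches_type(sample, ["string"])             # string-only schema
new_ok = matches_type(sample, ["string", "integer"])  # widened schema

print(old_ok, new_ok)
```

Under these assumptions the string-only check rejects the value and the widened type list accepts it, which is the shape of the change this PR proposes.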

PR checklist

  • [*] This comment contains a description of changes (with reason).

  • CHANGELOG.md is updated.

@github-actions

github-actions Bot commented Oct 18, 2024

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.0.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@idot idot changed the base branch from master to dev October 18, 2024 11:45
@idot
Contributor Author

idot commented Oct 18, 2024

fixes #1419

@pinin4fjords
Member

pinin4fjords commented Jan 16, 2025

@idot can you confirm that you've tested the workflow with this change? Also, please update the CHANGELOG.

@idot
Contributor Author

idot commented Mar 3, 2025

I have updated the changelog. There was a discussion on Slack, and the developers wanted a more comprehensive solution; however, the error is still present in 3.18.0. I have also tested 3.18 with this change.

Member

@MatthiasZepper MatthiasZepper left a comment

Mind that R, by default, does not allow purely numeric column names in a data frame. If you try assigning one, it is automatically prefixed with X:

> example <- data.frame("123345"=LETTERS)
> head(example)
  X123345
1       A
2       B
3       C
4       D
5       E
6       F

So you will end up with non-numeric sample names in your quantification anyway. I would want a very careful review of all R scripts to check whether any data-merging steps fail silently, e.g. the scaling/normalization just defaulting to 1 for each sample.

I fear there might be more subtle issues that do not show instantly by a crashing pipeline run.
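
The kind of silent failure described above can be sketched in Python. This is an illustration only: `r_style_name` is a hypothetical, simplified stand-in for R's name mangling (it only handles the leading-digit case), and the size-factor values are made up.

```python
import re


def r_style_name(name):
    """Hypothetical, simplified stand-in: prefix 'X' to names starting with a digit."""
    return "X" + name if re.match(r"\d", name) else name


# Per-sample factors keyed by the original sample IDs (illustrative values).
size_factors = {"298098": 0.8, "298504": 1.3}

# Column names as they would appear after mangling.
count_columns = [r_style_name(s) for s in size_factors]

# A lookup with a fallback hides the mismatch instead of raising an error.
applied = {col: size_factors.get(col, 1.0) for col in count_columns}

print(applied)
```

Every lookup misses because the keys no longer match, so each sample quietly receives the fallback of 1.0 and nothing crashes. That is why a review of the merging steps matters more than a passing run.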

@pinin4fjords
Member

I did at some point go through and put check.names = FALSE in various places to avoid this.

@idot said that they had tested this, so fingers crossed that was effective.

@idot
Contributor Author

idot commented Mar 13, 2025

Yes, in the R part the sample names get an X prepended

@pinin4fjords
Member

OK, then we need to do some work to address that before this is merged.

@idot idot force-pushed the allow_int_samplenames branch from aee4a2f to 48b6d12 Compare June 12, 2025 08:56
@idot idot force-pushed the allow_int_samplenames branch from 48b6d12 to e1c52e1 Compare June 12, 2025 08:58
idot and others added 7 commits June 17, 2025 16:26
Schema accepts ["string", "integer"] for the sample column; meta.id is
coerced to String after samplesheetToList so numeric IDs propagate as
strings through channel keys, file names, and R column headers.

Closes nf-core#1419
Sample IDs (298098, 298504, 317960, 319093) propagate as strings through
file names, merged gene-count column headers, and DESeq2 PCA output - no
R X-prefixing because every callsite already passes check.names = FALSE.
…tiQC modes

Add three more cases to tests/integer_samplenames.nf.test:

- full --skip_quantification_merge run: validates the per-sample MultiQC
  code path (one report per integer-named sample, each with its own
  table_sample_merge lookbehind);
- --aligner hisat2 stub: validates the non-STAR alignment branch;
- pseudo-only kallisto stub: validates the --skip_alignment route.

Verified per-sample MultiQC output: each sample's report carries its
integer ID verbatim, with Read 1 / Read 2 rows for PE samples and a
single row for the SE sample.
- Drop the integer-id-specific nf-test + fixture: other sample-naming
  niceties aren't tested here either.
- Strip the verbose coercion comment.
- Move the CHANGELOG entry to its numeric slot.
@pinin4fjords
Member

pinin4fjords commented May 11, 2026

Pushed an update via maintainer-edit. Why integer sample IDs are safe now, in two steps:

1. Inside Nextflow channels: types stay consistent. With type: ["string", "integer"], nf-schema can hand us an Integer in meta.id. The multiqc_rnaseq subworkflow joins channels keyed by meta.id (typed) against per-sample TSVs whose IDs come from filename parsing (always String). Integer-keyed vs String-keyed .join silently drops samples in Groovy. So this push adds a meta.id as String coercion right after samplesheetToList in workflows/rnaseq/main.nf, before any downstream channel work.

2. Inside R aggregators: no X prefix. Audited the three places that build sample-column tables, all already pass check.names = FALSE:

  • bin/deseq2_qc.r:58
  • modules/nf-core/summarizedexperiment/.../summarizedexperiment.r:16-33
  • modules/nf-core/tximeta/tximport/templates/tximport.r:46, 101, 112, 125, 212

bin/dupradar.r and the Python scripts only use meta.id as a filename prefix, so no column-name path. So @MatthiasZepper's concern doesn't materialise in this pipeline.
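
The type mismatch in step 1 can be illustrated with a small Python analogy. This is not the Nextflow code from the PR: `join_by_key` is a hypothetical helper that mimics an inner join on a key, and the channel contents are made up.

```python
def join_by_key(left, right):
    """Inner join of two lists of (key, value) pairs; unmatched keys are dropped."""
    lookup = dict(right)
    return [(k, v, lookup[k]) for k, v in left if k in lookup]


# IDs as they might arrive from a typed samplesheet: integers.
meta_channel = [(298098, "meta-A"), (298504, "meta-B")]
# IDs recovered from file names: always strings.
tsv_channel = [("298098", "a.tsv"), ("298504", "b.tsv")]

# Integer keys never equal string keys, so every sample is dropped silently.
dropped = join_by_key(meta_channel, tsv_channel)

# Coercing the ID to a string once, up front, restores the match.
coerced = [(str(k), v) for k, v in meta_channel]
joined = join_by_key(coerced, tsv_channel)

print(dropped)
print(joined)
```

Coercing at the point of entry, rather than at each join, keeps every downstream key comparison string-to-string. That is the design choice described for the `meta.id as String` step.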

Empirical check. Ran the pipeline against a samplesheet with the IDs from #1419 (298098, 298504, 317960, 319093). Five configurations: default full+stub, --skip_quantification_merge full, --aligner hisat2 stub, pseudo-only kallisto stub. All pass; integer IDs verbatim in every output:

$ head -2 salmon.merged.gene_counts.tsv
gene_id	gene_name	298098	298504	317960	319093
Gfp_transgene_gene	Gfp_transgene_gene	0	0	0	0

$ head -5 deseq2.pca.vals.txt
"sample"	"PC1: 52% variance"	"PC2: 27% variance"
"298098"	-1.30898138940436	0.439079416717309
"298504"	-0.872656356507467	-0.654613144978441
"319093"	0.969970988497281	1.06605337716581
"317960"	1.21166675741454	-0.850519648904681

$ cut -f1 multiqc_general_stats.txt | head -7
Sample
298098
298098 Read 1
298098 Read 2
298504
298504 Read 1
298504 Read 2

Per-sample MultiQC under --skip_quantification_merge produced four reports at output/<sample>/multiqc/star_salmon/<sample>_multiqc_report.html, each carrying its own integer ID.
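
A check like the `head` inspection above could be automated along these lines. This is a sketch only: `check_sample_columns` is a hypothetical helper, and it assumes the first two columns of the merged table are `gene_id` and `gene_name`, as in the output shown.

```python
def check_sample_columns(header_line, expected_ids, fixed_columns=2):
    """Return (missing IDs, X-prefixed columns) for a tab-separated header line."""
    columns = header_line.rstrip("\n").split("\t")[fixed_columns:]
    missing = [s for s in expected_ids if s not in columns]
    prefixed = [c for c in columns if c.startswith("X") and c[1:] in expected_ids]
    return missing, prefixed


# Header copied from the salmon.merged.gene_counts.tsv output above.
header = "gene_id\tgene_name\t298098\t298504\t317960\t319093"
ids = ["298098", "298504", "317960", "319093"]

missing, prefixed = check_sample_columns(header, ids)
print(missing, prefixed)
```

Both lists come back empty for the header shown. A header that had been mangled to `X298098` would be reported in both.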

Also in this push: reverted the unrelated umi dedup fix for mqc commit (can ship separately if still wanted), merged in current origin/dev (branch was ~1300 commits behind), and slotted the CHANGELOG entry into its numeric position. No regression test for integer IDs, mirroring the rest of the suite which doesn't carry per-naming-quirk cases.

@pinin4fjords
Member

pinin4fjords commented May 11, 2026

I'm going to merge this now; I think it should be fine. We'll deal with any remaining corner cases if and when they pop up. For now I see no need to block integer IDs.

(Edit: once I figure out why there's a snapshot mismatch)

@pinin4fjords pinin4fjords enabled auto-merge May 11, 2026 13:35
@pinin4fjords pinin4fjords dismissed MatthiasZepper’s stale review May 11, 2026 14:13

I've tested this, I think we've headed off the R issues.

@pinin4fjords pinin4fjords merged commit 9db8dbf into nf-core:dev May 11, 2026
123 of 125 checks passed