allow integer sample names#1421
Conversation
|
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.0.2. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
|
fixes #1419 |
|
@idot can you confirm that you've tested the workflow with this change? Also, please update the CHANGELOG. |
|
I have updated the changelog. There was a discussion on slack and the developers wanted a more comprehensive solution however in 3.18.0 the error is still there. I have tested also 3.18 with this change. |
There was a problem hiding this comment.
Mind that R does not allow purely numeric column names. If you try assigning one, it will be automatically prepended with X:
> example <- data.frame("123345"=LETTERS)
> head(example)
X123345
1 A
2 B
3 C
4 D
5 E
6 FSo you will anyway end up with non-numeric sample names in your quantification and I would want to perform a very careful review of all R scripts, whether some data merging steps fail, e.g. the scaling/normalization just defaults to 1 for each sample etc.
I fear there might be more subtle issues that do not show instantly by a crashing pipeline run.
I did at some point go through and put @idot said that they had test this, so fingers crossed that was effective. |
|
Yes, in the R part the sample names get an X prepended |
OK, then we need to do some work to address that before this is merged. |
aee4a2f to
48b6d12
Compare
Validation of file failed:
-> Entry 1: Error for field 'sample' (298098): Sample name must be provided and cannot contain spaces
48b6d12 to
e1c52e1
Compare
This reverts commit 7d3daa4.
Schema accepts ["string", "integer"] for the sample column; meta.id is coerced to String after samplesheetToList so numeric IDs propagate as strings through channel keys, file names, and R column headers. Closes nf-core#1419
Sample IDs (298098, 298504, 317960, 319093) propagate as strings through file names, merged gene-count column headers, and DESeq2 PCA output - no R X-prefixing because every callsite already passes check.names = FALSE.
…tiQC modes Add three more cases to tests/integer_samplenames.nf.test: - full --skip_quantification_merge run: validates the per-sample MultiQC code path (one report per integer-named sample, each with its own table_sample_merge lookbehind); - --aligner hisat2 stub: validates the non-STAR alignment branch; - pseudo-only kallisto stub: validates the --skip_alignment route. Verified per-sample MultiQC output: each sample's report carries its integer ID verbatim, with Read 1 / Read 2 rows for PE samples and a single row for the SE sample.
- Drop the integer-id-specific nf-test + fixture: other sample-naming niceties aren't tested here either. - Strip the verbose coercion comment. - Move the CHANGELOG entry to its numeric slot.
|
Pushed an update via maintainer-edit. Why integer sample IDs are safe now, in two steps: 1. Inside Nextflow channels: types stay consistent. With 2. Inside R aggregators: no
Empirical check. Ran the pipeline against a samplesheet with the IDs from #1419 ( Per-sample MultiQC under Also in this push: reverted the unrelated |
|
I'm going to merge this now, I think it should be fine. We'll deal with any final corners if and when they pop up for for now I see no need to block integer IDs. (Edit: once I figure out why there's a snapshot mismatch) |
I've tested this, I think we've headed off the R issues.
int was not allowed as sample name anymore (since 3.16.0)
Validation of file failed:
-> Entry 1: Error for field 'sample' (298098): Sample name must be provided and cannot contain spaces
PR checklist
[*] This comment contains a description of changes (with reason).
CHANGELOG.mdis updated.