Is your feature request related to a problem? Please describe.
[Despite semi-extensive discussion, I just realised this isn't described anywhere]
Currently the nf-core/nextflow default is to process FASTQ files together only as related Forward/Reverse pairs.
It does not deal with multiple lanes, at least not in the way aDNA people normally do. Examples are a NextSeq run, where a single library injection is spread across 4 related lanes of a single flow cell, or cases where a single library is sequenced across multiple lanes of one or more HiSeq flow cells.
We typically want to combine all sets of FASTQs from different lanes together, most importantly prior to DeDuplication. This is because we may have PCR duplicates from the same library spread across different lanes, and to accurately calculate the duplication rate/cluster factor (an important QC metric) we need to account for all duplicates together.
TL;DR: Ideally we would like a method to indicate how/when to combine FASTQ files from multiple lanes.
Describe the solution you'd like
Following previous discussion: provide a TSV file in which one column gives the base name shared by all the FASTQs to merge, and which accordingly indicates which files are associated with which base name.
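To make the proposal concrete, here is a minimal sketch of how such a TSV could be read and turned into merge groups. The column names `base_name` and `fastq`, the file names, and the function `group_fastqs_by_base_name` are all hypothetical placeholders for illustration; the actual layout is still open for discussion.

```python
import csv
import io
from collections import defaultdict


def group_fastqs_by_base_name(tsv_text):
    """Map each base name to the list of FASTQ files listed under it.

    Assumes a header row with the (hypothetical) columns
    'base_name' and 'fastq'.
    """
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        groups[row["base_name"]].append(row["fastq"])
    return dict(groups)


# Hypothetical example: one library spread over two lanes,
# and a second library sequenced on a single lane.
example_tsv = (
    "base_name\tfastq\n"
    "LIB01\tLIB01_L001_R1.fastq.gz\n"
    "LIB01\tLIB01_L002_R1.fastq.gz\n"
    "LIB02\tLIB02_L001_R1.fastq.gz\n"
)

merge_groups = group_fastqs_by_base_name(example_tsv)
for base_name, fastqs in merge_groups.items():
    print(base_name, fastqs)
```

An explicit table like this avoids guessing the grouping from file names, at the cost of the user having to write the file.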
I personally think everything from FASTQC to samtools filtering can be done on each read pair independently, but the BAM files must be combined before DeDuplication. This is mainly because the metrics up to that point can be calculated manually by summing across each FASTQ/BAM, whereas from DeDuplication onwards the metric can only be generated with all reads from a library analysed at once.
Note that the input channel for Qualimap would have to be changed to account for this, i.e. it would no longer come straight from samtools filtering (line 937 of main.nf).
This is something we can discuss though.
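As a toy illustration of the point about metrics above, the sketch below compares duplicate counts taken per lane and then summed against counts taken on the merged lanes. It assumes a simplified model in which two reads are duplicates when they share the same (chromosome, start, end) coordinates; this is not how any particular deduplication tool decides, and the function name and data are made up for the example.

```python
from collections import Counter


def count_duplicates(reads):
    """Count reads beyond the first one at each alignment position."""
    position_counts = Counter(reads)
    return sum(n - 1 for n in position_counts.values())


# Made-up alignments (chromosome, start, end) for one library on two lanes.
lane1 = [("chr1", 100, 150), ("chr1", 100, 150), ("chr1", 300, 360)]
lane2 = [("chr1", 100, 150), ("chr1", 300, 360), ("chr2", 50, 110)]

# Read counts are additive, so they can be summed across lanes afterwards.
total_reads = len(lane1) + len(lane2)

# Duplicate counts are not additive: duplicates that sit on different
# lanes are only seen once the lanes are analysed together.
per_lane_sum = count_duplicates(lane1) + count_duplicates(lane2)
merged = count_duplicates(lane1 + lane2)

print(total_reads, per_lane_sum, merged)
```

In this made-up case the per-lane sum finds 1 duplicate while the merged data contains 3, so a duplication rate derived per lane would be an underestimate.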
Describe alternatives you've considered
Some form of flag taking a regular expression that describes how to combine files; but due to the large heterogeneity in how files are named at different sequencing centres, this was deemed too complicated to account for all cases.
Additional context
N/A