After discussion with @apeltzer, the problem is indeed the use of prefix variables. Shortening file names consistently is very difficult because the delimiters used to split them differ between sequencing centres. E.g. some places will use `_` to delimit sections of file names, others will use `.`.
As an alternative to the 'prefix' system, we propose a tagging system, where each module has a 'descriptive' and unique tag appended to the output BAM/FASTQ file (but with the input file's suffix stripped).
For example, each output file would be generated following something like this (using bash notation):
`${input_file%%.bam}.xxX.bam`
where `xxX` is the tag of the module.
(Note that the first processing step on the 'raw' input file will need to consider that both `fastq` and `fq` extensions are accepted; everything downstream is specified by the pipeline, so this is not an issue there.)
I make the following proposal based on the current version of EAGER (`-r 2.0.6`):
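As a rough illustration of the suffix stripping described above, here is a minimal bash sketch. The function name `tag_output` and the list of accepted suffixes are assumptions made for this example and are not taken from the EAGER code.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EAGER): strip a known input suffix
# and append a module tag plus the new extension.
tag_output() {
  local input_file=$1 tag=$2 ext=$3
  local base=$input_file
  local suffix
  # Assumed list of accepted suffixes; longer ones are checked first
  # so that ".fastq.gz" is removed whole rather than left as ".fastq".
  for suffix in .fastq.gz .fq.gz .fastq .fq .bam; do
    if [[ $base == *"$suffix" ]]; then
      base=${base%"$suffix"}
      break
    fi
  done
  printf '%s.%s.%s\n' "$base" "$tag" "$ext"
}

tag_output sample_R1.fastq.gz plG fq.gz   # prints sample_R1.plG.fq.gz
tag_output sample_R1.fq.gz plG fq.gz      # prints sample_R1.plG.fq.gz
tag_output sample.bwA.bam srt bam         # prints sample.bwA.srt.bam
```

Only the first step has to guess at the suffix; later steps always receive a file name the pipeline itself produced.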
| Tool | Variant | Proposed tag | Example |
|---|---|---|---|
| Input bam 2 fastq | NA | `sfq` | `.sfq.fq.gz` |
| FastP Poly-G | PolyG | `plG` | `.plG.fq.gz` |
| AdapterRemoval | trimmed & collapsed | `arB` | `.arB.fq.gz` |
| AdapterRemoval | collapsed only | `arC` | `.arC.fq.gz` |
| AdapterRemoval | trimmed only | `arT` | `.arT.fq.gz` |
| (AdapterRemoval)* | (merged only) | (`arM`) | `.arM.fq.gz` |
| bwa | aln | `bwA` | `.bwA.bam` |
| bwa | mem | `bwM` | `.bwM.bam` |
| circularMapper | bwa | `bwC` | `.bwC.bam` |
| samtools | sort | `srt` | `.srt.bam` |
| samtools | filter (mapped+unmapped) | `flB` | `.flB.bam` |
| samtools | filter (mapped only) | `flM` | `.flM.bam` |
| samtools | filter (unmapped only) | `flU` | `.flU.bam` |
| dedup | NA | `Ddp` | `.Ddp.bam` |
| markduplicates | NA | `Mdp` | `.Mdp.bam` |
| pmdtools | NA | `pmd` | `.pmd.bam` |
| bamutils | trimBam | `buT` | `.buT.bam` |
(* not implemented AFAIK but maybe should be?)
Rules:
- three character code (alphanumeric)
- tool abbreviations are lower case
- if there are multiple variants of a particular tool, the variants are indicated with capitals
- steps applied multiple times in the pipeline (e.g. samtools sort) do not need a 'tool' code if a verb is used.
- each three-character code is unique to its particular step; if multi-step commands are used, their codes must be combined with those of the previous processing steps (e.g. `bwA.srt.bam`, and later `bwA.srt.Ddp.srt.bam`).
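The first and last rules can be checked mechanically. The following is a minimal bash sketch, assuming hypothetical helper names `is_valid_tag` and `find_duplicate_tags`; the tag list is copied from the table above.

```shell
#!/usr/bin/env bash
# Hypothetical check: a tag must be exactly three alphanumeric characters.
is_valid_tag() {
  local pattern='^[A-Za-z0-9]{3}$'
  [[ $1 =~ $pattern ]]
}

# Hypothetical check: print any tag that occurs more than once.
# LC_ALL=C keeps the comparison case-sensitive (bwA and bwa differ).
find_duplicate_tags() {
  printf '%s\n' "$@" | LC_ALL=C sort | LC_ALL=C uniq -d
}

# Tags proposed in the table above.
tags=(sfq plG arB arC arT arM bwA bwM bwC srt flB flM flU Ddp Mdp pmd buT)

for tag in "${tags[@]}"; do
  is_valid_tag "$tag" || echo "invalid tag: $tag" >&2
done

duplicates=$(find_duplicate_tags "${tags[@]}")
if [[ -z $duplicates ]]; then
  echo "all tags unique"
else
  echo "duplicate tags: $duplicates" >&2
fi
```

The case conventions (lower case for tools, capitals for variants) are not checked here, since they depend on knowing which characters belong to the tool and which to the variant.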
For example, an input file with the name: ABM006.A0101_S0_L001_R1_000.bam would have the following code:
ABM006.A0101_S0_L001_R1_000.sfq.plG.arB.bwA.srt.flM.Ddp.srt.pmd.buT.bam [71 characters]
This indicates that the file has been converted to FASTQ, polyG trimmed, trimmed and collapsed with AdapterRemoval, mapped with bwa aln, sorted with samtools, filtered for mapped reads only, deduplicated with DeDup, sorted again with samtools, processed with PMDtools, and end-trimmed with bamUtils.
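To make the accumulation of tags concrete, here is a small bash sketch that rebuilds the example name step by step. The function name `chain_tags` is an assumption for illustration; in a real pipeline each module would append only its own tag to the name it receives.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip ".bam" from the input name, append each
# tag in the order the steps were run, then add the final extension.
chain_tags() {
  local name=${1%.bam}
  shift
  local tag
  for tag in "$@"; do
    name+=".$tag"
  done
  printf '%s.bam\n' "$name"
}

chain_tags ABM006.A0101_S0_L001_R1_000.bam \
  sfq plG arB bwA srt flM Ddp srt pmd buT
# prints ABM006.A0101_S0_L001_R1_000.sfq.plG.arB.bwA.srt.flM.Ddp.srt.pmd.buT.bam
```

Reading the tags from left to right gives the processing history of the file, which is the main benefit over a prefix that discards this information.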
Originally posted by @jfy133 in #178 (comment)