Advanced FileNaming #179

@apeltzer

After discussion with @apeltzer, the problem is indeed the use of prefix variables. Shortening file names consistently is very difficult because the delimiters used to split them differ between sequencing centres: some use _ to delimit sections of a file name, others use a dot (.).

As an alternative to the 'prefix' system, we propose a tagging system, where each module has a 'descriptive' and unique tag appended to the output BAM/FASTQ file (but with the input file's suffix stripped).

For example, each output file would be generated following something like this (using bash notation):

${input_file%%.bam}.xxX.bam

where xxX is the module name.

(Note that the first processing step on the 'raw' input file will need to account for both the fastq and fq spellings being accepted; everything downstream is named by the pipeline itself, so this is not an issue there.)
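The suffix handling described above can be sketched in bash roughly as follows. This is only an illustration: the function name tag_name and the list of accepted suffixes are assumptions for the example, not part of EAGER.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EAGER): strip a known suffix from the
# input name, then append a module tag and the output extension.
tag_name() {
  local input="$1" tag="$2" ext="$3"
  local base="$input"
  local suffix
  # Assumed list of accepted suffixes; the raw-input step has to cope with
  # both the .fastq and .fq spellings, compressed or not.
  for suffix in .fastq.gz .fq.gz .fastq .fq .bam; do
    if [[ "$base" == *"$suffix" ]]; then
      base="${base%"$suffix"}"
      break
    fi
  done
  printf '%s.%s.%s\n' "$base" "$tag" "$ext"
}

# BAM input converted to FASTQ, then poly-G trimmed.
tag_name ABM006.A0101_S0_L001_R1_000.bam sfq fq.gz
tag_name ABM006.A0101_S0_L001_R1_000.sfq.fq.gz plG fq.gz
```

Because each step strips only the trailing suffix, earlier tags stay in the name and the processing history accumulates from left to right.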

I propose the following tags, based on the current version of EAGER (-r 2.0.6):


| Tool              | Variant                   | Proposed tag | Example     |
|-------------------|---------------------------|--------------|-------------|
| Input bam 2 fastq | NA                        | sfq          | .sfq.fq.gz  |
| FastP Poly-G      | PolyG                     | plG          | .plG.fq.gz  |
| AdapterRemoval    | trimmed & collapsed       | arB          | .arB.fq.gz  |
| AdapterRemoval    | collapsed only            | arC          | .arC.fq.gz  |
| AdapterRemoval    | trimmed only              | arT          | .arT.fq.gz  |
| (AdapterRemoval)* | (merged only)             | (arM)        | .arM.fq.gz  |
| bwa               | aln                       | bwA          | .bwA.bam    |
| bwa               | mem                       | bwM          | .bwM.bam    |
| circularMapper    | bwa                       | bwC          | .bwC.bam    |
| samtools          | sort                      | srt          | .srt.bam    |
| samtools          | filter (mapped+unmapped)  | flB          | .flB.bam    |
| samtools          | filter (mapped only)      | flM          | .flM.bam    |
| samtools          | filter (unmapped only)    | flU          | .flU.bam    |
| dedup             | NA                        | Ddp          | .Ddp.bam    |
| markduplicates    | NA                        | Mdp          | .Mdp.bam    |
| pmdtools          | NA                        | pmd          | .pmd.bam    |
| bamutils          | trimBam                   | buT          | .buT.bam    |

(* not implemented AFAIK but maybe should be?)

Rules:

  • each tag is a three-character alphanumeric code
  • tool abbreviations are lower case
  • if a tool has multiple variants, the variant is indicated with a capital letter
  • steps applied multiple times in the pipeline (e.g. samtools sort) do not need a 'tool' code if a verb is used
  • each code is unique to its particular step; if multi-step commands are used, their tags must be combined with those of the previous processing step (e.g. bwA.srt.bam, and later bwA.srt.Ddp.srt.bam)
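Two of these rules (three alphanumeric characters, and uniqueness) are easy to check mechanically. A minimal bash sketch, assuming the tag list from the table above; the function names is_valid_tag and find_duplicate_tags are made up for this example:

```shell
#!/usr/bin/env bash
# Tags copied from the proposal table.
tags=(sfq plG arB arC arT arM bwA bwM bwC srt flB flM flU Ddp Mdp pmd buT)

# Rule: a tag is exactly three alphanumeric characters.
is_valid_tag() {
  [[ "$1" =~ ^[A-Za-z0-9]{3}$ ]]
}

# Rule: every tag is unique. Comparison is case-sensitive, so bwA and bwa
# would count as different tags. Prints each duplicated tag once.
find_duplicate_tags() {
  printf '%s\n' "$@" | LC_ALL=C sort | LC_ALL=C uniq -d
}

for t in "${tags[@]}"; do
  is_valid_tag "$t" || echo "invalid tag: $t"
done

dups="$(find_duplicate_tags "${tags[@]}")"
if [[ -n "$dups" ]]; then
  echo "duplicate tags: $dups"
fi
```

Note that the case-sensitive comparison is a choice made for this sketch; on a case-insensitive file system, tags differing only in case would still produce clashing file names.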

For example, an input file named ABM006.A0101_S0_L001_R1_000.bam would end up with the following file name:

ABM006.A0101_S0_L001_R1_000.sfq.plG.arB.bwA.srt.flM.Ddp.srt.pmd.buT.bam [71 characters]

This indicates that the file has been converted to FASTQ, poly-G trimmed, trimmed and collapsed with AdapterRemoval, mapped with bwa aln, sorted with samtools, filtered for mapped reads only, deduplicated with DeDup, sorted again with samtools, processed with PMDtools, and end-trimmed with bamUtils.
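Reading the history back out of such a name could look something like the bash sketch below. The functions describe_tag and decode_name are hypothetical, the descriptions are paraphrased from the table, and the sketch assumes that no part of the original sample name happens to coincide with a tag.

```shell
#!/usr/bin/env bash
# Hypothetical lookup from tag to processing step (wording from the table).
describe_tag() {
  case "$1" in
    sfq) echo "BAM converted to FASTQ" ;;
    plG) echo "FastP poly-G trimming" ;;
    arB) echo "AdapterRemoval, trimmed and collapsed" ;;
    arC) echo "AdapterRemoval, collapsed only" ;;
    arT) echo "AdapterRemoval, trimmed only" ;;
    arM) echo "AdapterRemoval, merged only" ;;
    bwA) echo "bwa aln" ;;
    bwM) echo "bwa mem" ;;
    bwC) echo "circularMapper (bwa)" ;;
    srt) echo "samtools sort" ;;
    flB) echo "samtools filter, mapped and unmapped" ;;
    flM) echo "samtools filter, mapped only" ;;
    flU) echo "samtools filter, unmapped only" ;;
    Ddp) echo "dedup" ;;
    Mdp) echo "markduplicates" ;;
    pmd) echo "pmdtools" ;;
    buT) echo "bamutils trimBam" ;;
    *)   return 1 ;;
  esac
}

# Split a file name on dots and print every recognised tag, in order.
# Fields that are not known tags (e.g. the sample name) are skipped.
decode_name() {
  local name="${1%.bam}"
  local IFS=.
  local -a parts
  read -r -a parts <<< "$name"
  local part desc
  for part in "${parts[@]}"; do
    if desc="$(describe_tag "$part")"; then
      printf '%s: %s\n' "$part" "$desc"
    fi
  done
}

decode_name ABM006.A0101_S0_L001_R1_000.sfq.plG.arB.bwA.srt.flM.Ddp.srt.pmd.buT.bam
```

Splitting on dots only works because the tags are dot-delimited regardless of which delimiter the sequencing centre used inside the sample name, which is the main point of the tagging scheme.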

Originally posted by @jfy133 in #178 (comment)
