Skip to content

Improve BWA index workflow #596

@jfy133

Description

@jfy133

Currently we by default unzip the FASTA file

eager/main.nf

Lines 280 to 300 in f2a326d

if("${params.fasta}".endsWith(".gz")){
process unzip_reference{
tag "${zipped_fasta}"
input:
path zipped_fasta from file(params.fasta) // path doesn't like it if a string of an object is not prefaced with a root dir (/), so use file() to resolve string before parsing to `path`
output:
path "$unzip" into ch_fasta into ch_fasta_for_bwaindex,ch_fasta_for_bt2index,ch_fasta_for_faidx,ch_fasta_for_seqdict,ch_fasta_for_circulargenerator,ch_fasta_for_circularmapper,ch_fasta_for_damageprofiler,ch_fasta_for_qualimap,ch_fasta_for_pmdtools,ch_fasta_for_genotyping_ug,ch_fasta_for_genotyping_hc,ch_fasta_for_genotyping_freebayes,ch_fasta_for_genotyping_pileupcaller,ch_fasta_for_vcf2genome,ch_fasta_for_multivcfanalyzer,ch_fasta_for_genotyping_angsd
script:
unzip = zipped_fasta.toString() - '.gz'
"""
pigz -f -d -p ${task.cpus} $zipped_fasta
"""
}
} else {
fasta_for_indexing = Channel
.fromPath("${params.fasta}", checkIfExists: true)
.into{ ch_fasta_for_bwaindex; ch_fasta_for_bt2index; ch_fasta_for_faidx; ch_fasta_for_seqdict; ch_fasta_for_circulargenerator; ch_fasta_for_circularmapper; ch_fasta_for_damageprofiler; ch_fasta_for_qualimap; ch_fasta_for_pmdtools; ch_fasta_for_genotyping_ug; ch_fasta__for_genotyping_hc; ch_fasta_for_genotyping_hc; ch_fasta_for_genotyping_freebayes; ch_fasta_for_genotyping_pileupcaller; ch_fasta_for_vcf2genome; ch_fasta_for_multivcfanalyzer;ch_fasta_for_genotyping_angsd }
}

however, both bwa index and picard CreateSequenceDictinary does allow indexing of gzipped files. This can be confusing for people who have already BWA indexed their genomes while gzipped, and this then gets out of sync with the nf-core/eager bwa mapping command where it has unzipped the FASTA (so now doesn't have .gz) but then the actual supplied indicies do have the .gz suffix.

It would make more intuitive sense that we skip the whole unzipping thing unless we are running samtools faidx. However we need to check if this may affect downstream analysis.

In the meantime we should document this properly.

Originally reported by @marcel-keller

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions