Merge pull request #557 from nf-core/malt-sam-output

jfy133 · web-flow · commit cf41572c0b24 · 2020-09-25T15:20:34.000+02:00
Malt sam output
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -165,7 +165,7 @@ jobs:
           for i in index0.idx ref.db ref.idx ref.inf table0.db table0.idx taxonomy.idx taxonomy.map taxonomy.tre; do wget https://github.com/nf-core/test-datasets/raw/eager/databases/malt/"$i" -P databases/malt/; done
       - name: METAGENOMIC Run the basic pipeline but with unmapped reads going into MALT
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_bam_filtering  --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt/"
+          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_bam_filtering  --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt/" --malt_sam_output
       - name: MALTEXTRACT Download resource files
         run: |
             mkdir -p databases/maltextract
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -34,6 +34,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 * Updated template to nf-core/tools 1.10.2
 * [#544](https://github.com/nf-core/eager/pull/544) Add script to perform bam filtering on fragment length
 * [#456](https://github.com/nf-core/eager/pull/546) Bumps the base (default) runtime of all processes to 4 hours, and set shorter timelimits for test profiles (1 hour)
+* [#552](https://github.com/nf-core/eager/issues/552) Adds optional creation of MALT SAM files alongside RMA6 files.
 
 ### `Fixed`
 
diff --git a/docs/output.md b/docs/output.md
@@ -10,20 +10,65 @@
     - [Secondary Output Directories](#secondary-output-directories)
   - [MultiQC Report](#multiqc-report)
     - [General Stats Table](#general-stats-table)
+      - [Background](#background)
+      - [Table](#table)
     - [FastQC](#fastqc)
+      - [Background](#background-1)
+      - [Sequence Counts](#sequence-counts)
+      - [Sequence Quality Histograms](#sequence-quality-histograms)
+      - [Per Sequence Quality Scores](#per-sequence-quality-scores)
+      - [Per Base Sequencing Content](#per-base-sequencing-content)
+      - [Per Sequence GC Content](#per-sequence-gc-content)
+      - [Per Base N Content](#per-base-n-content)
+      - [Sequence Duplication Levels](#sequence-duplication-levels)
+      - [Overrepresented sequences](#overrepresented-sequences)
+      - [Adapter Content](#adapter-content)
     - [FastP](#fastp)
+      - [Background](#background-2)
+      - [GC Content](#gc-content)
     - [AdapterRemoval](#adapterremoval)
+      - [Background](#background-3)
+      - [Retained and Discarded Reads Plot](#retained-and-discarded-reads-plot)
+      - [Length Distribution Plot](#length-distribution-plot)
     - [Bowtie2](#bowtie2)
+      - [Background](#background-4)
+      - [Single/Paired-end alignments](#singlepaired-end-alignments)
     - [MALT](#malt)
+      - [Background](#background-5)
+      - [Metagenomic Mappability](#metagenomic-mappability)
+      - [Taxonomic assignment success](#taxonomic-assignment-success)
     - [Kraken](#kraken)
+      - [Background](#background-6)
+      - [Top Taxa](#top-taxa)
     - [Samtools](#samtools)
+      - [Background](#background-7)
+      - [Flagstat Plot](#flagstat-plot)
     - [DeDup](#dedup)
+      - [Background](#background-8)
+      - [DeDup Plot](#dedup-plot)
     - [Picard](#picard)
+      - [Background](#background-9)
+      - [Mark Duplicates](#mark-duplicates)
     - [Preseq](#preseq)
+      - [Background](#background-10)
+      - [Complexity Curve](#complexity-curve)
     - [DamageProfiler](#damageprofiler)
+      - [Background](#background-11)
+      - [Misincorporation Plots](#misincorporation-plots)
+      - [Length Distribution](#length-distribution)
     - [QualiMap](#qualimap)
+      - [Background](#background-12)
+      - [Coverage Histogram](#coverage-histogram)
+      - [Cumulative Genome Coverage](#cumulative-genome-coverage)
+      - [GC Content Distribution](#gc-content-distribution)
     - [Sex.DetERRmine](#sexdeterrmine)
+      - [Background](#background-13)
+      - [Relative Coverage](#relative-coverage)
+      - [Read Counts](#read-counts)
     - [MultiVCFAnalyzer](#multivcfanalyzer)
+      - [Background](#background-14)
+      - [Summary metrics](#summary-metrics)
+      - [Call statistics barplot](#call-statistics-barplot)
   - [Output Files](#output-files)
 
 ## Introduction
@@ -682,7 +727,7 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir
 - `nuclear_contamination/` this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt` which is a summary table of the results for all individual. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimate (with their respective standard errors) for both Method1 and Method2.
 - `bedtools/` this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`), contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position).
 - `metagenomic_classification/` This contains the output for a given metagenomic classifier.
-  - Malt will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additional a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc.
+  - Malt will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additionally, a `malt.log` file is provided, which gives further information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc. This directory will also include gzipped SAM files if requested with `--malt_sam_output`.
   - Kraken will contain the Kraken output and report files, as well as a merged Taxon count table.
 - `maltextract/` this will contain a `results` directory in which contains the output from MaltExtract - typically one folder for each filter type, an error and a log file. The characteristics of each node (e.g. damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA)
 - `consensus_sequence/` this contains three FASTA files from VCF2Genome, of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAS (`_refmod.fasta.gz`) and (`_uncertainity.fasta.gz`) are IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively.
diff --git a/docs/usage.md b/docs/usage.md
@@ -188,6 +188,7 @@
       - [`--malt_min_support_percent`](#--malt_min_support_percent)
       - [`--malt_max_queries`](#--malt_max_queries)
       - [`--malt_memory_mode`](#--malt_memory_mode)
+      - [`--malt_sam_output`](#--malt_sam_output)
     - [Metagenomic Authentication](#metagenomic-authentication)
       - [`--run_maltextract`](#--run_maltextract)
       - [`--maltextract_taxon_list`](#--maltextract_taxon_list)
@@ -2021,6 +2022,15 @@ many remote file-systems such as GPFS. Default is `'load'`.
 
 Only when `--metagenomic_tool malt` is also supplied.
 
+#### `--malt_sam_output`
+
+Specify to _also_ produce gzipped SAM files of all alignments and un-aligned
+reads in addition to RMA6 files. These are **not** soft-clipped or in 'sparse'
+format. Can be useful for downstream analyses, as SAM is a more common format.
+
+> :warning: This can result in very large run output directories, as it
+> essentially duplicates the RMA6 files.
+
 ### Metagenomic Authentication
 
 #### `--run_maltextract`
diff --git a/main.nf b/main.nf
@@ -206,6 +206,7 @@ def helpMessage() {
       --malt_min_support_percent [num]       Specify the minimum percentage of reads a taxon of sample total is required to have to be retained for MALT. Default: Default: ${params.malt_min_support_percent}
       --malt_max_queries [num]               Specify the maximium number of queries a read can have for MALT. Default: ${params.malt_max_queries}
       --malt_memory_mode [str]               Specify the memory load method. Do not use 'map' with GPFS file systems for MALT as can be very slow. Options: 'load', 'page', 'map'. Default: '${params.malt_memory_mode}'
+      --malt_sam_output [bool]               Specify to also produce gzipped SAM alignment files. Note these include both aligned and unaligned reads, and can result in very large file sizes.
 
     Metagenomic Authentication
       --run_maltextract [bool]                  Turn on MaltExtract for MALT aDNA characteristics authentication
@@ -2887,22 +2888,25 @@ process malt {
   file db from ch_db_for_malt
 
   output:
-  file "*.rma6" into ch_rma_for_maltExtract
-  file "malt.log" into ch_malt_for_multiqc
+  path("*.rma6") into ch_rma_for_maltExtract
+  path("*.sam.gz") optional true
+  path("malt.log") into ch_malt_for_multiqc
 
   script:
   if ( "${params.malt_min_support_mode}" == "percent" ) {
     min_supp = "-supp ${params.malt_min_support_percent}" 
   } else if ( "${params.malt_min_support_mode}" == "reads" ) {
     min_supp = "-sup ${params.metagenomic_min_support_reads}"
   }
+  def sam_out = params.malt_sam_output ? "-a . -f SAM" : ""
   """
   malt-run \
   -J-Xmx${task.memory.toGiga()}g \
   -t ${task.cpus} \
   -v \
   -o . \
   -d ${db} \
+  ${sam_out} \
   -id ${params.percent_identity} \
   -m ${params.malt_mode} \
   -at ${params.malt_alignment_mode} \
diff --git a/nextflow.config b/nextflow.config
@@ -195,6 +195,7 @@ params {
   malt_min_support_percent = 0.01
   malt_max_queries = 100
   malt_memory_mode = 'load'
+  malt_sam_output = false
 
   // maltextract - only including number 
   // parameters if default documented or duplicate of MALT
diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -1250,6 +1250,12 @@
                     "description": "Specify the memory load method. Do not use 'map' with GPFS file systems for MALT as can be very slow. Options: 'load', 'page', 'map'.",
                     "fa_icon": "fas fa-memory",
                     "help_text": "How to load the database into memory. Options are `'load'`, `'page'` or `'map'`. 'load' directly loads the entire database into memory prior seed look up, this is slow but compatible with all servers/file systems. `'page'` and `'map'` perform a sort of 'chunked' database loading, allow seed look up prior entire database loading. Note that Page and Map modes do not work properly not with many remote file-systems such as GPFS. Default is `'load'`.\n\nOnly when `--metagenomic_tool malt` is also supplied."
+                },
+                "malt_sam_output": {
+                    "type": "boolean",
+                    "description": "Specify to also produce gzipped SAM alignment files. Note these include both aligned and unaligned reads, and can result in very large file sizes.",
+                    "fa_icon": "fas fa-file-alt",
+                    "help_text": "Specify to _also_ produce gzipped SAM files of all alignments and un-aligned reads in addition to RMA6 files. These are **not** soft-clipped or in 'sparse' format. Can be useful for downstream analyses, as SAM is a more common format. :warning: This can result in very large run output directories, as it essentially duplicates the RMA6 files."
                 }
             },
             "fa_icon": "fas fa-search"