Commit cf41572

Merge pull request #557 from nf-core/malt-sam-output
Malt sam output
2 parents 7e19c62 + eb84ad0 commit cf41572

7 files changed

Lines changed: 71 additions & 4 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -165,7 +165,7 @@ jobs:
 for i in index0.idx ref.db ref.idx ref.inf table0.db table0.idx taxonomy.idx taxonomy.map taxonomy.tre; do wget https://github.com/nf-core/test-datasets/raw/eager/databases/malt/"$i" -P databases/malt/; done
 - name: METAGENOMIC Run the basic pipeline but with unmapped reads going into MALT
 run: |
-nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_bam_filtering --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt/"
+nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_bam_filtering --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt/" --malt_sam_output
 - name: MALTEXTRACT Download resource files
 run: |
 mkdir -p databases/maltextract

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 * Updated template to nf-core/tools 1.10.2
 * [#544](https://github.com/nf-core/eager/pull/544) Add script to perform bam filtering on fragment length
 * [#456](https://github.com/nf-core/eager/pull/546) Bumps the base (default) runtime of all processes to 4 hours, and set shorter timelimits for test profiles (1 hour)
+* [#552](https://github.com/nf-core/eager/issues/552) Adds optional creation of MALT SAM files alongside RMA6 files.
 
 ### `Fixed`

docs/output.md

Lines changed: 46 additions & 1 deletion
@@ -10,20 +10,65 @@
 - [Secondary Output Directories](#secondary-output-directories)
 - [MultiQC Report](#multiqc-report)
 - [General Stats Table](#general-stats-table)
+- [Background](#background)
+- [Table](#table)
 - [FastQC](#fastqc)
+- [Background](#background-1)
+- [Sequence Counts](#sequence-counts)
+- [Sequence Quality Histograms](#sequence-quality-histograms)
+- [Per Sequence Quality Scores](#per-sequence-quality-scores)
+- [Per Base Sequencing Content](#per-base-sequencing-content)
+- [Per Sequence GC Content](#per-sequence-gc-content)
+- [Per Base N Content](#per-base-n-content)
+- [Sequence Duplication Levels](#sequence-duplication-levels)
+- [Overrepresented sequences](#overrepresented-sequences)
+- [Adapter Content](#adapter-content)
 - [FastP](#fastp)
+- [Background](#background-2)
+- [GC Content](#gc-content)
 - [AdapterRemoval](#adapterremoval)
+- [Background](#background-3)
+- [Retained and Discarded Reads Plot](#retained-and-discarded-reads-plot)
+- [Length Distribution Plot](#length-distribution-plot)
 - [Bowtie2](#bowtie2)
+- [Background](#background-4)
+- [Single/Paired-end alignments](#singlepaired-end-alignments)
 - [MALT](#malt)
+- [Background](#background-5)
+- [Metagenomic Mappability](#metagenomic-mappability)
+- [Taxonomic assignment success](#taxonomic-assignment-success)
 - [Kraken](#kraken)
+- [Background](#background-6)
+- [Top Taxa](#top-taxa)
 - [Samtools](#samtools)
+- [Background](#background-7)
+- [Flagstat Plot](#flagstat-plot)
 - [DeDup](#dedup)
+- [Background](#background-8)
+- [DeDup Plot](#dedup-plot)
 - [Picard](#picard)
+- [Background](#background-9)
+- [Mark Duplicates](#mark-duplicates)
 - [Preseq](#preseq)
+- [Background](#background-10)
+- [Complexity Curve](#complexity-curve)
 - [DamageProfiler](#damageprofiler)
+- [Background](#background-11)
+- [Misincorporation Plots](#misincorporation-plots)
+- [Length Distribution](#length-distribution)
 - [QualiMap](#qualimap)
+- [Background](#background-12)
+- [Coverage Histogram](#coverage-histogram)
+- [Cumulative Genome Coverage](#cumulative-genome-coverage)
+- [GC Content Distribution](#gc-content-distribution)
 - [Sex.DetERRmine](#sexdeterrmine)
+- [Background](#background-13)
+- [Relative Coverage](#relative-coverage)
+- [Read Counts](#read-counts)
 - [MultiVCFAnalyzer](#multivcfanalyzer)
+- [Background](#background-14)
+- [Summary metrics](#summary-metrics)
+- [Call statistics barplot](#call-statistics-barplot)
 - [Output Files](#output-files)
 
 ## Introduction
@@ -682,7 +727,7 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir
 - `nuclear_contamination/` this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt` which is a summary table of the results for all individual. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimate (with their respective standard errors) for both Method1 and Method2.
 - `bedtools/` this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`), contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position).
 - `metagenomic_classification/` This contains the output for a given metagenomic classifier.
-- Malt will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additional a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc.
+- Malt will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additional a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc. This will also include gzip SAM files if requested.
 - Kraken will contain the Kraken output and report files, as well as a merged Taxon count table.
 - `maltextract/` this will contain a `results` directory in which contains the output from MaltExtract - typically one folder for each filter type, an error and a log file. The characteristics of each node (e.g. damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA)
 - `consensus_sequence/` this contains three FASTA files from VCF2Genome, of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAS (`_refmod.fasta.gz`) and (`_uncertainity.fasta.gz`) are IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively.

docs/usage.md

Lines changed: 10 additions & 0 deletions
@@ -188,6 +188,7 @@
 - [`--malt_min_support_percent`](#--malt_min_support_percent)
 - [`--malt_max_queries`](#--malt_max_queries)
 - [`--malt_memory_mode`](#--malt_memory_mode)
+- [`--malt_sam_output`](#--malt_sam_output)
 - [Metagenomic Authentication](#metagenomic-authentication)
 - [`--run_maltextract`](#--run_maltextract)
 - [`--maltextract_taxon_list`](#--maltextract_taxon_list)
@@ -2021,6 +2022,15 @@ many remote file-systems such as GPFS. Default is `'load'`.
 
 Only when `--metagenomic_tool malt` is also supplied.
 
+#### `--malt_sam_output`
+
+Specify to _also_ produce gzipped SAM files of all alignments and un-aligned
+reads in addition to RMA6 files. These are **not** soft-clipped or in 'sparse'
+format. Can be useful for downstream analyses due to more common file format.
+
+> :warning: can result in very large run output directories as this is
+> essentially duplication of the RMA6 files.
+
 ### Metagenomic Authentication
 
 #### `--run_maltextract`
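
Reviewer-style note on the `--malt_sam_output` documentation above: since the stated motivation is downstream analysis in a more common format, a quick sanity check of one output file might look like the sketch below. The helper name `sam_summary` is hypothetical and not part of nf-core/eager. It assumes un-aligned reads carry the standard SAM unmapped flag bit (0x4), which this commit does not confirm for MALT, and it counts SAM records rather than unique reads.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nf-core/eager): summarise one gzipped SAM
# file such as those written when --malt_sam_output is set.
# Usage: sam_summary <file.sam.gz>
sam_summary() {
    # Decompress to stdout, skip header lines (starting with '@'), then read
    # column 2 (the SAM FLAG). Bit 0x4 set means the record is unmapped.
    # Assumption: MALT uses this standard flag for un-aligned reads.
    gzip -dc "$1" | awk -F'\t' '
        /^@/ { next }
        { if (int($2 / 4) % 2 == 1) unaligned++; else aligned++ }
        END { printf "aligned=%d unaligned=%d\n", aligned, unaligned }'
}
```

Calling `sam_summary sample.sam.gz` on a file from `metagenomic_classification/` would print one line of the form `aligned=N unaligned=M`; the file name here is a placeholder.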

main.nf

Lines changed: 6 additions & 2 deletions
@@ -206,6 +206,7 @@ def helpMessage() {
 --malt_min_support_percent [num] Specify the minimum percentage of reads a taxon of sample total is required to have to be retained for MALT. Default: Default: ${params.malt_min_support_percent}
 --malt_max_queries [num] Specify the maximium number of queries a read can have for MALT. Default: ${params.malt_max_queries}
 --malt_memory_mode [str] Specify the memory load method. Do not use 'map' with GPFS file systems for MALT as can be very slow. Options: 'load', 'page', 'map'. Default: '${params.malt_memory_mode}'
+--malt_sam_output [bool] Specify to also produce SAM alignment files. Note this includes both aligned and unaligned reads, and are gzipped. Note this will result in very large file sizes.
 
 Metagenomic Authentication
 --run_maltextract [bool] Turn on MaltExtract for MALT aDNA characteristics authentication
@@ -2887,22 +2888,25 @@ process malt {
 file db from ch_db_for_malt
 
 output:
-file "*.rma6" into ch_rma_for_maltExtract
-file "malt.log" into ch_malt_for_multiqc
+path("*.rma6") into ch_rma_for_maltExtract
+path("*.sam.gz") optional true
+path("malt.log") into ch_malt_for_multiqc
 
 script:
 if ( "${params.malt_min_support_mode}" == "percent" ) {
 min_supp = "-supp ${params.malt_min_support_percent}"
 } else if ( "${params.malt_min_support_mode}" == "reads" ) {
 min_supp = "-sup ${params.metagenomic_min_support_reads}"
 }
+def sam_out = params.malt_sam_output ? "-a . -f SAM" : ""
 """
 malt-run \
 -J-Xmx${task.memory.toGiga()}g \
 -t ${task.cpus} \
 -v \
 -o . \
 -d ${db} \
+${sam_out} \
 -id ${params.percent_identity} \
 -m ${params.malt_mode} \
 -at ${params.malt_alignment_mode} \
nextflow.config

Lines changed: 1 addition & 0 deletions
@@ -195,6 +195,7 @@ params {
 malt_min_support_percent = 0.01
 malt_max_queries = 100
 malt_memory_mode = 'load'
+malt_sam_output = false
 
 // maltextract - only including number
 // parameters if default documented or duplicate of MALT

nextflow_schema.json

Lines changed: 6 additions & 0 deletions
@@ -1250,6 +1250,12 @@
 "description": "Specify the memory load method. Do not use 'map' with GPFS file systems for MALT as can be very slow. Options: 'load', 'page', 'map'.",
 "fa_icon": "fas fa-memory",
 "help_text": "How to load the database into memory. Options are `'load'`, `'page'` or `'map'`. 'load' directly loads the entire database into memory prior seed look up, this is slow but compatible with all servers/file systems. `'page'` and `'map'` perform a sort of 'chunked' database loading, allow seed look up prior entire database loading. Note that Page and Map modes do not work properly not with many remote file-systems such as GPFS. Default is `'load'`.\n\nOnly when `--metagenomic_tool malt` is also supplied."
+},
+"malt_sam_output": {
+"type": "boolean",
+"description": "Specify to also produce SAM alignment files. Note this includes both aligned and unaligned reads, and are gzipped. Note this will result in very large file sizes.",
+"fa_icon": "fas fa-file-alt",
+"help_text": "Specify to _also_ produce gzipped SAM files of all alignments and un-aligned reads in addition to RMA6 files. These are **not** soft-clipped or in 'sparse' format. Can be useful for downstream analyses due to more common file format. :warning: can result in very large run output directories as this is essentially duplication of the RMA6 files."
 }
 },
 "fa_icon": "fas fa-search"
