
Commit 492b1aa

Merge pull request #641 from jfy133/metagenomic-complexity-filter

Add pre-metagenomic screening complexity filter

2 parents: 7f10eae + 8cca08d

10 files changed: 166 additions, 106 deletions


.github/workflows/ci.yml

Lines changed: 7 additions & 4 deletions
```diff
@@ -146,16 +146,16 @@ jobs:
           nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_pmdtools
       - name: GENOTYPING_UG AND MULTIVCFANALYZER Test running GATK UnifiedGenotyper and MultiVCFAnalyzer, additional VCFS
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer --additional_vcf_files 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/vcf/JK2772_CATCAGTGAGTAGA_L008_R1_001.fastq.gz.tengrand.fq.combined.fq.mapped_rmdup.bam.unifiedgenotyper.vcf.gz' --write_allele_frequencies
+          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer --additional_vcf_files 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/vcf/JK2772_CATCAGTGAGTAGA_L008_R1_001.fastq.gz.tengrand.fq.combined.fq.mapped_rmdup.bam.unifiedgenotyper.vcf.gz' --write_allele_frequencies
       - name: COMPLEX LANE/LIBRARY MERGING Test running lane and library merging prior to GATK UnifiedGenotyper and running MultiVCFAnalyzer
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --run_genotyping --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer
+          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --run_genotyping --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer
       - name: GENOTYPING_UG ON TRIMMED BAM Test
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --run_trim_bam --genotyping_source 'trimmed' --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP'
+          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --run_trim_bam --genotyping_source 'trimmed' --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP'
       - name: BAM_INPUT Run the basic pipeline with the bam input profile, skip AdapterRemoval as no convertBam
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --skip_adapterremoval
+          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --skip_adapterremoval
       - name: BAM_INPUT Run the basic pipeline with the bam input profile, convert to FASTQ for adapterremoval test and downstream
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --run_convertinputbam
@@ -167,6 +167,9 @@ jobs:
       - name: METAGENOMIC Run the basic pipeline but with unmapped reads going into MALT
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_bam_filtering --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt/" --malt_sam_output
+      - name: METAGENOMIC Run the basic pipeline but low-complexity filtered reads going into MALT
+        run: |
+          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_bam_filtering --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt/" --metagenomic_complexity_filter
       - name: MALTEXTRACT Download resource files
         run: |
           mkdir -p databases/maltextract
```

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -7,7 +7,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ### `Added`
 
-- [#583](https://github.com/nf-core/eager/issues/583) - mapDamage2 rescaling of BAM files to remove damage
+- [#640](https://github.com/nf-core/eager/issues/640) - Added a pre-metagenomic screening filtering of low-sequence complexity reads with `bbduk`
+- [#583](https://github.com/nf-core/eager/issues/583) - Added `mapDamage2` rescaling of BAM files to remove damage
 
 ### `Fixed`
```
README.md

Lines changed: 35 additions & 43 deletions
```diff
@@ -25,7 +25,39 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
 <img src="docs/images/output/overview/eager2_workflow.png" alt="nf-core/eager schematic workflow" width="70%">
 </p>
 
-## Pipeline steps
+## Quick Start
+
+1. Install [`nextflow`](https://nf-co.re/usage/installation) (version >= 20.04.0)
+
+2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/) or [`Podman`](https://podman.io/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_
+
+3. Download the pipeline and test it on a minimal dataset with a single command:
+
+   ```bash
+   nextflow run nf-core/eager -profile test,<docker/singularity/podman/conda/institute>
+   ```
+
+   > Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
+
+4. Start running your own analysis!
+
+   ```bash
+   nextflow run nf-core/eager -profile <docker/singularity/conda> --input '*_R{1,2}.fastq.gz' --fasta '<your_reference>.fasta'
+   ```
+
+5. Once your run has completed successfully, clean up the intermediate files.
+
+   ```bash
+   nextflow clean -f -k
+   ```
+
+See [usage docs](https://nf-co.re/eager/docs/usage.md) for all of the available options when running the pipeline.
+
+**N.B.** You can see an overview of the run in the MultiQC report located at `./results/MultiQC/multiqc_report.html`
+
+Modifications to the default pipeline are easily made using various options as described in the documentation.
+
+## Pipeline Summary
 
 ### Default Steps
@@ -77,6 +109,7 @@ Additional functionality contained by the pipeline currently includes:
 
 #### Metagenomic Screening
 
+* Low-sequence complexity filtering (`BBduk`)
 * Taxonomic binner with alignment (`MALT`)
 * Taxonomic binner without alignment (`Kraken2`)
 * aDNA characteristic screening of taxonomically binned data from MALT (`MaltExtract`)
@@ -89,48 +122,6 @@ A graphical overview of suggested routes through the pipeline depending on conte
 <img src="docs/images/output/overview/eager2_metromap_complex.png" alt="nf-core/eager metro map" width="70%">
 </p>
 
-## Quick Start
-
-1. Install [`nextflow`](https://nf-co.re/usage/installation) (version >= 20.04.0)
-
-2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/) or [`Podman`](https://podman.io/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_
-
-3. Download the pipeline and test it on a minimal dataset with a single command:
-
-   ```bash
-   nextflow run nf-core/eager -profile test,<docker/singularity/podman/conda/institute>
-   ```
-
-   > Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
-
-4. Start running your own analysis!
-
-   ```bash
-   nextflow run nf-core/eager -profile <docker/singularity/conda> --input '*_R{1,2}.fastq.gz' --fasta '<your_reference>.fasta'
-   ```
-
-5. Once your run has completed successfully, clean up the intermediate files.
-
-   ```bash
-   nextflow clean -f -k
-   ```
-
-See [usage docs](https://nf-co.re/eager/docs/usage.md) for all of the available options when running the pipeline.
-
-**N.B.** You can see an overview of the run in the MultiQC report located at `./results/MultiQC/multiqc_report.html`
-
-Modifications to the default pipeline are easily made using various options
-as described in the documentation.
-
-## Pipeline Summary
-
-By default, the pipeline currently performs the following:
-
-<!-- TODO nf-core: Fill in short bullet-pointed list of default steps of pipeline -->
-
-* Sequencing quality control (`FastQC`)
-* Overall pipeline run summaries (`MultiQC`)
-
 ## Documentation
 
 The nf-core/eager pipeline comes with documentation about the pipeline: [usage](https://nf-co.re/eager/usage) and [output](https://nf-co.re/eager/output).
@@ -237,6 +228,7 @@ In addition, references of tools and data used in this pipeline are as follows:
 * **sequenceTools** Stephan Schiffels (Unpublished). Download: [https://github.com/stschiff/sequenceTools](https://github.com/stschiff/sequenceTools)
 * **EigenstratDatabaseTools** Thiseas C. Lamnidis (Unpublished). Download: [https://github.com/TCLamnidis/EigenStratDatabaseTools.git](https://github.com/TCLamnidis/EigenStratDatabaseTools.git)
 * **mapDamage2** Jónsson, H., et al 2013. mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics, 29(13), 1682–1684. [https://doi.org/10.1093/bioinformatics/btt193](https://doi.org/10.1093/bioinformatics/btt193)
+* **BBduk** Brian Bushnell (Unpublished). Download: [https://sourceforge.net/projects/bbmap/](https://sourceforge.net/projects/bbmap/)
 
 ## Data References
```

assets/multiqc_config.yaml

Lines changed: 1 addition & 2 deletions
```diff
@@ -6,7 +6,6 @@ report_comment: >
     This report has been generated by the <a href="https://github.com/nf-core/eager" target="_blank">nf-core/eager</a>
     analysis pipeline. For information about how to interpret these results, please see the
     <a href="https://github.com/nf-core/eager" target="_blank">documentation</a>.
-
 run_modules:
     - adapterRemoval
     - bowtie2
@@ -270,4 +269,4 @@ report_section_order:
     nf-core-eager-summary:
         order: -1001
 
-export_plots: true
+export_plots: true
```

bin/scrape_software_versions.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -37,6 +37,7 @@
     'kraken':['v_kraken.txt', r"Kraken version (\S+)"],
     'eigenstrat_snp_coverage':['v_eigenstrat_snp_coverage.txt',r"(\S+)"],
     'mapDamage2':['v_mapdamage.txt',r"(\S+)"],
+    'bbduk':['v_bbduk.txt',r"(\S+)"]
 }
 
 results = OrderedDict()
@@ -73,6 +74,8 @@
 results['maltextract'] = '<span style="color:#999999;\">N/A</span>'
 results['eigenstrat_snp_coverage'] = '<span style="color:#999999;\">N/A</span>'
 results['mapDamage2'] = '<span style="color:#999999;\">N/A</span>'
+results['bbduk'] = '<span style="color:#999999;\">N/A</span>'
+
 
 # Search each file using its regex
 for k, v in regexes.items():
```
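The scraper drives everything from a tool → `[version file, regex]` map like the one extended above: each tool defaults to N/A, and the first capture group of the regex overwrites it when a match is found in that tool's version file. A minimal, self-contained sketch of that pattern (the file contents are simulated in a dict here; the real script reads files such as `v_bbduk.txt` from disk):

```python
import re
from collections import OrderedDict

# Tool -> [version file, regex]; the 'bbduk' entry mirrors the one added in this commit.
regexes = {
    'mapDamage2': ['v_mapdamage.txt', r"(\S+)"],
    'bbduk': ['v_bbduk.txt', r"(\S+)"],
}

# Simulated version-file contents instead of reading from disk.
fake_files = {
    'v_mapdamage.txt': "2.2.0",
    'v_bbduk.txt': "38.87",
}

# Default every tool to N/A, then overwrite when the regex matches.
results = OrderedDict((tool, 'N/A') for tool in regexes)
for tool, (fname, pattern) in regexes.items():
    match = re.search(pattern, fake_files.get(fname, ""))
    if match:
        results[tool] = match.group(1)

print(results['bbduk'])  # 38.87
```

Because a missing or empty version file simply fails to match, the N/A default survives, which is exactly why the `results['bbduk'] = ... N/A ...` line is added alongside the regex entry.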

docs/output.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -664,6 +664,7 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir
 - `sex_determination/` - this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the sample name, the number of autosomal SNPs, number of SNPs on the X/Y chromosome, the number of reads mapping to the autosomes, the number of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per file. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.
 - `nuclear_contamination/` - this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt`, which is a summary table of the results for all individuals. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimates (with their respective standard errors) for both Method1 and Method2.
 - `bedtools/` - this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`) contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is the mean depth coverage (i.e. average number of reads covering each position).
+- `metagenomic_complexity_filter` - this contains the output from the filtering of low-sequence-complexity reads, performed by `bbduk` on the reads going into metagenomic classification. This includes the filtered FASTQ files (`*_lowcomplexityremoved.fq.gz`) and the run-time log (`*_bbduk.stats`) for each sample. **Note:** there are no sections in the MultiQC report for this module, therefore you must check the `*_bbduk.stats` files to get summary statistics of the filtering.
 - `metagenomic_classification/` - this contains the output for a given metagenomic classifier.
   - Running MALT will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additionally, a `malt.log` file is provided which gives information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc. This will also include gzipped SAM files if requested.
   - Running kraken will contain the Kraken output and report files, as well as a merged Taxon count table.
```
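The new `metagenomic_complexity_filter` step uses `bbduk`'s entropy-based filtering to discard reads that are mostly repetitive sequence before they reach MALT or Kraken2. As a rough illustration of the underlying idea only (not the pipeline's actual implementation — `bbduk` scores sliding windows of k-mer entropy, and the `0.3` threshold here is illustrative), a read can be scored by the normalised Shannon entropy of its base composition:

```python
import math
from collections import Counter

def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy of the read's k-mer composition, normalised to [0, 1]."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_h = math.log2(min(4 ** k, total))  # maximum achievable entropy
    return h / max_h if max_h > 0 else 0.0

def keep_read(seq: str, threshold: float = 0.3) -> bool:
    """Keep a read only if its complexity score exceeds the threshold."""
    return shannon_entropy(seq) > threshold

print(keep_read("AAAAAAAAAAAAAAAAAAAA"))  # False: homopolymer, zero entropy
print(keep_read("ACGTACGGTTACGATCGCAT"))  # True: mixed-base read
```

Low-complexity reads (homopolymers, short tandem repeats) match many unrelated genomes by chance, so removing them up front reduces spurious taxonomic assignments in the classification step.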

environment.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -48,3 +48,5 @@ dependencies:
   - bioconda::bowtie2=2.4.1
   - bioconda::eigenstratdatabasetools=1.0.2
   - bioconda::mapdamage2=2.2.0
+  - bioconda::bbmap=38.87
+
```
