
Commit 103a746

Merge pull request #355 from maxibor/kraken
Adding Kraken2 metagenomics classifier
2 parents 9077aab + ace02e1 commit 103a746

10 files changed: 347 additions & 25 deletions

.github/workflows/ci.yml (3 additions & 0 deletions)

@@ -122,6 +122,9 @@ jobs:
       - name: MALTEXTRACT Basic with MALT plus MaltExtract
         run: |
           nextflow run ${GITHUB_WORKSPACE} "$TOWER" -name "$RUN_NAME-maltextract" -profile test,docker --paired_end --run_bam_filtering --bam_discard_unmapped --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt" --run_maltextract --maltextract_ncbifiles "/home/runner/work/eager/eager/databases/maltextract/" --maltextract_taxon_list 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/maltextract/MaltExtract_list.txt'
+      - name: METAGENOMIC Run the basic pipeline but with unmapped reads going into Kraken
+        run: |
+          nextflow run ${GITHUB_WORKSPACE} "$TOWER" -name "$RUN_NAME-kraken" -profile test_kraken,docker ${{ matrix.endedness }} --run_bam_filtering --bam_discard_unmapped --bam_unmapped_type 'fastq'
       - name: SEXDETERMINATION Run the basic pipeline with the bam input profile, but don't convert BAM, skip everything but sex determination
         run: |
           nextflow run ${GITHUB_WORKSPACE} "$TOWER" -name "$RUN_NAME-sexdeterrmine" -profile test_humanbam,docker --bam --skip_fastqc --skip_adapterremoval --skip_mapping --skip_deduplication --skip_qualimap --single_end --run_sexdeterrmine

CHANGELOG.md (1 addition & 0 deletions)

@@ -27,6 +27,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 * [#326](https://github.com/nf-core/eager/pull/326) - Add Biopython and [xopen](https://github.com/marcelm/xopen/) dependencies
 * [#336](https://github.com/nf-core/eager/issues/336) - Change default Y-axis maximum value of DamageProfiler to 30% to match popular (but slower) mapDamage, and allow user to set their own value.
 * [#352](https://github.com/nf-core/eager/pull/352) - Add social preview image
+* [#355](https://github.com/nf-core/eager/pull/355) - Add Kraken2 metagenomics classifier

 ### `Fixed`

README.md (3 additions & 1 deletion)

@@ -67,7 +67,8 @@ Additional functionality contained by the pipeline currently includes:

 #### Metagenomic Screening

 * Taxonomic binner with alignment (`MALT`)
-* aDNA characteristic screening of taxonomically binned data (`MaltExtract`)
+* Taxonomic binner without alignment (`Kraken2`)
+* aDNA characteristic screening of taxonomically binned data from MALT (`MaltExtract`)

 ## Quick Start

@@ -157,3 +158,4 @@ If you've contributed and you're missing in here, please let me know and I'll ad
 * Vågene, Å.J. et al., 2018. Salmonella enterica genomes from victims of a major sixteenth-century epidemic in Mexico. Nature ecology & evolution, 2(3), pp.520–528. Available at: [http://dx.doi.org/10.1038/s41559-017-0446-6](http://dx.doi.org/10.1038/s41559-017-0446-6).
 * Herbig, A. et al., 2016. MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman. bioRxiv, p.050559. Available at: [http://biorxiv.org/content/early/2016/04/27/050559](http://biorxiv.org/content/early/2016/04/27/050559).
 * **MaltExtract** Huebler, R. et al., 2019. HOPS: Automated detection and authentication of pathogen DNA in archaeological remains. bioRxiv, p.534198. Available at: [https://www.biorxiv.org/content/10.1101/534198v1?rss=1](https://www.biorxiv.org/content/10.1101/534198v1?rss=1). Download: [https://github.com/rhuebler/MaltExtract](https://github.com/rhuebler/MaltExtract)
+* **Kraken2** Wood, D.E. et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology, 20, Article 257. Available at: [https://doi.org/10.1186/s13059-019-1891-0](https://doi.org/10.1186/s13059-019-1891-0). Download: [https://ccb.jhu.edu/software/kraken2/](https://ccb.jhu.edu/software/kraken2/)

bin/kraken_parse.py (78 additions & 0 deletions)

@@ -0,0 +1,78 @@
+#!/usr/bin/env python
+
+
+import argparse
+import csv
+
+
+def _get_args():
+    '''This function parses and returns arguments passed in'''
+    parser = argparse.ArgumentParser(
+        prog='kraken_parse',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        description='Parsing kraken')
+    parser.add_argument('krakenReport', help="path to kraken report file")
+    parser.add_argument(
+        '-c',
+        dest="count",
+        default=50,
+        help="Minimum number of hits on clade to report it. Default = 50")
+    parser.add_argument(
+        '-o',
+        dest="output",
+        default=None,
+        help="Output file. Default = <basename>.kraken_parsed.csv")
+
+    args = parser.parse_args()
+
+    infile = args.krakenReport
+    countlim = int(args.count)
+    outfile = args.output
+
+    return(infile, countlim, outfile)
+
+
+def _get_basename(file_name):
+    if ("/") in file_name:
+        basename = file_name.split("/")[-1].split(".")[0]
+    else:
+        basename = file_name.split(".")[0]
+    return(basename)
+
+
+def parse_kraken(infile, countlim):
+    '''
+    INPUT:
+        infile (str): path to kraken report file
+        countlim (int): lowest count threshold to report hit
+    OUTPUT:
+        resdict (dict): key=taxid, value=readCount
+    '''
+    with open(infile, 'r') as f:
+        resdict = {}
+        csvreader = csv.reader(f, delimiter='\t')
+        for line in csvreader:
+            reads = int(line[1])
+            if reads >= countlim:
+                taxid = line[4]
+                resdict[taxid] = reads
+    return(resdict)
+
+
+def write_output(resdict, infile, outfile):
+    with open(outfile, 'w') as f:
+        basename = _get_basename(infile)
+        f.write(f"TAXID,{basename}\n")
+        for akey in resdict.keys():
+            f.write(f"{akey},{resdict[akey]}\n")
+
+
+if __name__ == '__main__':
+    INFILE, COUNTLIM, outfile = _get_args()
+
+    if not outfile:
+        outfile = _get_basename(INFILE) + ".kraken_parsed.csv"
+
+    tmp_dict = parse_kraken(infile=INFILE, countlim=COUNTLIM)
+    write_output(resdict=tmp_dict, infile=INFILE, outfile=outfile)
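The core of `kraken_parse.py` is a threshold filter over a Kraken2 report, a tab-separated file whose second column is the clade read count and whose fifth column is the NCBI taxid. A minimal, self-contained sketch of that filtering logic, using fabricated report lines (the taxa and counts below are invented for illustration):

```python
import csv
import io

# Fabricated Kraken2 report lines. Columns: percent, clade reads,
# direct reads, rank code, NCBI taxid, name. The first two clades pass
# the default threshold of 50 reads; the third (40 reads) is dropped.
report = (
    "90.00\t9000\t9000\tU\t0\tunclassified\n"
    "10.00\t1000\t0\tR\t1\troot\n"
    "0.40\t40\t0\tS\t9785\tElephas maximus\n"
)

def parse_kraken_text(text, countlim=50):
    """Return {taxid: clade_read_count} for clades with >= countlim reads."""
    resdict = {}
    for line in csv.reader(io.StringIO(text), delimiter='\t'):
        reads = int(line[1])  # clade read count (column 2)
        if reads >= countlim:
            resdict[line[4]] = reads  # NCBI taxid (column 5)
    return resdict

print(parse_kraken_text(report))
# → {'0': 9000, '1': 1000}
```

Lowering `countlim` (the script's `-c` flag) would also admit the 40-read clade.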

bin/merge_kraken_res.py (59 additions & 0 deletions)

@@ -0,0 +1,59 @@
+#!/usr/bin/env python
+
+import argparse
+import os
+import pandas as pd
+import numpy as np
+
+
+def _get_args():
+    '''This function parses and returns arguments passed in'''
+    parser = argparse.ArgumentParser(
+        prog='merge_kraken_res',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        description='Merging csv count files in one table')
+    parser.add_argument(
+        '-o',
+        dest="output",
+        default="kraken_count_table.csv",
+        help="Output file. Default = kraken_count_table.csv")
+
+    args = parser.parse_args()
+
+    outfile = args.output
+
+    return(outfile)
+
+
+def get_csv():
+    tmp = [i for i in os.listdir() if ".csv" in i]
+    return(tmp)
+
+
+def _get_basename(file_name):
+    if ("/") in file_name:
+        basename = file_name.split("/")[-1].split(".")[0]
+    else:
+        basename = file_name.split(".")[0]
+    return(basename)
+
+
+def merge_csv(all_csv):
+    df = pd.read_csv(all_csv[0], index_col=0)
+    for i in range(1, len(all_csv)):
+        df_tmp = pd.read_csv(all_csv[i], index_col=0)
+        df = pd.merge(left=df, right=df_tmp, on='TAXID', how='outer')
+    df.fillna(0, inplace=True)
+    return(df)
+
+
+def write_csv(pd_dataframe, outfile):
+    pd_dataframe.to_csv(outfile)
+
+
+if __name__ == "__main__":
+    OUTFILE = _get_args()
+    all_csv = get_csv()
+    resdf = merge_csv(all_csv)
+    write_csv(resdf, OUTFILE)
+    print(resdf)
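The merge step in `merge_kraken_res.py` relies on a pandas outer join: every taxid seen in any per-sample CSV ends up in the final table, and `fillna(0)` zeroes the counts for samples in which a taxon was absent. A small sketch of that behaviour with invented sample names and counts:

```python
import pandas as pd

# Two fabricated per-sample count tables, shaped like the output of
# kraken_parse.py: a TAXID column plus one count column per sample.
s1 = pd.DataFrame({"TAXID": [1, 9606], "sampleA": [100, 60]})
s2 = pd.DataFrame({"TAXID": [1, 9785], "sampleB": [80, 55]})

# Outer join keeps taxids present in either table; fillna(0) replaces
# the NaNs introduced for taxa missing from one of the samples.
merged = pd.merge(left=s1, right=s2, on="TAXID", how="outer").fillna(0)
print(merged)
```

Note that columns which received NaNs are upcast to float, so absent taxa appear as `0.0` rather than `0` in the merged table.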

conf/test_kraken.config (28 additions & 0 deletions)

@@ -0,0 +1,28 @@
+/*
+ * -------------------------------------------------
+ *  Nextflow config file for running tests
+ * -------------------------------------------------
+ * Defines bundled input files and everything required
+ * to run a fast and simple test. Use as follows:
+ *   nextflow run nf-core/eager -profile test_kraken,docker (or singularity, or conda)
+ */
+
+params {
+  config_profile_name = 'Test profile kraken'
+  config_profile_description = 'Minimal test dataset to check pipeline function with kraken metagenomic profiler'
+  // Limit resources so that this can run on Travis
+  max_cpus = 2
+  max_memory = 6.GB
+  max_time = 48.h
+  genome = false
+  // Input data
+  single_end = false
+  metagenomic_tool = 'kraken'
+  run_metagenomic_screening = true
+  readPaths = [
+    ['JK2782_TGGCCGATCAACGA_L008', ['https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz', 'https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz']],
+    ['JK2802_AGAATAACCTACCA_L008', ['https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz', 'https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz']],
+  ]
+  // Genome references
+  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta'
+  database = 'https://github.com/nf-core/test-datasets/raw/eager/databases/kraken/eager_test.tar.gz'
+}

docs/output.md (3 additions & 1 deletion)

@@ -485,6 +485,8 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir
 * `sex_determination/` this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the Sample Name, the Nr of Autosomal SNPs, Nr of SNPs on the X/Y chromosome, the Nr of reads mapping to the Autosomes, the Nr of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per bam. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.
 * `nuclear_contamination/` this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt` which is a summary table of the results for all individuals. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimate (with their respective standard errors) for both Method1 and Method2.
 * `bedtools/` this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`), contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position).
-* `metagenomic_classification/` This contains the output for a given metagenomic classifer (currently only for MALT). Malt will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additional a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonmic assignment etc.
+* `metagenomic_classification/` This contains the output for a given metagenomic classifier.
+  * MALT will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additionally, a `malt.log` file is provided which gives further information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc.
+  * Kraken will contain the Kraken output and report files, as well as a merged taxon count table.
 * `MaltExtract/` this will contain a `results` directory in which contains the output from MaltExtract - typically one folder for each filter type, an error and a log file. The characteristics of each node (e.g. damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA)
 * `consensus_sequence` this contains three FASTA files from VCF2Genome, of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAs (`_refmod.fasta.gz`) and (`_uncertainty.fasta.gz`) are IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively.
