
Commit 103a746

Merge pull request #355 from maxibor/kraken
Adding Kraken2 metagenomics classifier
2 parents 9077aab + ace02e1 commit 103a746

10 files changed: 347 additions & 25 deletions

.github/workflows/ci.yml (3 additions & 0 deletions)

@@ -122,6 +122,9 @@ jobs:
       - name: MALTEXTRACT Basic with MALT plus MaltExtract
         run: |
           nextflow run ${GITHUB_WORKSPACE} "$TOWER" -name "$RUN_NAME-maltextract" -profile test,docker --paired_end --run_bam_filtering --bam_discard_unmapped --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt" --run_maltextract --maltextract_ncbifiles "/home/runner/work/eager/eager/databases/maltextract/" --maltextract_taxon_list 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/maltextract/MaltExtract_list.txt'
+      - name: METAGENOMIC Run the basic pipeline but with unmapped reads going into Kraken
+        run: |
+          nextflow run ${GITHUB_WORKSPACE} "$TOWER" -name "$RUN_NAME-kraken" -profile test_kraken,docker ${{ matrix.endedness }} --run_bam_filtering --bam_discard_unmapped --bam_unmapped_type 'fastq'
       - name: SEXDETERMINATION Run the basic pipeline with the bam input profile, but don't convert BAM, skip everything but sex determination
         run: |
           nextflow run ${GITHUB_WORKSPACE} "$TOWER" -name "$RUN_NAME-sexdeterrmine" -profile test_humanbam,docker --bam --skip_fastqc --skip_adapterremoval --skip_mapping --skip_deduplication --skip_qualimap --single_end --run_sexdeterrmine

CHANGELOG.md (1 addition & 0 deletions)

@@ -27,6 +27,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 * [#326](https://github.com/nf-core/eager/pull/326) - Add Biopython and [xopen](https://github.com/marcelm/xopen/) dependencies
 * [#336](https://github.com/nf-core/eager/issues/336) - Change default Y-axis maximum value of DamageProfiler to 30% to match popular (but slower) mapDamage, and allow user to set their own value.
 * [#352](https://github.com/nf-core/eager/pull/352) - Add social preview image
+* [#355](https://github.com/nf-core/eager/pull/355) - Add Kraken2 metagenomics classifier

 ### `Fixed`

README.md (3 additions & 1 deletion)

@@ -67,7 +67,8 @@ Additional functionality contained by the pipeline currently includes:

 #### Metagenomic Screening

 * Taxonomic binner with alignment (`MALT`)
-* aDNA characteristic screening of taxonomically binned data (`MaltExtract`)
+* Taxonomic binner without alignment (`Kraken2`)
+* aDNA characteristic screening of taxonomically binned data from MALT (`MaltExtract`)

 ## Quick Start

@@ -157,3 +158,4 @@ If you've contributed and you're missing in here, please let me know and I'll ad
 * Vågene, Å.J. et al., 2018. Salmonella enterica genomes from victims of a major sixteenth-century epidemic in Mexico. Nature ecology & evolution, 2(3), pp.520–528. Available at: [http://dx.doi.org/10.1038/s41559-017-0446-6](http://dx.doi.org/10.1038/s41559-017-0446-6).
 * Herbig, A. et al., 2016. MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman. bioRxiv, p.050559. Available at: [http://biorxiv.org/content/early/2016/04/27/050559](http://biorxiv.org/content/early/2016/04/27/050559).
 * **MaltExtract** Huebler, R. et al., 2019. HOPS: Automated detection and authentication of pathogen DNA in archaeological remains. bioRxiv, p.534198. Available at: [https://www.biorxiv.org/content/10.1101/534198v1?rss=1](https://www.biorxiv.org/content/10.1101/534198v1?rss=1). Download: [https://github.com/rhuebler/MaltExtract](https://github.com/rhuebler/MaltExtract)
+* **Kraken2** Wood, D.E. et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology, 20, Article 257. Available at: [https://doi.org/10.1186/s13059-019-1891-0](https://doi.org/10.1186/s13059-019-1891-0). Download: [https://ccb.jhu.edu/software/kraken2/](https://ccb.jhu.edu/software/kraken2/)

bin/kraken_parse.py (78 additions & 0 deletions)

@@ -0,0 +1,78 @@
+#!/usr/bin/env python
+
+
+import argparse
+import csv
+
+
+def _get_args():
+    '''This function parses and returns arguments passed in'''
+    parser = argparse.ArgumentParser(
+        prog='kraken_parse',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        description='Parsing kraken')
+    parser.add_argument('krakenReport', help="path to kraken report file")
+    parser.add_argument(
+        '-c',
+        dest="count",
+        default=50,
+        help="Minimum number of hits on clade to report it. Default = 50")
+    parser.add_argument(
+        '-o',
+        dest="output",
+        default=None,
+        help="Output file. Default = <basename>.kraken_parsed.csv")
+
+    args = parser.parse_args()
+
+    infile = args.krakenReport
+    countlim = int(args.count)
+    outfile = args.output
+
+    return(infile, countlim, outfile)
+
+
+def _get_basename(file_name):
+    if ("/") in file_name:
+        basename = file_name.split("/")[-1].split(".")[0]
+    else:
+        basename = file_name.split(".")[0]
+    return(basename)
+
+
+def parse_kraken(infile, countlim):
+    '''
+    INPUT:
+        infile (str): path to kraken report file
+        countlim (int): lowest count threshold to report hit
+    OUTPUT:
+        resdict (dict): key=taxid, value=readCount
+    '''
+    with open(infile, 'r') as f:
+        resdict = {}
+        csvreader = csv.reader(f, delimiter='\t')
+        for line in csvreader:
+            reads = int(line[1])
+            if reads >= countlim:
+                taxid = line[4]
+                resdict[taxid] = reads
+    return(resdict)
+
+
+def write_output(resdict, infile, outfile):
+    with open(outfile, 'w') as f:
+        basename = _get_basename(infile)
+        f.write(f"TAXID,{basename}\n")
+        for akey in resdict.keys():
+            f.write(f"{akey},{resdict[akey]}\n")
+
+
+if __name__ == '__main__':
+    INFILE, COUNTLIM, outfile = _get_args()
+
+    if not outfile:
+        outfile = _get_basename(INFILE) + ".kraken_parsed.csv"
+
+    tmp_dict = parse_kraken(infile=INFILE, countlim=COUNTLIM)
+    write_output(resdict=tmp_dict, infile=INFILE, outfile=outfile)
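The core of `kraken_parse.py` is a threshold filter over a Kraken2 report, a tab-separated file whose second column is the clade read count and whose fifth column is the NCBI taxid. A minimal, self-contained sketch of that filtering logic, using fabricated report lines (the taxa and counts below are invented for illustration):

```python
import csv
import io

# Fabricated Kraken2 report lines. Columns: percent, clade reads,
# direct reads, rank code, NCBI taxid, name. The first two clades pass
# the default threshold of 50 reads; the third (40 reads) is dropped.
report = (
    "90.00\t9000\t9000\tU\t0\tunclassified\n"
    "10.00\t1000\t0\tR\t1\troot\n"
    "0.40\t40\t0\tS\t9785\tElephas maximus\n"
)

def parse_kraken_text(text, countlim=50):
    """Return {taxid: clade_read_count} for clades with >= countlim reads."""
    resdict = {}
    for line in csv.reader(io.StringIO(text), delimiter='\t'):
        reads = int(line[1])  # clade read count (column 2)
        if reads >= countlim:
            resdict[line[4]] = reads  # NCBI taxid (column 5)
    return resdict

print(parse_kraken_text(report))
# → {'0': 9000, '1': 1000}
```

Lowering `countlim` (the script's `-c` flag) would also admit the 40-read clade.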

bin/merge_kraken_res.py (59 additions & 0 deletions)

@@ -0,0 +1,59 @@
+#!/usr/bin/env python
+
+import argparse
+import os
+import pandas as pd
+import numpy as np
+
+
+def _get_args():
+    '''This function parses and returns arguments passed in'''
+    parser = argparse.ArgumentParser(
+        prog='merge_kraken_res',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        description='Merging csv count files in one table')
+    parser.add_argument(
+        '-o',
+        dest="output",
+        default="kraken_count_table.csv",
+        help="Output file. Default = kraken_count_table.csv")
+
+    args = parser.parse_args()
+
+    outfile = args.output
+
+    return(outfile)
+
+
+def get_csv():
+    tmp = [i for i in os.listdir() if ".csv" in i]
+    return(tmp)
+
+
+def _get_basename(file_name):
+    if ("/") in file_name:
+        basename = file_name.split("/")[-1].split(".")[0]
+    else:
+        basename = file_name.split(".")[0]
+    return(basename)
+
+
+def merge_csv(all_csv):
+    df = pd.read_csv(all_csv[0], index_col=0)
+    for i in range(1, len(all_csv)):
+        df_tmp = pd.read_csv(all_csv[i], index_col=0)
+        df = pd.merge(left=df, right=df_tmp, on='TAXID', how='outer')
+    df.fillna(0, inplace=True)
+    return(df)
+
+
+def write_csv(pd_dataframe, outfile):
+    pd_dataframe.to_csv(outfile)
+
+
+if __name__ == "__main__":
+    OUTFILE = _get_args()
+    all_csv = get_csv()
+    resdf = merge_csv(all_csv)
+    write_csv(resdf, OUTFILE)
+    print(resdf)
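The merge step in `merge_kraken_res.py` relies on a pandas outer join: every taxid seen in any per-sample CSV ends up in the final table, and `fillna(0)` zeroes the counts for samples in which a taxon was absent. A small sketch of that behaviour with invented sample names and counts:

```python
import pandas as pd

# Two fabricated per-sample count tables, shaped like the output of
# kraken_parse.py: a TAXID column plus one count column per sample.
s1 = pd.DataFrame({"TAXID": [1, 9606], "sampleA": [100, 60]})
s2 = pd.DataFrame({"TAXID": [1, 9785], "sampleB": [80, 55]})

# Outer join keeps taxids present in either table; fillna(0) replaces
# the NaNs introduced for taxa missing from one of the samples.
merged = pd.merge(left=s1, right=s2, on="TAXID", how="outer").fillna(0)
print(merged)
```

Note that columns which received NaNs are upcast to float, so absent taxa appear as `0.0` rather than `0` in the merged table.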

conf/test_kraken.config (28 additions & 0 deletions)

@@ -0,0 +1,28 @@
+/*
+ * -------------------------------------------------
+ *  Nextflow config file for running tests
+ * -------------------------------------------------
+ * Defines bundled input files and everything required
+ * to run a fast and simple test. Use as follows:
+ *   nextflow run nf-core/eager -profile test_kraken,docker (or singularity, or conda)
+ */
+
+params {
+  config_profile_name = 'Test profile kraken'
+  config_profile_description = 'Minimal test dataset to check pipeline function with kraken metagenomic profiler'
+  // Limit resources so that this can run on Travis
+  max_cpus = 2
+  max_memory = 6.GB
+  max_time = 48.h
+  genome = false
+  // Input data
+  single_end = false
+  metagenomic_tool = 'kraken'
+  run_metagenomic_screening = true
+  readPaths = [
+    ['JK2782_TGGCCGATCAACGA_L008', ['https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz', 'https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz']],
+    ['JK2802_AGAATAACCTACCA_L008', ['https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz', 'https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz']],
+  ]
+  // Genome references
+  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta'
+  database = 'https://github.com/nf-core/test-datasets/raw/eager/databases/kraken/eager_test.tar.gz'
+}

docs/output.md (3 additions & 1 deletion)

@@ -485,6 +485,8 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir
 * `sex_determination/` this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the Sample Name, the Nr of Autosomal SNPs, Nr of SNPs on the X/Y chromosome, the Nr of reads mapping to the Autosomes, the Nr of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per bam. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.
 * `nuclear_contamination/` this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt` which is a summary table of the results for all individuals. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimate (with their respective standard errors) for both Method1 and Method2.
 * `bedtools/` this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`), contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position).
-* `metagenomic_classification/` This contains the output for a given metagenomic classifer (currently only for MALT). Malt will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additional a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonmic assignment etc.
+* `metagenomic_classification/` This contains the output for a given metagenomic classifier.
+  * MALT will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additionally, a `malt.log` file is provided which gives further information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc.
+  * Kraken will contain the Kraken output and report files, as well as a merged taxon count table.
 * `MaltExtract/` this will contain a `results` directory in which contains the output from MaltExtract - typically one folder for each filter type, an error and a log file. The characteristics of each node (e.g. damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA)
 * `consensus_sequence` this contains three FASTA files from VCF2Genome, of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAs (`_refmod.fasta.gz`) and (`_uncertainty.fasta.gz`) are IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively.
