Skip to content

Commit aff33ad

Browse files
authored
Merge pull request #565 from nf-core/dedup
Bump DeDup for stats fix and DamageProfiler to latest version Add eigenstrat snp coverage statistics.
2 parents 0891761 + 2d289df commit aff33ad

8 files changed

Lines changed: 77 additions & 6 deletions

File tree

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
3535
* [#544](https://github.com/nf-core/eager/pull/544) Add script to perform bam filtering on fragment length
3636
* [#456](https://github.com/nf-core/eager/pull/546) Bumps the base (default) runtime of all processes to 4 hours, and set shorter timelimits for test profiles (1 hour)
3737
* [#552](https://github.com/nf-core/eager/issues/552) Adds optional creation of MALT SAM files alongside RMA6 files.
38+
* Added eigenstrat snp coverage statistics to MultiQC report. Process results are published in `genotyping/*_eigenstrat_coverage.txt`.
3839

3940
### `Fixed`
4041

@@ -67,6 +68,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
6768

6869
* Added Sequencetools (1.4.0.6) that adds the ability to do genotyping with the 'pileupCaller'
6970
* Latest version of DeDup (0.12.6) which now reports mapped reads after deduplication
71+
* [#560] Latest version of Dedup (0.12.7), which now correctly reports deduplication statistics based on calculations of mapped reads only (prior denominator was total reads of BAM file)
7072
* Latest version of ANGSD (0.933) which doesn't seg fault when running contamination on BAMs with insufficient reads
7173
* Latest version of MultiQC (1.9) with support for lots of extra tools in the pipeline (MALT, SexDetERRmine, DamageProfiler, MultiVCFAnalyzer)
7274
* Latest versions of Pygments (7.1), Pymdown-Extensions (2.6.1) and Markdown (3.2.2) for documentation output

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -195,6 +195,7 @@ If you've contributed and you're missing in here, please let us know and we will
195195
* **endorS.py** Aida Andrades Valtueña (Unpublished). Download: [https://github.com/aidaanva/endorS.py](https://github.com/aidaanva/endorS.py)
196196
* **Bowtie2** Langmead, B. and Salzberg, S. L. 2012 Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), p. 357–359. doi: [10.1038/nmeth.1923](https:/dx.doi.org/10.1038/nmeth.1923).
197197
* **sequenceTools** Stephan Schiffels (Unpublished). Download: [https://github.com/stschiff/sequenceTools](https://github.com/stschiff/sequenceTools)
198+
* **EigenstratDatabaseTools** Thiseas C. Lamnidis (Unpublished). Download: [https://github.com/TCLamnidis/EigenStratDatabaseTools.git](https://github.com/TCLamnidis/EigenStratDatabaseTools.git)
198199
199200
## Data References
200201

assets/multiqc_config.yaml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -177,6 +177,9 @@ table_columns_visible:
177177
Method2_MOM_SE: False
178178
Method2_ML_estimate: False
179179
Method2_ML_SE: False
180+
snp_coverage:
181+
Covered_Snps: True
182+
Total_Snps: False
180183

181184
table_columns_placement:
182185
FastQC (pre-AdapterRemoval):
@@ -220,6 +223,9 @@ table_columns_placement:
220223
Method2_MOM_SE: 1160
221224
Method2_ML_estimate: 1170
222225
Method2_ML_SE: 1180
226+
snp_coverage:
227+
Covered_Snps: 1050
228+
Total_Snps: 1060
223229
DeDup:
224230
mapped_after_dedup: 620
225231
clusterfactor: 630

bin/parse_snp_cov.py

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
#!/usr/bin/env python3
2+
import sys, json
3+
from collections import OrderedDict
4+
5+
jsonOut = OrderedDict()
6+
data = OrderedDict()
7+
8+
9+
input = open(sys.argv[1], 'r')
10+
for line in input:
11+
fields = line.strip().split()
12+
sample_id = fields[0]
13+
covered_snps = fields[1]
14+
total_snps = fields[2]
15+
if sample_id[0] == "#":
16+
continue
17+
18+
data[sample_id] = {"Covered_Snps":covered_snps, "Total_Snps":total_snps}
19+
20+
jsonOut = {"plot_type": "generalstats", "id": "snp_coverage",
21+
"pconfig": {
22+
"Covered_Snps" : {"title" : "#SNPs Covered"},
23+
"Total_Snps" : {"title": "#SNPs Total"}
24+
},
25+
"data" : data
26+
}
27+
28+
with open(sys.argv[1].rstrip('.txt')+'_mqc.json', 'w') as outfile:
29+
json.dump(jsonOut, outfile)

bin/scrape_software_versions.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,8 @@
3434
'MTNucRatioCalculator':['v_mtnucratiocalculator.txt',r"Version: (\S+)"],
3535
'VCF2genome':['v_vcf2genome.txt', r"VCF2Genome \(v. ([0-9].[0-9]+) "],
3636
'endorS.py':['v_endorSpy.txt', r"endorS.py (\S+)"],
37-
'kraken':['v_kraken.txt', r"Kraken version (\S+)"]
37+
'kraken':['v_kraken.txt', r"Kraken version (\S+)"],
38+
'eigenstrat_snp_coverage':['v_eigenstrat_snp_coverage.txt',r"(\S+)"]
3839
}
3940

4041
results = OrderedDict()
@@ -69,6 +70,7 @@
6970
results['malt'] = '<span style="color:#999999;\">N/A</span>'
7071
results['kraken'] = '<span style="color:#999999;\">N/A</span>'
7172
results['maltextract'] = '<span style="color:#999999;\">N/A</span>'
73+
results['eigenstrat_snp_coverage'] = '<span style="color:#999999;\">N/A</span>'
7274

7375
# Search each file using its regex
7476
for k, v in regexes.items():

docs/output.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -160,6 +160,8 @@ The possible columns displayed by default are as follows:
160160
- **MT to Nuclear Ratio** This from MTtoNucRatio. This reports the number of reads aligned to a mitochondrial entry in your reference FASTA to all other entries. This will typically be high but will vary depending on tissue type.
161161
- **XRate** This is from Sex.DetERRmine. This is the relative depth of coverage on the X-chromosome.
162162
- **YRate** This is from Sex.DetERRmine. This is the relative depth of coverage on the Y-chromosome.
163+
- **#SNPs Covered** This is from eigenstrat\_snp\_coverage. The number of called SNPs after genotyping with pileupcaller.
164+
- **#SNPs Total** This is from eigenstrat\_snp\_coverage. The maximum number of covered SNPs, i.e. the number of SNPs in the .snp file provided to pileupcaller with `--pileupcaller_snpfile`.
163165
- **Number of SNPs** This is from ANGSD. The number of SNPs left after removing sites with no data in a 5 base pair surrounding region.
164166
- **Contamination Estimate (Method1_ML)** This is from the nuclear contamination function of ANGSD. The Maximum Likelihood contamination estimate according to Method 1. The estimates using Method of Moments and/or those based on Method 2 can be unhidden through the "Configure Columns" button.
165167
- **Estimate Error (Method1_ML)** This is from ANGSD. The standard error of the Method1 Maximum likelihood estimate. The errors associated with Method of Moments and/or Method2 estimates can be unhidden through the "Configure Columns" button.
@@ -721,7 +723,7 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir
721723
- `damageprofiler/` - this contains sample specific directories containing raw statistics and damage plots from DamageProfiler. The `.pdf` files can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files.
722724
- `pmdtools/` this contains raw output statistics of pmdtools (estimates of frequencies of substitutions), and BAM files which have been filtered to remove reads that do not have a Post-mortem damage (PMD) score of `--pmdtools_threshold`.
723725
- `trimmed_bam/` this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_half_udg_left`, `--bamutils_clip_half_udg_right`, `--bamutils_clip_none_udg_left`, and `--bamutils_clip_none_udg_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read).
724-
- `genotyping/` this contains all the (gzipped) genotyping files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have files corresponding to each of your deduplicated BAM files (except pileupcaller), or any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). If `--gatk_ug_keep_realign_bam` supplied, this may also contain BAM files from InDel realignment when using GATK 3 and UnifiedGenotyping for variant calling.
726+
- `genotyping/` this contains all the (gzipped) genotyping files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have files corresponding to each of your deduplicated BAM files (except pileupcaller), or any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). If `--gatk_ug_keep_realign_bam` supplied, this may also contain BAM files from InDel realignment when using GATK 3 and UnifiedGenotyping for variant calling. When pileupcaller is used to create eigenstrat genotypes, this directory also contains eigenstrat SNP coverage statistics.
725727
- `multivcfanalyzer/` this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files.
726728
- `sex_determination/` this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the Sample Name, the Nr of Autosomal SNPs, Nr of SNPs on the X/Y chromosome, the Nr of reads mapping to the Autosomes, the Nr of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per bam. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.
727729
- `nuclear_contamination/` this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt` which is a summary table of the results for all individual. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimate (with their respective standard errors) for both Method1 and Method2.

environment.yml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ dependencies:
1616
- bioconda::bwa=0.7.17
1717
- bioconda::picard=2.22.9
1818
- bioconda::samtools=1.9
19-
- bioconda::dedup=0.12.6
19+
- bioconda::dedup=0.12.7
2020
- bioconda::angsd=0.933
2121
- bioconda::circularmapper=1.93.5
2222
- bioconda::gatk4=4.1.7.0
@@ -33,7 +33,7 @@ dependencies:
3333
- bioconda::fastp=0.20.1
3434
- bioconda::bamutil=1.0.14
3535
- bioconda::mtnucratio=0.7
36-
- pysam=0.15.4 #Says python3.7 or less
36+
- bioconda::pysam=0.15.4 #Says python3.7 or less
3737
- bioconda::kraken2=2.0.9beta
3838
- conda-forge::pandas=1.0.4 #.4 is python3.8+ compatible
3939
- bioconda::freebayes=1.3.2 #should be fine with python 3.8, but says <3.7 on webpage
@@ -43,4 +43,5 @@ dependencies:
4343
- conda-forge::biopython=1.76
4444
- conda-forge::xopen=0.9.0
4545
- bioconda::bowtie2=2.4.1
46+
- bioconda::eigenstratdatabasetools=1.0.2
4647
#Missing Schmutzi,snpAD

main.nf

Lines changed: 30 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2603,7 +2603,7 @@ if (params.pileupcaller_snpfile.isEmpty ()) {
26032603
path(snp) from ch_snp_for_pileupcaller.collect().dump(tag: "Pileupcaller SNP file")
26042604

26052605
output:
2606-
tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("pileupcaller.${strandedness}.*")
2606+
tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("pileupcaller.${strandedness}.*") into ch_for_eigenstrat_snp_coverage
26072607

26082608
script:
26092609
def use_bed = bed.getName() != 'nf-core_eager_dummy.txt' ? "-l ${bed}" : ''
@@ -2618,7 +2618,33 @@ if (params.pileupcaller_snpfile.isEmpty ()) {
26182618
samtools mpileup -B -q 30 -Q 30 ${use_bed} -f ${fasta} ${bam_list} | pileupCaller ${caller} ${ssmode} ${transitions_mode} --sampleNames ${sample_names} ${use_snp} -e pileupcaller.${strandedness}
26192619
"""
26202620
}
2621-
2621+
2622+
process eigenstrat_snp_coverage {
2623+
label 'mc_tiny'
2624+
tag "${strandedness}"
2625+
publishDir "${params.outdir}/genotyping", mode: params.publish_dir_mode
2626+
2627+
when:
2628+
params.run_genotyping && params.genotyping_tool == 'pileupcaller'
2629+
2630+
input:
2631+
tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*") from ch_for_eigenstrat_snp_coverage.dump()
2632+
2633+
output:
2634+
tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*.json") into ch_eigenstrat_snp_cov_for_multiqc
2635+
path("*_eigenstrat_coverage.txt")
2636+
2637+
script:
2638+
// The following code block can be swapped in once the eigenstratdatabasetools MultiQC module becomes available.
2639+
/* """
2640+
eigenstrat_snp_coverage -i pileupcaller.${strandedness} -s ".txt" >${strandedness}_eigenstrat_coverage.txt -j ${strandedness}_eigenstrat_coverage_mqc.json
2641+
"""*/
2642+
"""
2643+
eigenstrat_snp_coverage -i pileupcaller.${strandedness} -s ".txt" >${strandedness}_eigenstrat_coverage.txt
2644+
parse_snp_cov.py ${strandedness}_eigenstrat_coverage.txt
2645+
"""
2646+
}
2647+
26222648
process genotyping_angsd {
26232649
label 'mc_small'
26242650
tag "${samplename}"
@@ -3143,6 +3169,7 @@ process get_software_versions {
31433169
endorS.py --version &> v_endorSpy.txt || true
31443170
pileupCaller --version &> v_sequencetools.txt 2>&1 || true
31453171
bowtie2 --version | grep -a 'bowtie2-.* -fdebug' > v_bowtie2.txt || true
3172+
eigenstrat_snp_coverage --version | cut -d ' ' -f2 >v_eigenstrat_snp_coverage.txt || true
31463173
31473174
scrape_software_versions.py &> software_versions_mqc.yaml
31483175
"""
@@ -3180,6 +3207,7 @@ process multiqc {
31803207
file ('kraken/*') from ch_kraken_for_multiqc.collect().ifEmpty([])
31813208
file ('hops/*') from ch_hops_for_multiqc.collect().ifEmpty([])
31823209
file ('nuclear_contamination/*') from ch_nuclear_contamination_for_multiqc.collect().ifEmpty([])
3210+
file ('genotyping/*') from ch_eigenstrat_snp_cov_for_multiqc
31833211

31843212
file workflow_summary from ch_workflow_summary.collectFile(name: "workflow_summary_mqc.yaml")
31853213

0 commit comments

Comments
 (0)