Commit e23fc6f

Merge pull request #485 from TCLamnidis/pileupcaller-fixes
Pileupcaller individuals together, based on library construction. #484
2 parents 319a6e5 + 1f890d5 commit e23fc6f

6 files changed

Lines changed: 147 additions & 30 deletions

assets/multiqc_config.yaml

Lines changed: 8 additions & 8 deletions
@@ -88,10 +88,10 @@ top_modules:
 - '*_postfilterflagstat.stats'
 - 'dedup'
 - 'picard'
+- 'preseq'
 - 'damageprofiler'
-- 'qualimap'
 - 'mtnucratio'
-- 'preseq'
+- 'qualimap'
 - 'sexdeterrmine'
 - 'gatk'
 - 'multivcfanalyzer':
@@ -151,13 +151,13 @@ table_columns_visible:
     3 Prime2: False
     mean_readlength: True
     median: True
+  mtnucratio:
+    mt_nuc_ratio: True
   QualiMap:
     mean_coverage: True
     1_x_pc: True
     5_x_pc: True
     percentage_aligned: False
-  mtnucratio:
-    mt_nuc_ratio: True
   MultiVCFAnalyzer:
     Heterozygous SNP alleles (percent): True
 
@@ -205,6 +205,10 @@ table_columns_placement:
     3 Prime2: 730
     mean_readlength: 740
     median: 750
+  mtnucratio:
+    mtreads: 760
+    mt_cov_avg: 770
+    mt_nuc_ratio: 780
   QualiMap:
     mean_coverage: 800
     median_coverage: 810
@@ -214,10 +218,6 @@ table_columns_placement:
     4_x_pc: 850
     5_x_pc: 860
     avg_gc: 870
-  mtnucratio:
-    mtreads: 900
-    mt_cov_avg: 910
-    mt_nuc_ratio: 920
   sexdeterrmine:
     RateX: 100
     RateY: 1010
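
The effect of the placement changes in this file can be sketched in a few lines of Python. This is only an illustration, assuming that MultiQC orders General Stats columns by ascending placement value (lower values further left). The dictionaries hold a subset of the values from the hunks, and `column_order` is a hypothetical helper, not part of MultiQC.

```python
# Subset of table_columns_placement values taken from the diff.
# Assumption: columns are ordered by ascending placement value.
before = {
    "median": 750,
    "mean_coverage": 800,
    "avg_gc": 870,
    "mtreads": 900,
    "mt_cov_avg": 910,
    "mt_nuc_ratio": 920,
}
after = {
    "median": 750,
    "mtreads": 760,
    "mt_cov_avg": 770,
    "mt_nuc_ratio": 780,
    "mean_coverage": 800,
    "avg_gc": 870,
}


def column_order(placement):
    """Return column names sorted by their placement value (hypothetical helper)."""
    return [name for name, _ in sorted(placement.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    print("before:", column_order(before))
    print("after: ", column_order(after))
```

Under that assumption, the mtnucratio columns move from after the QualiMap columns to before them, matching the reordering of the module list.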

conf/base.config

Lines changed: 4 additions & 0 deletions
@@ -76,6 +76,10 @@ process {
     errorStrategy = 'ignore'
   }
 
+  withName:damageprofiler {
+    errorStrategy = { task.exitStatus in [1,143,137,104,134,139] ? 'retry' : 'finish' }
+  }
+
   // Add 141 ignore due to unclean pipe closing by pmdtools https://github.com/pontussk/PMDtools/issues/7
   withName: pmdtools {
     errorStrategy = { task.exitStatus in [141] ? 'ignore' : 'retry' }
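
The two `errorStrategy` closures above are plain lookups on the task exit status. The following Python sketch mirrors that decision logic for illustration only. The function names are hypothetical, and the sketch does not model how Nextflow applies retries (for example, any retry limit).

```python
# Exit codes copied from the damageprofiler closure in the diff.
DAMAGEPROFILER_RETRY_CODES = {1, 143, 137, 104, 134, 139}


def damageprofiler_strategy(exit_status: int) -> str:
    """Mirror of the damageprofiler closure: retry on listed codes, else finish."""
    return "retry" if exit_status in DAMAGEPROFILER_RETRY_CODES else "finish"


def pmdtools_strategy(exit_status: int) -> str:
    """Mirror of the pmdtools closure: ignore exit status 141, retry anything else."""
    return "ignore" if exit_status == 141 else "retry"


if __name__ == "__main__":
    for code in (1, 137, 2):
        print("damageprofiler", code, damageprofiler_strategy(code))
    for code in (141, 1):
        print("pmdtools", code, pmdtools_strategy(code))
```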

docs/output.md

Lines changed: 17 additions & 1 deletion
@@ -108,6 +108,8 @@ For other non-default columns, hover over the column name for further descriptio
 
 [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your raw reads. It provides information about the quality score distribution across your reads, and the per base sequence content (%T/A/G/C) as sequenced. You also get information about adapter contamination and other overrepresented sequences.
 
+You will receive output for each supplied FASTQ file.
+
 When dealing with ancient DNA data the MultiQC plots for FastQC will often show lots of 'warning' or 'failed' samples. You generally can discard this sort of information as we are dealing with very degraded and metagenomic samples which have artefacts that violate the FastQC 'quality definitions', while still being valid data for aDNA researchers. Instead you will _normally_ be looking for 'global' patterns across all samples of a sequencing run to check for library construction or sequencing failures. The decision on whether an individual sample has 'failed' or not should be made by the user after checking all the plots themselves (e.g. if the sample is consistently an outlier to all others in the run).
 
 For further reading and documentation see the [FastQC help](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
@@ -243,6 +245,8 @@ In the case of dual-indexed paired-end sequencing, it is likely poly-G tails are
 
 While the MultiQC report has multiple plots for FastP, we will only look at GC content as that's the functionality we use currently.
 
+You will receive output for each supplied FASTQ file.
+
 #### GC Content
 
 This line plot shows the average GC content (Y axis) across each nucleotide of the reads (X-axis). There are two buttons per read (i.e. 2 for single-end, and 4 for paired-end) representing before and after the poly-G tail trimming.
@@ -274,6 +278,8 @@ Quality trimming (or 'truncating') involves looking at ends of reads for low-con
 
 Length filtering involves removing any read that does not reach the number of bases specified by a particular value.
 
+You will receive output for each FASTQ file supplied for single end data, or for each pair of merged FASTQ files for paired end data.
+
 #### Retained and Discarded Reads Plot
 
 These stacked bar plots are unfortunately a little confusing when displayed in MultiQC. However, they are relatively straightforward once you understand each category. They can be displayed as counts of reads per AdapterRemoval read-category, or as percentages of the same values. Each forward(/reverse) file combination is displayed once.
@@ -317,6 +323,8 @@ With paired-end ancient DNA sequencing runs You expect to see a slight increase
 
 This module provides numbers in raw counts of the mapping of your DNA reads to your reference genome.
 
+You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes, the lanes are merged and you will get mapping statistics of all lanes in one value.
+
 #### Flagstat Plot
 
 This dot plot shows different statistics, and the number of reads (typically as a multiple, e.g. millions or thousands) is represented by dots on the X axis.
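
To illustrate what "output for each _library_" means, here is a minimal Python sketch that collapses per-lane read counts into one value per library. The rows, library names and counts are invented for illustration, and `reads_per_library` is a hypothetical helper. The pipeline merges the lane BAM files themselves rather than summing numbers like this.

```python
from collections import defaultdict

# Hypothetical (library, lane, mapped_reads) rows, as might be derived
# from a TSV input sheet; all values are invented for illustration.
rows = [
    ("Lib1", 1, 1000),
    ("Lib1", 2, 1500),
    ("Lib2", 1, 700),
]


def reads_per_library(lane_rows):
    """Sum mapped reads over all lanes belonging to the same library."""
    totals = defaultdict(int)
    for library, _lane, mapped in lane_rows:
        totals[library] += mapped
    return dict(totals)


if __name__ == "__main__":
    print(reads_per_library(rows))
```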
@@ -335,6 +343,8 @@ The remaining rows will be 0 when running `bwa aln` as these characteristucs of
 
 ### DeDup
 
+You will receive output for each _library_. This means that if you use TSV input and have one library sequenced over multiple lanes merging, you will get mapping statistics of all lanes of the library in one value.
+
 #### Background
 
 DeDup is a duplicate removal tool which searches for PCR duplicates and removes them from your BAM file. We remove these duplicates because otherwise you would be artificially increasing your coverage and subsequently confidence in genotyping, by considering these lab artefacts which are not biologically meaningful. DeDup looks for reads with the same start and end coordinates, and whether they have exactly the same sequence. The main difference of DeDup versus e.g. `samtools markduplicates` is that DeDup considers _both_ ends of a read, not just the start position, so it is more precise in removing actual duplicates without penalising often already low aDNA data.
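
The difference between both-end and start-only duplicate detection described above can be sketched as follows. This is not DeDup's implementation, only an illustration of the idea. Reads are modelled as hypothetical `(start, end, sequence)` tuples, and both function names are invented.

```python
# Hypothetical reads as (start, end, sequence) tuples; values are invented.
reads = [
    (100, 104, "ACGT"),   # original molecule
    (100, 104, "ACGT"),   # PCR duplicate: same start, end and sequence
    (100, 106, "ACGTAC"), # different molecule sharing only the start position
]


def dedup_both_ends(read_list):
    """Keep one read per (start, end, sequence) combination."""
    seen = set()
    kept = []
    for read in read_list:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


def dedup_start_only(read_list):
    """Keep one read per start position, ignoring the end coordinate."""
    seen = set()
    kept = []
    for read in read_list:
        if read[0] not in seen:
            seen.add(read[0])
            kept.append(read)
    return kept


if __name__ == "__main__":
    print(len(dedup_both_ends(reads)), "reads kept using both ends")
    print(len(dedup_start_only(reads)), "reads kept using start only")
```

In this toy input the start-only approach discards a distinct molecule, which is the loss of already scarce aDNA reads that the paragraph above warns about.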
@@ -364,6 +374,8 @@ Things to look out for:
 
 ### Preseq
 
+You will receive output for each deduplicated _library_. This means that if you use TSV input and have one library sequenced over multiple lanes merging, you will get mapping statistics of all lanes of the library in one value.
+
 #### Background
 
 Preseq is a collection of tools that allow assessment of the complexity of the library, where complexity means the number of unique molecules in your library (i.e. not molecules with the exact same length and sequence).
@@ -390,6 +402,8 @@ Plateauing can be caused by a number of reasons:
 
 ### DamageProfiler
 
+You will receive output for each deduplicated _library_. This means that if you use TSV input and have one library sequenced over multiple lanes merging, you will get mapping statistics of all lanes of the library in one value.
+
 #### Background
 
 DamageProfiler is a tool which calculates a variety of standard 'aDNA' metrics from a BAM file. The primary plots here are the misincorporation and length distribution plots. Ancient DNA undergoes depurination and hydrolysis, causing fragmentation of molecules into gradually shorter fragments, and cytosine to thymine deamination damage, which occurs on the subsequent single-stranded overhangs at the ends of molecules.
@@ -431,14 +445,16 @@ When looking at the length distribution plots, keep in mind the following:
 
 ### QualiMap
 
-#### QualiMap
+#### Background
 
 Qualimap is a tool which provides statistics on the quality of the mapping of your reads to your reference genome. It allows you to assess how well covered your reference genome is by your data, both in 'fold' depth (average number of times a given base on the reference is covered by a read) and 'percentage' (the percentage of all bases on the reference genome that is covered at a given fold depth). These outputs allow you to decide whether you have enough quality data for downstream applications like genotyping, and how to adjust the parameters for those tools accordingly.
 
 > NB: Neither fold coverage nor percent coverage on their own is sufficient to assess whether you have a high quality mapping. Abnormally high fold coverages of a smaller region such as highly conserved genes or un-removed-adapter-containing reference genomes can artificially inflate the mean coverage, yet a high percent coverage is not useful if all bases of the genome are covered at just 1x coverage.
 
 Note that many of the statistics from this module are displayed in the General Stats table (see above), as they represent single values that are not plottable.
 
+You will receive output for each _sample_. This means you will get statistics of deduplicated values of all types of libraries combined in a single value (i.e. non-UDG treated, full-UDG, paired-end, single-end all together).
+
 #### Coverage Histogram
 
 This plot shows on the X axis the range of fold coverages that the bases of the reference genome are possibly covered by. The Y axis shows the number of bases that were covered at the given fold coverage depth as indicated on the X axis.
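
The relationship between mean fold coverage, percent coverage and the coverage histogram can be made concrete with a small Python sketch. The per-base depths below are invented, and the helper names are hypothetical. Qualimap computes these values from the BAM file itself.

```python
from collections import Counter

# Hypothetical per-base depths for a 100 bp reference: 90 uncovered bases
# and a 10 bp region covered 100 times (e.g. a highly conserved gene).
depths = [0] * 90 + [100] * 10


def mean_coverage(per_base_depths):
    """Average number of reads covering each reference base ('fold' depth)."""
    return sum(per_base_depths) / len(per_base_depths)


def percent_covered(per_base_depths, min_fold):
    """Percentage of reference bases covered at min_fold or deeper."""
    covered = sum(1 for d in per_base_depths if d >= min_fold)
    return 100.0 * covered / len(per_base_depths)


def coverage_histogram(per_base_depths):
    """Map each fold coverage (X axis) to the number of bases at that depth (Y axis)."""
    return Counter(per_base_depths)


if __name__ == "__main__":
    print("mean coverage:", mean_coverage(depths))
    print("percent at >=1x:", percent_covered(depths, 1))
    print("histogram:", dict(coverage_histogram(depths)))
```

Here the mean coverage looks healthy while only a tenth of the reference is covered at all, which is the situation the NB in the Background section warns about.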
