Commit cb538bb

jfy133 and TCLamnidis authored
Apply suggestions from code review
Co-authored-by: Thiseas C. Lamnidis <thisseass@gmail.com>
1 parent fdbe93f commit cb538bb

1 file changed

Lines changed: 16 additions & 16 deletions

docs/output.md

@@ -51,7 +51,7 @@ For more information about how to use MultiQC reports, see [http://multiqc.info]
 
 #### Background
 
-This is the main summary table produced by MultiQC that the report begins with. This section of the report is generated by MultiQC itself rather than stats produced by a specific module. It shows whatever each module considers to be as the 'most important' values to be displayed * however the nf-core/eager version has been somewhat customised to make it as close to the EAGER (v1) ReportTable format as possible, with some opinionated tweaks.
+This is the main summary table produced by MultiQC that the report begins with. This section of the report is generated by MultiQC itself rather than stats produced by a specific module. It shows whatever each module considers to be as the 'most important' values to be displayed - however the nf-core/eager version has been somewhat customised to make it as close to the EAGER (v1) ReportTable format as possible, with some opinionated tweaks.
 
 #### Table
 
@@ -116,7 +116,7 @@ For further reading and documentation see the [FastQC help pages](http://www.bio
 
 #### Sequence Counts
 
-This shows a barplot with the overall number of sequences (x axis) in your raw library after demultiplexing, **per file** (y-axis). If you have paired end data, you will have one bar for Read 1 (or forward), and a second bar for Read 2 (or reverse). Each entire bar should represent approximately what you requested from the sequencer itself * unless you have your library sequenced over multiple lanes, where it should be what you request divided by the number of lanes it was split over.
+This shows a barplot with the overall number of sequences (x axis) in your raw library after demultiplexing, **per file** (y-axis). If you have paired end data, you will have one bar for Read 1 (or forward), and a second bar for Read 2 (or reverse). Each entire bar should represent approximately what you requested from the sequencer itself - unless you have your library sequenced over multiple lanes, where it should be what you request divided by the number of lanes it was split over.
 
 A section of the bar will also show an approximate estimation of the fraction of the total number of reads that are duplicates of another. This can derive from over-amplification of the library, or lots of single adapters. This can be later checked with the Deduplication check. A good library and sequencing run should have very low amounts of duplicates reads.
 
@@ -199,11 +199,11 @@ This plot is some-what similar to looking at duplication rate or 'cluster factor
 <img src="images/output/fastqc/fastqc_sequence_duplication_level.png" width="75%" height = "75%">
 </p>
 
-A good library should have very low rates of duplication (vast majority of reads having a duplication rate of 1) * suggesting 'high complexity' or lots of unique reads and useful data. This is represented as a steep drop in the line plot and possible a very small curve at about a duplication rate of 2 or 3 and then remaining at ~0 for higher duplication rates.
+A good library should have very low rates of duplication (vast majority of reads having a duplication rate of 1) - suggesting 'high complexity' or lots of unique reads and useful data. This is represented as a steep drop in the line plot and possible a very small curve at about a duplication rate of 2 or 3 and then remaining at ~0 for higher duplication rates.
 
 Note that good libraries may sometimes have small peaks at high duplication levels. This maybe due to free-adapters (with no inserts), or mono-nucleotide reads (e.g. GGGGG in NextSeq/NovaSeq data).
 
-Bad libraries which have extremely low input DNA (so during amplification the same molecules been amplified repeatedly), or a good library that has been erroneously over-amplified will show very high duplication levels * so a very slowly decreasing curve. Alternatively, if your library construction failed and many adapters were not ligated to insert molecules, a high duplication rate may be caused by these free-adapters (see 'Overrepresented sequences' for more information).
+Bad libraries which have extremely low input DNA (so during amplification the same molecules been amplified repeatedly), or a good library that has been erroneously over-amplified will show very high duplication levels - so a very slowly decreasing curve. Alternatively, if your library construction failed and many adapters were not ligated to insert molecules, a high duplication rate may be caused by these free-adapters (see 'Overrepresented sequences' for more information).
 
 > **NB:** amplicon libraries such as for 16S rRNA analysis may appear here as having high duplication rates and these peaks can be ignored. This can be verified if no contaminants are found in the 'Overrepresented sequences' section.
 
@@ -227,7 +227,7 @@ This can already give you an indication on the authenticity of your library - as
 
 If you have downloaded public data this often is uploaded with adapters already removed, so you can expect a flat distribution straight away.
 
-When comparing pre* and post-AdapterRemoval FASTQC plots of fresh sequencing data (assuming your sequencing center doesn't already remove adapters), you expect to see something similar to the left panel of the example above _pre-_ adapter removal and the right hand panel _post-_ adapter removal.
+When comparing pre- and post-AdapterRemoval FASTQC plots of fresh sequencing data (assuming your sequencing center doesn't already remove adapters), you expect to see something similar to the left panel of the example above _pre-_ adapter removal and the right hand panel _post-_ adapter removal.
 
 ### FastP
 
@@ -353,7 +353,7 @@ Due to low 'endogenous' content of aDNA, and the high biodiversity of modern or
 <img src="images/output/malt/malt_metagenomic_mappability.png" width="75%" height = "75%">
 </p>
 
-This can also be influenced by the type of database you supplied * many databases have an over-abundance of taxa of clinical or economic interest, so when you have a large amount of uncharacterised environmental taxa, this may also result in low mappability.
+This can also be influenced by the type of database you supplied - many databases have an over-abundance of taxa of clinical or economic interest, so when you have a large amount of uncharacterised environmental taxa, this may also result in low mappability.
 
 #### Taxonomic assignment success
 
@@ -372,15 +372,15 @@ there is some sequencing artefact (although it could just be badly preserved and
 
 #### Background
 
-Kraken is another metagenomic classifier, but takes a different approach to alignment as with [MALT](#malt). It uses 'K-mer similarity' between reads and references to very efficiently find similar patterns in sequences. It does not however, do alignment * meaning you cannot screen for authentication criteria such as damage patterns and fragment lengths.
+Kraken is another metagenomic classifier, but takes a different approach to alignment as with [MALT](#malt). It uses 'K-mer similarity' between reads and references to very efficiently find similar patterns in sequences. It does not however, do alignment - meaning you cannot screen for authentication criteria such as damage patterns and fragment lengths.
 
 It is useful when you do not have large computing power or you want very rapid but rough approximation of the metagenomic profile of your sample.
 
 You will receive output for each *library*. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes and sequencing configurations in one value.
 
 #### Top Taxa
 
-This plot gives you an approximation of the abundance of the five top taxa identified. Typically for ancient DNA, this will be quite a small fraction of taxa, as archaeological and museum samples have a large biodiversity from environmental microbes * therefore a large fraction of 'unclassified' can be quite normal.
+This plot gives you an approximation of the abundance of the five top taxa identified. Typically for ancient DNA, this will be quite a small fraction of taxa, as archaeological and museum samples have a large biodiversity from environmental microbes - therefore a large fraction of 'unclassified' can be quite normal.
 
 <p align="center">
 <img src="images/output/kraken/kraken_top_taxa.png" width="75%" height = "75%">
@@ -424,10 +424,10 @@ DeDup is a duplicate removal tool which searches for PCR duplicates and removes
 
 This stacked bar plot shows as a whole the total number of reads in the BAM file going into DeDup. The different sections of a given bar represents the following:
 
-* **Not Removed** * the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
-* **Reverse Removed** * the number of reads that found to be a duplicate of another and removed that were un-collapsed reverse reads (from the earlier read merging step).
-* **Forward Removed** * the number of reads that found to be a duplicate of another and removed that were an un-collapsed forward reads (from the earlier read merging step).
-* **Merged Removed** * the number of reads that were found to be a duplicate and removed that were a collapsed read (from the earlier read merging step).
+* **Not Removed** - the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
+* **Reverse Removed** - the number of reads that found to be a duplicate of another and removed that were un-collapsed reverse reads (from the earlier read merging step).
+* **Forward Removed** - the number of reads that found to be a duplicate of another and removed that were an un-collapsed forward reads (from the earlier read merging step).
+* **Merged Removed** - the number of reads that were found to be a duplicate and removed that were a collapsed read (from the earlier read merging step).
 
 Exceptions to the above:
 
@@ -451,7 +451,7 @@ Picard is a toolkit for general BAM file manipulation with many different functi
 
 #### Mark Duplicates
 
-The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well* preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.
+The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well-preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.
 
 <p align="center">
 <img src="images/output/picard/picard_deduplication_stats.png" width="75%" height = "75%">
@@ -568,7 +568,7 @@ Things to watch out for:
 
 This plot shows how much of the genome in percentage (X axis) is covered by a given fold depth coverage (Y axis).
 
-An ideal plot for this is to see an increasing curve, representing larger greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage * something particular true for large genome such has for humans.
+An ideal plot for this is to see an increasing curve, representing larger greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage - something particular true for large genomes such as for humans.
 
 <p align="center">
 <img src="images/output/qualimap/qualimap_cumulative_genome_coverage.png" width="75%" height = "75%">
@@ -596,7 +596,7 @@ Sex.DetERRmine calculates the coverage of your mapped reads on the X and Y chrom
 
 When a bedfile of specific sites is provided, Sex.DetERRmine additionally calculates error bars around each relative coverage estimate. For this estimate to be trustworthy, the sites included in the bedfile should be spaced apart enough that a single sequencing read cannot overlap multiple sites. Hence, when a bedfile has not been provided, this error should be ignored. When a suitable bedfile is provided, each observation of a covered site is independent, and the error around the coverage is equal to the binomial error estimate. This error is then propagated during the calculation of relative coverage for the X and Y chromosomes.
 
-> Note that in nf-core/eager this will be run on single* and double-stranded variants of the same library _separately_. This can also help assess for differential contamination between libraries.
+> Note that in nf-core/eager this will be run on single- and double-stranded variants of the same library _separately_. This can also help assess for differential contamination between libraries.
 
 #### Relative Coverage
 
@@ -632,7 +632,7 @@ This table shows the contents of the `snpStatistics.tsv` file produced by MultiV
 
 You can get different variants of the call statistics bar plot, depending on how you configured the MultiVCFAnalyzer options.
 
-If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains * particularly when dealing with low coverage data. Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).
+If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains - particularly when dealing with low coverage data. Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).
 
 <p align="center">
 <img src="images/output/multivcfanalyzer/multivcfanalyzer_call_categories.png" width="75%" height = "75%">
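Reviewer note on the MultiVCFAnalyzer paragraph touched in the last hunk: the 'het' reasoning can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the function names, the default thresholds (0.9 and 0.1), and the simplified rule that a position counts as 'het' when the alternative-allele frequency falls between the two thresholds. It is not MultiVCFAnalyzer's actual implementation.

```python
# Hypothetical illustration: classify positions by alternative-allele frequency.
# Thresholds mirror the idea of --min_allele_freq_hom / --min_allele_freq_het,
# but the values and the rule itself are assumptions, not the tool's code.

def classify_position(alt_reads, total_reads, min_freq_hom=0.9, min_freq_het=0.1):
    """Return 'hom_alt', 'het', 'ref' or 'no_call' for one position."""
    if total_reads <= 0:
        return "no_call"  # no coverage, nothing to classify
    freq = alt_reads / total_reads
    if freq >= min_freq_hom:
        return "hom_alt"  # nearly all reads support the alternative allele
    if freq >= min_freq_het:
        return "het"      # mixed signal: possible cross-mapping or multiple strains
    return "ref"          # alternative allele too rare to call


def het_fraction(positions, **thresholds):
    """Fraction of called positions classified as 'het'.

    `positions` is an iterable of (alt_reads, total_reads) tuples.
    """
    calls = [classify_position(alt, total, **thresholds) for alt, total in positions]
    called = [c for c in calls if c != "no_call"]
    if not called:
        return 0.0
    return called.count("het") / len(called)
```

A high `het_fraction` on a haploid genome would, under these assumptions, be the cue to inspect the alignments manually before trusting the consensus sequence.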
