Apply suggestions from code review

jfy133 · TCLamnidis · web-flow · commit cb538bb529e5 · 2021-04-08T10:39:38.000+02:00
Co-authored-by: Thiseas C. Lamnidis <thisseass@gmail.com>
diff --git a/docs/output.md b/docs/output.md
@@ -51,7 +51,7 @@ For more information about how to use MultiQC reports, see [http://multiqc.info]
 
 #### Background
 
-This is the main summary table produced by MultiQC that the report begins with. This section of the report is generated by MultiQC itself rather than stats produced by a specific module. It shows whatever each module considers to be as the 'most important' values to be displayed * however the nf-core/eager version has been somewhat customised to make it as close to the EAGER (v1) ReportTable format as possible, with some opinionated tweaks.
+This is the main summary table produced by MultiQC that the report begins with. This section of the report is generated by MultiQC itself rather than stats produced by a specific module. It shows whatever each module considers to be as the 'most important' values to be displayed — however the nf-core/eager version has been somewhat customised to make it as close to the EAGER (v1) ReportTable format as possible, with some opinionated tweaks.
 
 #### Table
 
@@ -116,7 +116,7 @@ For further reading and documentation see the [FastQC help pages](http://www.bio
 
 #### Sequence Counts
 
-This shows a barplot with the overall number of sequences (x axis) in your raw library after demultiplexing, **per file** (y-axis). If you have paired end data, you will have one bar for Read 1 (or forward), and a second bar for Read 2 (or reverse). Each entire bar should represent approximately what you requested from the sequencer itself * unless you have your library sequenced over multiple lanes, where it should be what you request divided by the number of lanes it was split over.
+This shows a barplot with the overall number of sequences (x axis) in your raw library after demultiplexing, **per file** (y-axis). If you have paired end data, you will have one bar for Read 1 (or forward), and a second bar for Read 2 (or reverse). Each entire bar should represent approximately what you requested from the sequencer itself — unless you have your library sequenced over multiple lanes, where it should be what you request divided by the number of lanes it was split over.
 
 A section of the bar will also show an approximate estimation of the fraction of the total number of reads that are duplicates of another. This can derive from over-amplification of the library, or lots of single adapters. This can be later checked with the Deduplication check. A good library and sequencing run should have very low amounts of duplicates reads.
 
@@ -199,11 +199,11 @@ This plot is some-what similar to looking at duplication rate or 'cluster factor
   <img src="images/output/fastqc/fastqc_sequence_duplication_level.png" width="75%" height = "75%">
 </p>
 
-A good library should have very low rates of duplication (vast majority of reads having a duplication rate of 1) * suggesting 'high complexity' or lots of unique reads and useful data. This is represented as a steep drop in the line plot and possible a very small curve at about a duplication rate of 2 or 3 and then remaining at ~0 for higher duplication rates.
+A good library should have very low rates of duplication (vast majority of reads having a duplication rate of 1) — suggesting 'high complexity' or lots of unique reads and useful data. This is represented as a steep drop in the line plot and possible a very small curve at about a duplication rate of 2 or 3 and then remaining at ~0 for higher duplication rates.
 
 Note that good libraries may sometimes have small peaks at high duplication levels. This maybe due to free-adapters (with no inserts), or mono-nucleotide reads (e.g. GGGGG in NextSeq/NovaSeq data).
 
-Bad libraries which have extremely low input DNA (so during amplification the same molecules been amplified repeatedly), or a good library that has been erroneously over-amplified will show very high duplication levels * so a very slowly decreasing curve. Alternatively, if your library construction failed and many adapters were not ligated to insert molecules, a high duplication rate may be caused by these free-adapters (see 'Overrepresented sequences' for more information).
+Bad libraries which have extremely low input DNA (so during amplification the same molecules been amplified repeatedly), or a good library that has been erroneously over-amplified will show very high duplication levels — so a very slowly decreasing curve. Alternatively, if your library construction failed and many adapters were not ligated to insert molecules, a high duplication rate may be caused by these free-adapters (see 'Overrepresented sequences' for more information).
 
 > **NB:** amplicon libraries such as for 16S rRNA analysis may appear here as having high duplication rates and these peaks can be ignored. This can be verified if no contaminants are found in the 'Overrepresented sequences' section.
 
@@ -227,7 +227,7 @@ This can already give you an indication on the authenticity of your library - as
 
 If you have downloaded public data this often is uploaded with adapters already removed, so you can expect a flat distribution straight away.
 
-When comparing pre* and post-AdapterRemoval FASTQC plots of fresh sequencing data (assuming your sequencing center doesn't already remove adapters), you expect to see something similar to the left panel of the example above _pre-_ adapter removal and the right hand panel _post-_ adapter removal.
+When comparing pre- and post-AdapterRemoval FASTQC plots of fresh sequencing data (assuming your sequencing center doesn't already remove adapters), you expect to see something similar to the left panel of the example above _pre-_ adapter removal and the right hand panel _post-_ adapter removal.
 
 ### FastP
 
@@ -353,7 +353,7 @@ Due to low 'endogenous' content of aDNA, and the high biodiversity of modern or
   <img src="images/output/malt/malt_metagenomic_mappability.png" width="75%" height = "75%">
 </p>
 
- This can also be influenced by the type of database you supplied * many databases have an over-abundance of taxa of clinical or economic interest, so when you have a large amount of uncharacterised environmental taxa, this may also result in low mappability.
+ This can also be influenced by the type of database you supplied — many databases have an over-abundance of taxa of clinical or economic interest, so when you have a large amount of uncharacterised environmental taxa, this may also result in low mappability.
 
 #### Taxonomic assignment success
 
@@ -372,15 +372,15 @@ there is some sequencing artefact (although it could just be badly preserved and
 
 #### Background
 
-Kraken is another metagenomic classifier, but takes a different approach to alignment as with [MALT](#malt). It uses 'K-mer similarity' between reads and references to very efficiently find similar patterns in sequences. It does not however, do alignment * meaning you cannot screen for authentication criteria such as damage patterns and fragment lengths.
+Kraken is another metagenomic classifier, but takes a different approach to alignment as with [MALT](#malt). It uses 'K-mer similarity' between reads and references to very efficiently find similar patterns in sequences. It does not however, do alignment — meaning you cannot screen for authentication criteria such as damage patterns and fragment lengths.
 
 It is useful when you do not have large computing power or you want very rapid but rough approximation of the metagenomic profile of your sample.
 
 You will receive output for each *library*. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes and sequencing configurations in one value.
 
 #### Top Taxa
 
-This plot gives you an approximation of the abundance of the five top taxa identified. Typically for ancient DNA, this will be quite a small fraction of taxa, as archaeological and museum samples have a large biodiversity from environmental microbes * therefore a large fraction of 'unclassified' can be quite normal.
+This plot gives you an approximation of the abundance of the five top taxa identified. Typically for ancient DNA, this will be quite a small fraction of taxa, as archaeological and museum samples have a large biodiversity from environmental microbes — therefore a large fraction of 'unclassified' can be quite normal.
 
 <p align="center">
   <img src="images/output/kraken/kraken_top_taxa.png" width="75%" height = "75%">
@@ -424,10 +424,10 @@ DeDup is a duplicate removal tool which searches for PCR duplicates and removes
 
 This stacked bar plot shows as a whole the total number of reads in the BAM file going into DeDup. The different sections of a given bar represents the following:
 
-* **Not Removed** * the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
-* **Reverse Removed** * the number of reads that found to be a duplicate of another and removed that were un-collapsed reverse reads (from the earlier read merging step).
-* **Forward Removed** * the number of reads that found to be a duplicate of another and removed that were an un-collapsed forward reads (from the earlier read merging step).
-* **Merged Removed** * the number of reads that were found to be a duplicate and removed that were a collapsed read (from the earlier read merging step).
+* **Not Removed** — the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
+* **Reverse Removed** — the number of reads that found to be a duplicate of another and removed that were un-collapsed reverse reads (from the earlier read merging step).
+* **Forward Removed** — the number of reads that found to be a duplicate of another and removed that were an un-collapsed forward reads (from the earlier read merging step).
+* **Merged Removed** — the number of reads that were found to be a duplicate and removed that were a collapsed read (from the earlier read merging step).
   
 Exceptions to the above:
 
@@ -451,7 +451,7 @@ Picard is a toolkit for general BAM file manipulation with many different functi
 
 #### Mark Duplicates
 
-The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well* preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.
+The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well-preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.
 
 <p align="center">
   <img src="images/output/picard/picard_deduplication_stats.png" width="75%" height = "75%">
@@ -568,7 +568,7 @@ Things to watch out for:
 
 This plot shows how much of the genome in percentage (X axis) is covered by a given fold depth coverage (Y axis).
 
-An ideal plot for this is to see an increasing curve, representing larger greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage * something particular true for large genome such has for humans.
+An ideal plot for this is to see an increasing curve, representing larger greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage — something particular true for large genomes such as for humans.
 
 <p align="center">
   <img src="images/output/qualimap/qualimap_cumulative_genome_coverage.png" width="75%" height = "75%">
@@ -596,7 +596,7 @@ Sex.DetERRmine calculates the coverage of your mapped reads on the X and Y chrom
 
 When a bedfile of specific sites is provided, Sex.DetERRmine additionally calculates error bars around each relative coverage estimate. For this estimate to be trustworthy, the sites included in the bedfile should be spaced apart enough that a single sequencing read cannot overlap multiple sites. Hence, when a bedfile has not been provided, this error should be ignored. When a suitable bedfile is provided, each observation of a covered site is independent, and the error around the coverage is equal to the binomial error estimate. This error is then propagated during the calculation of relative coverage for the X and Y chromosomes.
 
-> Note that in nf-core/eager this will be run on single* and double-stranded variants of the same library _separately_. This can also help assess for differential contamination between libraries.
+> Note that in nf-core/eager this will be run on single- and double-stranded variants of the same library _separately_. This can also help assess for differential contamination between libraries.
 
 #### Relative Coverage
 
@@ -632,7 +632,7 @@ This table shows the contents of the `snpStatistics.tsv` file produced by MultiV
 
 You can get different variants of the call statistics bar plot, depending on how you configured  the MultiVCFAnalyzer options.
 
-If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains * particularly when dealing with low coverage data. Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).
+If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains — particularly when dealing with low coverage data. Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).
 
 <p align="center">
   <img src="images/output/multivcfanalyzer/multivcfanalyzer_call_categories.png" width="75%" height = "75%">