docs/output.md: 16 additions & 16 deletions
@@ -51,7 +51,7 @@ For more information about how to use MultiQC reports, see [http://multiqc.info]
#### Background
-
This is the main summary table produced by MultiQC that the report begins with. This section of the report is generated by MultiQC itself rather than stats produced by a specific module. It shows whatever each module considers to be as the 'most important' values to be displayed * however the nf-core/eager version has been somewhat customised to make it as close to the EAGER (v1) ReportTable format as possible, with some opinionated tweaks.
+
This is the main summary table produced by MultiQC that the report begins with. This section of the report is generated by MultiQC itself rather than stats produced by a specific module. It shows whatever each module considers to be as the 'most important' values to be displayed — however the nf-core/eager version has been somewhat customised to make it as close to the EAGER (v1) ReportTable format as possible, with some opinionated tweaks.
#### Table
@@ -116,7 +116,7 @@ For further reading and documentation see the [FastQC help pages](http://www.bio
#### Sequence Counts
-
This shows a barplot with the overall number of sequences (x axis) in your raw library after demultiplexing, **per file** (y-axis). If you have paired end data, you will have one bar for Read 1 (or forward), and a second bar for Read 2 (or reverse). Each entire bar should represent approximately what you requested from the sequencer itself * unless you have your library sequenced over multiple lanes, where it should be what you request divided by the number of lanes it was split over.
+
This shows a barplot with the overall number of sequences (x axis) in your raw library after demultiplexing, **per file** (y-axis). If you have paired end data, you will have one bar for Read 1 (or forward), and a second bar for Read 2 (or reverse). Each entire bar should represent approximately what you requested from the sequencer itself — unless you have your library sequenced over multiple lanes, where it should be what you request divided by the number of lanes it was split over.
A section of the bar will also show an approximate estimate of the fraction of the total number of reads that are duplicates of another read. This can derive from over-amplification of the library, or lots of single adapters. This can be later checked with the Deduplication check. A good library and sequencing run should have a very low number of duplicate reads.
@@ -199,11 +199,11 @@ This plot is some-what similar to looking at duplication rate or 'cluster factor
A good library should have very low rates of duplication (vast majority of reads having a duplication rate of 1) * suggesting 'high complexity' or lots of unique reads and useful data. This is represented as a steep drop in the line plot and possible a very small curve at about a duplication rate of 2 or 3 and then remaining at ~0 for higher duplication rates.
+
A good library should have very low rates of duplication (vast majority of reads having a duplication rate of 1) — suggesting 'high complexity' or lots of unique reads and useful data. This is represented as a steep drop in the line plot and possible a very small curve at about a duplication rate of 2 or 3 and then remaining at ~0 for higher duplication rates.
Note that good libraries may sometimes have small peaks at high duplication levels. This may be due to free-adapters (with no inserts), or mono-nucleotide reads (e.g. GGGGG in NextSeq/NovaSeq data).
-
Bad libraries which have extremely low input DNA (so during amplification the same molecules been amplified repeatedly), or a good library that has been erroneously over-amplified will show very high duplication levels * so a very slowly decreasing curve. Alternatively, if your library construction failed and many adapters were not ligated to insert molecules, a high duplication rate may be caused by these free-adapters (see 'Overrepresented sequences' for more information).
+
Bad libraries which have extremely low input DNA (so during amplification the same molecules been amplified repeatedly), or a good library that has been erroneously over-amplified will show very high duplication levels — so a very slowly decreasing curve. Alternatively, if your library construction failed and many adapters were not ligated to insert molecules, a high duplication rate may be caused by these free-adapters (see 'Overrepresented sequences' for more information).
> **NB:** amplicon libraries such as for 16S rRNA analysis may appear here as having high duplication rates and these peaks can be ignored. This can be verified if no contaminants are found in the 'Overrepresented sequences' section.
@@ -227,7 +227,7 @@ This can already give you an indication on the authenticity of your library - as
If you have downloaded public data, this is often uploaded with adapters already removed, so you can expect a flat distribution straight away.
-
When comparing pre* and post-AdapterRemoval FASTQC plots of fresh sequencing data (assuming your sequencing center doesn't already remove adapters), you expect to see something similar to the left panel of the example above _pre-_ adapter removal and the right hand panel _post-_ adapter removal.
+
When comparing pre- and post-AdapterRemoval FASTQC plots of fresh sequencing data (assuming your sequencing center doesn't already remove adapters), you expect to see something similar to the left panel of the example above _pre-_ adapter removal and the right hand panel _post-_ adapter removal.
### FastP
@@ -353,7 +353,7 @@ Due to low 'endogenous' content of aDNA, and the high biodiversity of modern or
This can also be influenced by the type of database you supplied * many databases have an over-abundance of taxa of clinical or economic interest, so when you have a large amount of uncharacterised environmental taxa, this may also result in low mappability.
+
This can also be influenced by the type of database you supplied — many databases have an over-abundance of taxa of clinical or economic interest, so when you have a large amount of uncharacterised environmental taxa, this may also result in low mappability.
#### Taxonomic assignment success
@@ -372,15 +372,15 @@ there is some sequencing artefact (although it could just be badly preserved and
#### Background
-
Kraken is another metagenomic classifier, but takes a different approach to alignment as with [MALT](#malt). It uses 'K-mer similarity' between reads and references to very efficiently find similar patterns in sequences. It does not however, do alignment * meaning you cannot screen for authentication criteria such as damage patterns and fragment lengths.
+
Kraken is another metagenomic classifier, but takes a different approach to alignment as with [MALT](#malt). It uses 'K-mer similarity' between reads and references to very efficiently find similar patterns in sequences. It does not however, do alignment — meaning you cannot screen for authentication criteria such as damage patterns and fragment lengths.
It is useful when you do not have large computing power or you want a very rapid but rough approximation of the metagenomic profile of your sample.
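To make the idea of 'K-mer similarity' more concrete, the sketch below counts how many of a read's k-mers also occur in a reference sequence. It is a toy illustration only, not Kraken's actual algorithm or API (Kraken matches k-mers against a pre-built taxonomic database); the function names, the sequences and the value of `k` are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(read, reference, k=4):
    """Fraction of the read's distinct k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        # The read is shorter than k, so there is nothing to compare.
        return 0.0
    return len(read_kmers & kmers(reference, k)) / len(read_kmers)


# Hypothetical toy sequences: the read is an exact substring of the reference.
print(shared_kmer_fraction("ACGTACGT", "TTACGTACGTTT", k=4))
```

Because only k-mer membership is compared, nothing here records where or how a read aligns, which is why damage patterns and fragment lengths cannot be assessed from this kind of output.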
You will receive output for each *library*. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes and sequencing configurations in one value.
#### Top Taxa
-
This plot gives you an approximation of the abundance of the five top taxa identified. Typically for ancient DNA, this will be quite a small fraction of taxa, as archaeological and museum samples have a large biodiversity from environmental microbes * therefore a large fraction of 'unclassified' can be quite normal.
+
This plot gives you an approximation of the abundance of the five top taxa identified. Typically for ancient DNA, this will be quite a small fraction of taxa, as archaeological and museum samples have a large biodiversity from environmental microbes — therefore a large fraction of 'unclassified' can be quite normal.
@@ -424,10 +424,10 @@ DeDup is a duplicate removal tool which searches for PCR duplicates and removes
This stacked bar plot shows, as a whole, the total number of reads in the BAM file going into DeDup. The different sections of a given bar represent the following:
-
***Not Removed*** the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
-
***Reverse Removed*** the number of reads that found to be a duplicate of another and removed that were un-collapsed reverse reads (from the earlier read merging step).
-
***Forward Removed*** the number of reads that found to be a duplicate of another and removed that were an un-collapsed forward reads (from the earlier read merging step).
-
***Merged Removed*** the number of reads that were found to be a duplicate and removed that were a collapsed read (from the earlier read merging step).
+
***Not Removed**— the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
+
***Reverse Removed**— the number of reads that found to be a duplicate of another and removed that were un-collapsed reverse reads (from the earlier read merging step).
+
***Forward Removed**— the number of reads that found to be a duplicate of another and removed that were an un-collapsed forward reads (from the earlier read merging step).
+
***Merged Removed**— the number of reads that were found to be a duplicate and removed that were a collapsed read (from the earlier read merging step).
Exceptions to the above:
@@ -451,7 +451,7 @@ Picard is a toolkit for general BAM file manipulation with many different functi
#### Mark Duplicates
-
The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well*preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.
+
The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well-preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.
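The relationship between total, unique and duplicate reads described above can be sketched as follows. This is a minimal illustration, not how Picard or MultiQC compute their statistics: the function name `duplication_summary` and the example counts are hypothetical, and 'cluster factor' is assumed here to mean reads before deduplication divided by reads remaining after it.

```python
def duplication_summary(total_mapped, unique_reads):
    """Summarise duplication from read counts before and after deduplication."""
    if total_mapped <= 0 or not 0 < unique_reads <= total_mapped:
        raise ValueError("expected 0 < unique_reads <= total_mapped")
    duplicates = total_mapped - unique_reads
    return {
        "duplicates": duplicates,
        # Share of all mapped reads that were removed as duplicates.
        "duplication_rate": duplicates / total_mapped,
        # Assumed definition: reads before deduplication / reads after.
        "cluster_factor": total_mapped / unique_reads,
    }


# Hypothetical library: 1000 mapped reads, 800 left after deduplication.
print(duplication_summary(1000, 800))
```

Under these assumptions, a cluster factor close to 1 corresponds to the 'many unique reads and few duplicates' case, while a much larger value points towards over-amplification.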
This plot shows what percentage of the genome (X axis) is covered at a given fold depth of coverage (Y axis).
-
An ideal plot for this is to see an increasing curve, representing larger greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage * something particular true for large genome such has for humans.
+
An ideal plot for this is to see an increasing curve, representing larger greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage — something particular true for large genomes such as for humans.
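As a rough sketch of the values behind such a plot, the snippet below turns a list of per-position depths into the percentage of positions at each fold coverage. It is an assumption-laden toy, not the code of the tool that produces the report: `coverage_histogram` and the depth values are hypothetical.

```python
from collections import Counter


def coverage_histogram(depths):
    """Return {fold_depth: percent_of_positions} for per-position depths."""
    if not depths:
        raise ValueError("depths must not be empty")
    counts = Counter(depths)
    total = len(depths)
    return {depth: 100.0 * n / total for depth, n in sorted(counts.items())}


# Hypothetical low-coverage genome of 10 positions: most are uncovered.
low_coverage = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
for depth, percent in coverage_histogram(low_coverage).items():
    print(f"{depth}X: {percent:.0f}% of positions")
```

With these toy depths the largest share of positions sits at 0 fold coverage and the share shrinks as depth increases, which is the decreasing shape described for low-coverage ancient DNA data.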
@@ -596,7 +596,7 @@ Sex.DetERRmine calculates the coverage of your mapped reads on the X and Y chrom
When a bedfile of specific sites is provided, Sex.DetERRmine additionally calculates error bars around each relative coverage estimate. For this estimate to be trustworthy, the sites included in the bedfile should be spaced apart enough that a single sequencing read cannot overlap multiple sites. Hence, when a bedfile has not been provided, this error should be ignored. When a suitable bedfile is provided, each observation of a covered site is independent, and the error around the coverage is equal to the binomial error estimate. This error is then propagated during the calculation of relative coverage for the X and Y chromosomes.
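The two steps described above (a binomial error on each estimate, then propagation into a relative coverage) can be sketched with textbook formulas. This is not Sex.DetERRmine's own code: the function names and numbers are hypothetical, and the propagation assumes the two estimates are independent and uses a first-order approximation for a ratio.

```python
import math


def binomial_se(successes, trials):
    """Standard error of a proportion under a binomial model."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("expected 0 <= successes <= trials and trials > 0")
    p = successes / trials
    return math.sqrt(p * (1 - p) / trials)


def ratio_with_error(a, se_a, b, se_b):
    """Ratio a / b with its first-order propagated error.

    Assumes a and b are independent and non-zero.
    """
    ratio = a / b
    relative_error = math.sqrt((se_a / a) ** 2 + (se_b / b) ** 2)
    return ratio, abs(ratio) * relative_error


# Hypothetical counts: 50 of 100 site observations fall in the class of interest.
print(binomial_se(50, 100))
# Hypothetical estimates and errors for a numerator and denominator coverage.
print(ratio_with_error(0.5, 0.05, 1.0, 0.1))
```

The independence assumption in this sketch mirrors the requirement in the text that sites are spaced far enough apart for each observation to be independent.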
-
> Note that in nf-core/eager this will be run on single* and double-stranded variants of the same library _separately_. This can also help assess for differential contamination between libraries.
+
> Note that in nf-core/eager this will be run on single- and double-stranded variants of the same library _separately_. This can also help assess for differential contamination between libraries.
#### Relative Coverage
@@ -632,7 +632,7 @@ This table shows the contents of the `snpStatistics.tsv` file produced by MultiV
You can get different variants of the call statistics bar plot, depending on how you configured the MultiVCFAnalyzer options.
-
If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains * particularly when dealing with low coverage data. Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).
+
If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains — particularly when dealing with low coverage data. Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).
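The effect of setting the two thresholds to different values can be illustrated with a simplified classifier. This is a sketch under assumptions, not MultiVCFAnalyzer's actual calling logic: the function `classify_allele`, the label names and the threshold values are hypothetical, and only the parameter names are borrowed from the flags mentioned above.

```python
from collections import Counter


def classify_allele(freq, min_allele_freq_hom, min_allele_freq_het):
    """Label an allele frequency as 'hom', 'het' or 'uncalled' (simplified)."""
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must be between 0 and 1")
    if min_allele_freq_het > min_allele_freq_hom:
        raise ValueError("het threshold must not exceed hom threshold")
    if freq >= min_allele_freq_hom:
        return "hom"
    if freq >= min_allele_freq_het:
        return "het"
    return "uncalled"


# Hypothetical allele frequencies at five variant positions.
freqs = [0.98, 0.95, 0.55, 0.30, 0.05]
counts = Counter(classify_allele(f, 0.9, 0.2) for f in freqs)
print(dict(counts))
```

In this simplified model, setting both thresholds to the same value leaves no room for 'het' labels at all, which is why the multi-allelic bar only becomes informative when the two values differ.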