docs/output.md
Lines changed: 2 additions & 2 deletions
Original file line number
Diff line number
Diff line change
@@ -76,7 +76,7 @@ The possible columns displayed by default are as follows:
76
76
* **Reads Mapped** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _prior_ map quality filtering.
77
77
* **Endogenous DNA (%)** This is from the endorS.py tool. It displays a percentage of mapped reads over total reads that went into mapped (i.e. the percentage DNA content of the library that matches the reference). Assuming a perfect ancient sample with no modern contamination, this would be the amount of true ancient DNA in the sample. However this value _most likely_ include contamination and will not entirely be the true 'endogenous' content.
78
78
* **Reads Mapped** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _after_ map quality filtering (note the column name does not distinguish itself from prior-map quality filtering, but the post-filter column is always second)
79
-
* **Endogenous DNA Post (%)** This is from the endorS.py tool. It displays a percentage of mapped reads _after_ BAM filtering (e.g. for mapping quality) over total reads that went into mapped (i.e. the percentage DNA content of the library that matches the reference). This column will only be displayed if BAM filtering is turned on and is based on the original mapping for total reads, and mapped reads as calculated from the post-filtering BAM.
79
+
* **Endogenous DNA Post (%)** This is from the endorS.py tool. It displays a percentage of mapped reads _after_ BAM filtering (i.e. for mapping quality and/or bam-level length filtering) over total reads that went into mapped (i.e. the percentage DNA content of the library that matches the reference). This column will only be displayed if BAM filtering is turned on and is based on the original mapping for total reads, and mapped reads as calculated from the post-filtering BAM.
80
80
* **ClusterFactor** This is from DeDup. This is a value representing the how many duplicates in the library exist for each unique read. A cluster factor close to one replicates a highly complex library and could be sequenced further. Generally with a value of more than 2 you will not be gaining much more information by sequencing deeper.
81
81
* **Dups** This is from Picard's markDuplicates. It represents the percentage of reads in your library that were exact duplicates of other reads in your database. The lower the better, as high duplication rate means lots of sequencing of the same information (and therefore is not time or cost effective).
82
82
* **X Prime Y>Z N base** These columns are from DamageProfiler. The prime numbers represent which end of the reads the damage is referring to. The Y>Z is the type of substitution (C>T is the true damage, G>A is the complementary). You should see for no- and half-UDG treatment a decrease in frequency from the 1st to 2nd base.
@@ -333,7 +333,7 @@ Ancient DNA samples typically have low endogenous DNA values, as most of the DNA
The main additional useful information compared to [Samtools](#samtools) is that these plots can inform you how many reads had multiple places on the reference the read could align to. This can occur with low complexity reads or reads derived from e.g. repetitive regions on the genome. If you have large amounts of multi-mapping reads, this can be a warning flag that there is an issue either with the reference genome or library itself (e.g. over-amplification of low-complexity regions or library construction artefacts). You should investigate cases like this more closely before using the data downstream.
336
+
The main additional useful information compared to [Samtools](#samtools) is that these plots can inform you how many reads had multiple places on the reference the read could align to. This can occur with low complexity reads or reads derived from e.g. repetitive regions on the genome. If you have large amounts of multi-mapping reads, this can be a warning flag that there is an issue either with the reference genome or library itself (e.g. library construction artefacts). You should investigate cases like this more closely before using the data downstream.
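The two endogenous DNA columns described in the `docs/output.md` hunk above share the same denominator (total reads that went into mapping) and differ only in the numerator (mapped reads before or after BAM filtering). A minimal Python sketch of that arithmetic follows. It is not the endorS.py implementation; the function name and the read counts are hypothetical and chosen only for illustration.

```python
def endogenous_percent(mapped_reads: int, total_reads: int) -> float:
    """Percentage of mapped reads over the total reads that went into mapping."""
    if total_reads <= 0:
        raise ValueError("total_reads must be a positive integer")
    if not 0 <= mapped_reads <= total_reads:
        raise ValueError("mapped_reads must be between 0 and total_reads")
    return 100.0 * mapped_reads / total_reads


# Hypothetical counts, for illustration only.
total_reads = 1_000_000   # reads that went into mapping
mapped_pre = 25_000       # mapped reads before BAM filtering
mapped_post = 18_000      # mapped reads after BAM filtering

# Both columns divide by the same total; only the numerator changes.
print(f"Endogenous DNA (%):      {endogenous_percent(mapped_pre, total_reads):.2f}")
print(f"Endogenous DNA Post (%): {endogenous_percent(mapped_post, total_reads):.2f}")
```

Because the denominator is unchanged, the post-filtering value can never exceed the pre-filtering value for the same library.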
docs/usage.md
Lines changed: 8 additions & 5 deletions
Original file line number
Diff line number
Diff line change
@@ -111,7 +111,7 @@ If `-profile` is not specified, the pipeline will run locally and expect all sof
111
111
* Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
112
112
* A generic configuration profile to be used with [Conda](https://conda.io/docs/)
113
113
* Pulls most software from [Bioconda](https://bioconda.github.io/)
114
-
* `test_tsv
114
+
* `test_tsv`
115
115
* A profile with a complete configuration for automated testing
116
116
* Includes links to test data so needs no other parameters
117
117
@@ -348,7 +348,10 @@ will have the following effects:
348
348
Note the following important points and limitations for setting up:
349
349
350
350
* The TSV must use actual tabs (not spaces) between cells.
351
+
* The input FASTQ filenames are discarded after FastQC, all other downstream results files are based on `Sample_Name`, `Library_ID` and `Lane` columns for filenames.
351
352
* *File* names must be unique regardless of file path, due to risk of over-writing (see: [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)).
353
+
* At different stages of the merging process, (as above) nf-core/eager will use as output filenames the information from the `Sample_Name`, `Library_ID` and `Lane` column columns for filenames.
354
+
* In other words, your .tsv file must not have rows with `Library1` and `Library1` for both `SampleA` and `SampleB`. While nf-core/eager would not try to _merge_ these, in some stages of the pipeline output files names would be the same, and would overwrite the other if the files are output to the same `results/` subdirectory.
352
355
* If it is 'too late' and you already have duplicate file names, a workaround is to concatenate the FASTQ files together and supply this to a nf-core/eager run. The only downside is that you will not get independent FASTQC results for each file.
353
356
* Lane IDs must be unique for each sequencing of each library.
354
357
* If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will still be processed correctly.
@@ -357,12 +360,12 @@ Note the following important points and limitations for setting up:
357
360
* You should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses.
358
361
* nf-core/eager will only merge multiple _lanes_ of sequencing runs with the same single-end or paired-end configuration
359
362
* Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files (unless you use `--run_convertbam`), as only FASTQ files are lane-merged together.
360
-
* Same libraries that are sequenced on different sequencing configurations (i.e single- and paired-end data), will be merged after mapping and will _always_ be considered 'paired-end' during downstream processes
361
-
* **Important** running DeDup in this context is _not_ recommended, as PE and SE data at the same position will _not_ be evaluated as duplicates. Therefore not all duplicates will be removed.
362
-
* When you wish to run PE/SE data together `-dedupper markduplicates` is therefore preferred.
363
+
* nf-core/eager is able to correctly handle libraries that are sequenced multiple times on different sequencing configurations (i.e mixtures of single- and paired-end data). These will be merged after mapping and considered 'paired-end' during downstream processes.
364
+
* **Important** we do not recommend choosing to use DeDup (i.e. `--dedupper 'dedup'`) when mixing PE and SE data, as SE data will not necessarily have the correct end position of the read, and DeDup requires both ends of the molecule to remove a duplicate read. Therefore you may end up with inflated (false-positive) coverages due to suboptimal deduplication.
365
+
* When you wish to run PE/SE data together, the default `-dedupper markduplicates` is therefore preferred, as it only looks at the first position. While more conservative (i.e. it'll remove more reads even if not technically duplicates, because it assumes it can't see the true ends of molecules), it is more consistent.
363
366
* An error will be thrown if you try to merge both PE and SE and also supply `--skip_merging`.
364
367
* If you truly want to mix SE data and PE data but using mate-pair info for PE mapping, please run FASTQ preprocessing mapping manually and supply BAM files for downstream processing by nf-core/eager
365
-
* If you _regularly_ want to run the situation above, please leave a feature request on github.
368
+
* If you _regularly_ want to run the situation above, please leave a feature request on github.
366
369
* DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on each unique library separately after deduplication (but prior same-treated library merging).
367
370
* nf-core/eager functionality such as `--run_trim_bam` will be applied to only non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries.
* Qualimap is run on each sample, after merging of libraries (i.e. your values will reflect the values of all libraries combined - after being damage trimmed etc.).
368
371
* Genotyping will be typically performed on each `sample` independently, as normally all libraries will have been merged together. However, if you have a mixture of single-stranded and double-stranded libraries, you will normally need to genotype separately. In this case you **must** give each the SS and DS libraries _distinct_ `Sample_IDs`; otherwise you will receive a `file collision` error in steps such as `sexdeterrmine`, and then you will need to merge these yourself. We will consider changing this behaviour in the future if there is enough interest.
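The TSV limitations listed in the `docs/usage.md` hunks above (real tabs between cells, no `Library_ID` shared between different `Sample_Name` values, unique lanes per library) can be checked before launching a run. The following is a minimal Python sketch of such a pre-flight check, not part of nf-core/eager. The function name `check_tsv` and the example rows are hypothetical, and it assumes only the three columns named in the text; any other columns in the file are ignored.

```python
import csv
import io
from collections import defaultdict


def check_tsv(handle):
    """Return a list of human-readable problems found in an input TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    required = {"Sample_Name", "Library_ID", "Lane"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        # A header separated by spaces parses as one single cell and ends up here.
        return [f"missing columns (are cells separated by real tabs?): {sorted(missing)}"]

    problems = []
    seen = set()
    samples_by_library = defaultdict(set)
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        key = (row["Sample_Name"], row["Library_ID"], row["Lane"])
        if key in seen:
            problems.append(f"line {line_no}: repeated Sample_Name/Library_ID/Lane {key}")
        seen.add(key)
        samples_by_library[row["Library_ID"]].add(row["Sample_Name"])

    # Output filenames are built from these columns, so a Library_ID reused
    # across samples risks one output file overwriting another.
    for library, samples in sorted(samples_by_library.items()):
        if len(samples) > 1:
            problems.append(f"Library_ID {library!r} is shared by samples {sorted(samples)}")
    return problems


# Hypothetical rows reproducing the Library1 / SampleA / SampleB case from the text.
example = (
    "Sample_Name\tLibrary_ID\tLane\n"
    "SampleA\tLibrary1\t1\n"
    "SampleB\tLibrary1\t1\n"
)
for problem in check_tsv(io.StringIO(example)):
    print(problem)
```

For a real file, the same function could be called as `check_tsv(open("input.tsv", newline=""))`, where `input.tsv` stands in for your own path.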