Commit 8855466

Merge pull request #500 from nf-core/fix-aws
Adjusted fixes for AWS index handling
2 parents 5e64de8 + e2a8b84 commit 8855466

11 files changed

Lines changed: 364 additions & 204 deletions

.github/workflows/ci.yml

Lines changed: 2 additions & 2 deletions
@@ -64,7 +64,7 @@ jobs:
  nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --save_reference
  - name: REFERENCE Basic workflow, with supplied indices
  run: |
- nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --bwa_index 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta' --fasta_index 'https://github.com/nf-core/test-datasets/blob/eager/reference/Mammoth/Mammoth_MT_Krause.fasta.fai'
+ nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --bwa_index 'results/reference_genome/bwa_index/BWAIndex/' --fasta_index 'https://github.com/nf-core/test-datasets/blob/eager/reference/Mammoth/Mammoth_MT_Krause.fasta.fai'
  - name: REFERENCE Run the basic pipeline with FastA reference with `fna` extension
  run: |
  nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_fna,docker
@@ -103,7 +103,7 @@ jobs:
  nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --mapper 'bowtie2' --bt2_alignmode 'local' --bt2_sensitivity 'sensitive' --bt2n 1 --bt2l 16 --bt2_trim5 1 --bt2_trim3 1
  - name: STRIP_FASTQ Run the basic pipeline with output unmapped reads as fastq
  run: |
- nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --strip_input_fastq
+ nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --strip_input_fastq
  - name: BAM_FILTERING Run basic mapping pipeline with mapping quality filtering, and unmapped export
  run: |
  nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_bam_filtering --bam_mapping_quality_threshold 37 --bam_unmapped_type 'fastq'

assets/dummy.txt

Lines changed: 0 additions & 1 deletion
This file was deleted.

assets/multiqc_config.yaml

Lines changed: 5 additions & 4 deletions
@@ -161,6 +161,7 @@ table_columns_visible:
  1_x_pc: True
  5_x_pc: True
  percentage_aligned: False
+ median_insert_size: False
  MultiVCFAnalyzer:
  Heterozygous SNP alleles (percent): True
  endorSpy:
@@ -204,11 +205,11 @@ table_columns_placement:
  flagstat_total: 551
  mapped_passed: 552
  Samtools Flagstat (post-samtools filter):
- flagstat_total: 553
- mapped_passed: 554
+ flagstat_total: 600
+ mapped_passed: 620
  endorSpy:
- endogenous_dna: 600
- endogenous_dna_post: 610
+ endogenous_dna: 610
+ endogenous_dna_post: 640
  nuclear_contamination:
  Num_SNPs: 1100
  Method1_MOM_estimate: 1110

assets/nf-core_eager_dummy.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+ This is a dummy file for when we need a 'fake' file to satisfy all nextflow channel inputs being filled, even if we actually only use one.

assets/nf-core_eager_dummy2.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+ This is a second dummy file for when we need a 'fake' file to satisfy all nextflow channel inputs being filled, even if we actually only use one.

conf/benchmarking_vikingfish.config

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ params {
  config_profile_description = "A 'fullsized' benchmarking profile for deepish sequencing aDNA data"

  //Input data
- input = 'https://raw.githubusercontent.com/jfy133/test-datasets/eager/testdata/Benchmarking/benchmarking_vikingfish.tsv'
+ input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_vikingfish.tsv'
  // Genome reference
  fasta = 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Gadus_morhua/representative/GCF_902167405.1_gadMor3.0/GCF_902167405.1_gadMor3.0_genomic.fna.gz'

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+ /*
+ * -------------------------------------------------
+ * Nextflow config file for running tests
+ * -------------------------------------------------
+ * Defines bundled input files and everything required
+ * to run a fast and simple test. Use as follows:
+ * nextflow run nf-core/eager -profile test, docker (or singularity, or conda)
+ */
+
+ params {
+ config_profile_name = 'nf-core/eager benchmarking - Viking Fish profile'
+ config_profile_description = "A 'fullsized' benchmarking profile for deepish sequencing aDNA data"
+
+ //Input data
+ input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_vikingfish_single.tsv'
+ // Genome reference
+ fasta = 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Gadus_morhua/representative/GCF_902167405.1_gadMor3.0/GCF_902167405.1_gadMor3.0_genomic.fna.gz'
+
+ bwaalnn = 0.04
+ bwaalnl = 1024
+
+ run_bam_filtering = true
+ bam_discard_unmapped = true
+ bam_unmapped_type = 'discard'
+ bam_mapping_quality_threshold = 25
+
+ run_genotyping = true
+ genotyping_tool = 'hc'
+ genotyping_source = 'raw'
+ gatk_ploidy = 2
+
+ }
+
+ process {
+ withName:'adapter_removal'{
+ cpus = { check_max( 8, 'cpus' ) }
+ memory = { check_max( 16.GB * task.attempt, 'memory' ) }
+ time = { check_max( 2.h * task.attempt, 'time' ) }
+ }
+ withName:'bwa'{
+ cpus = { check_max( 8, 'cpus' ) }
+ memory = { check_max( 16.GB * task.attempt, 'memory' ) }
+ time = { check_max( 8.h * task.attempt, 'time' ) }
+ }
+ withName:'dedup'{
+ cpus = { check_max( 8, 'cpus' ) }
+ memory = { check_max( 16.GB * task.attempt, 'memory' ) }
+ time = { check_max( 4.h * task.attempt, 'time' ) }
+ }
+ withName:'genotyping_hc'{
+ cpus = { check_max( 8, 'cpus' ) }
+ memory = { check_max( 16.GB * task.attempt, 'memory' ) }
+ time = { check_max( 8.h * task.attempt, 'time' ) }
+ }
+
+ }

docs/output.md

Lines changed: 3 additions & 2 deletions
@@ -98,15 +98,16 @@ The possible columns displayed by default are as follows:
  - **Mappability** This is from MALT. It reports the percentage of the off-target reads (from mapping), that could map to your MALT metagenomic database. This can often be low for aDNA due to short reads and database bias.
  - **% Unclassified** This is from Kraken. It reports the percentage of reads that could not be aligned and taxonomically assigned against your Kraken metagenomic database. This can often be high for aDNA due to short reads and database bias.
  - **Reads Mapped** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _prior_ map quality filtering and deduplication.
- - **Reads Mapped** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _after_ map quality filtering and deduplication (note the column name does not distinguish itself from prior-map quality filtering, but the post-filter column is always second)
  - **Endogenous DNA (%)** This is from the endorS.py tool. It displays a percentage of mapped reads over total reads that went into mapped (i.e. the percentage DNA content of the library that matches the reference). Assuming a perfect ancient sample with no modern contamination, this would be the amount of true ancient DNA in the sample. However this value _most likely_ include contamination and will not entirely be the true 'endogenous' content.
+ - **Reads Mapped** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _after_ map quality filtering and deduplication (note the column name does not distinguish itself from prior-map quality filtering, but the post-filter column is always second)
  - **Endogenous DNA Post (%)** This is from the endorS.py tool. It displays a percentage of mapped reads _after_ BAM filtering (e.g. for mapping quality) over total reads that went into mapped (i.e. the percentage DNA content of the library that matches the reference). This column will only be displayed if BAM filtering is turned on and is based on the original mapping for total reads, and mapped reads as calculated from the post-filtering BAM.
  - **ClusterFactor** This is from DeDup. This is a value representing the how many duplicates in the library exist for each unique read. A cluster factor close to one replicates a highly complex library and could be sequenced further. Generally with a value of more than 2 you will not be gaining much more information by sequencing deeper.
  - **Dups** This is from Picard's markDuplicates. It represents the percentage of reads in your library that were exact duplicates of other reads in your database. The lower the better, as high duplication rate means lots of sequencing of the same information (and therefore is not time or cost effective).
  - **X Prime Y>Z N base** These columns are from DamageProfiler. The prime numbers represent which end of the reads the damage is referring to. The Y>Z is the type of substitution (C>T is the true damage, G>A is the complementary). You should see for no- and half- UDG treatment a decrease in frequency from the 1st to 2nd base.
  - **Mean Read Length** This is from DamageProfiler. This is the mean length of all de-duplicated mapped reads. Ancient DNA normally will have a mean between 30-75, however this can vary.
  - **Median Read Length** This is from DamageProfiler. This is the median length of all de-duplicated mapped reads. Ancient DNA normally will have a mean between 30-75, however this can vary.
- - **Coverage** This is from Qualimap. This is the median number of times a base on your reference genome was covered by a read (i.e. depth coverage).. This average includes bases with 0 reads covering that position.
+ - **Algined** This is from Qualimap. This is the total number of _deduplicated_ reads that mapped to your reference genome.
+ - **Mean/Median Coverage** This is from Qualimap. This is the mean/median number of times a base on your reference genome was covered by a read (i.e. depth coverage). This average includes bases with 0 reads covering that position.
  - **>= 1X** to **>= 5X** These are from Qualimap. This is the percentage of the genome covered at that particular depth coverage.
  - **% GC** This is the mean GC content in percent of all mapped reads post-deduplication. This should normally be close to the GC content of your reference genome.
  - **MT to Nuclear Ratio** This from MTtoNucRatio. This reports the number of reads aligned to a mitochondrial entry in your reference FASTA to all other entries. This will typically be high but will vary depending on tissue type.

docs/usage.md

Lines changed: 5 additions & 2 deletions
@@ -192,6 +192,7 @@ If you have multiple files in different directories, you can use additional wild
  4. When using the pipeline with **paired end data**, the path must use `{1,2}` notation to specify read pairs.
  5. Files names must be unique, having files with the same name, but in different directories is _not_ sufficient
  - This can happen when a library has been sequenced across two sequencers on the same lane. Either rename the file, try a symlink with a unique name, or merge the two FASTQ files prior input.
+ 6. Due to limitations of downstream tools (e.g. FastQC), sample IDs maybe truncated after the first `.` in the name, Ensure file names are unique prior to this!

  ##### TSV Input Method

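Items 5 and 6 of the list in the hunk above can be checked before launching the pipeline. The following is a minimal sketch and not part of nf-core/eager: `find_truncated_duplicates` is a hypothetical helper name, and it assumes truncation happens at the first `.` of the file name, as the added note describes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nf-core/eager): print every file-name
# prefix that occurs more than once after directories are ignored and
# everything from the first '.' onwards is dropped.
find_truncated_duplicates() {
    local path name
    for path in "$@"; do
        name=${path##*/}              # drop the directory part
        printf '%s\n' "${name%%.*}"   # keep only the text before the first '.'
    done | sort | uniq -d
}
```

Calling `find_truncated_duplicates data/*/*.fq.gz` prints nothing when every truncated name is unique; any name it prints would need renaming or a uniquely named symlink.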
@@ -362,7 +363,7 @@ Use this if you do not have pre-made reference FASTA indices for `bwa`, `samtool

  #### `--bwa_index`

- If you want to use pre-existing `bwa index` indices, please supply the path **and file** to the FASTA you also specified in `--fasta` (see above). EAGER2 will automagically detect the index files by searching for the FASTA filename with the corresponding `bwa` index file suffixes.
+ If you want to use pre-existing `bwa index` indices, please supply the **directory** to the FASTA you also specified in `--fasta` (see above). EAGER2 will automagically detect the index files by searching for the FASTA filename with the corresponding `bwa` index file suffixes.

  For example:

@@ -371,7 +372,7 @@ nextflow run nf-core/eager \
  -profile test,docker \
  --input '*{R1,R2}*.fq.gz'
  --fasta 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta' \
- --bwa_index 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta'
+ --bwa_index 'results/reference_genome/bwa_index/BWAIndex/'
  ```

  > `bwa index` does not give you an option to supply alternative suffixes/names for these indices. Thus, the file names generated by this command _must not_ be changed, otherwise EAGER2 will not be able to find them.
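Because `--bwa_index` now takes a directory, it can be worth confirming that the directory really holds the index files for the FASTA passed to `--fasta`. The sketch below is an illustration only and not part of the pipeline: `check_bwa_index` is a hypothetical name, and it assumes the five suffixes that `bwa index` normally writes next to the FASTA (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of nf-core/eager).
# Usage: check_bwa_index <index_dir> <fasta_file_name>
# Returns 0 if all expected index files exist, 1 otherwise.
check_bwa_index() {
    local dir="${1%/}"      # tolerate a trailing slash on the directory
    local fasta="$2"
    local suffix
    local missing=0
    for suffix in amb ann bwt pac sa; do
        if [ ! -f "$dir/$fasta.$suffix" ]; then
            echo "missing: $dir/$fasta.$suffix" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the example in the hunk above this would be called as `check_bwa_index 'results/reference_genome/bwa_index/BWAIndex/' 'Mammoth_MT_Krause.fasta'`.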
@@ -734,6 +735,8 @@ Sets DeDup to treat all reads as merged reads. This is useful if reads are for e

  ### Library Complexity Estimation Parameters

+ nf-core/eager uses Preseq on map reads as one method to calculate library complexity. If DeDup is used, Preseq uses the historigram output of DeDup, otherwise the sored non-duplicated BAM file is supplied. Furthermore, if paired-end read collapsing is not performed, the `-P` flag is used.
+
  #### `--preseq_step_size`

  Can be used to configure the step size of Preseqs `c_curve` method. Can be useful when only few and thus shallow sequencing results are used for extrapolation.