Commit b760924

Merge pull request #642 from jfy133/damage-scaling

Add Damage rescaling functionality

2 parents: 7ead9a0 + fff0364

11 files changed: 107 additions & 244 deletions

.github/workflows/ci.yml

Lines changed: 3 additions & 0 deletions

```diff
@@ -186,3 +186,6 @@ jobs:
       - name: MTNUCRATIO Run basic pipeline with bam input profile, but don't convert BAM, skip everything but nmtnucratio
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --skip_preseq --skip_damage_calculation --run_mtnucratio
+      - name: RESCALING Run basic pipeline with mapDamage rescaling of BAM files. Note this will be slow
+        run: |
+          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_mapdamage_rescaling --run_genotyping --genotyping_tool hc --genotyping_source 'rescaled'
```

CHANGELOG.md

Lines changed: 5 additions & 2 deletions

```diff
@@ -7,11 +7,14 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

 ### `Added`

+- [#583](https://github.com/nf-core/eager/issues/583) - mapDamage2 rescaling of BAM files to remove damage
+
 ### `Fixed`

 - Removed leftover old DockerHub push CI commands.
-- [#627](https://github.com/nf-core/eager/issues/627) Added de Barros Damgaard citation to README
-- [#630](https://github.com/nf-core/eager/pull/630) Better handling of Qualimap memory requirements and error strategy.
+- [#627](https://github.com/nf-core/eager/issues/627) - Added de Barros Damgaard citation to README
+- [#630](https://github.com/nf-core/eager/pull/630) - Better handling of Qualimap memory requirements and error strategy.
+- Fixed some incomplete schema options to ensure users supply valid input values

 ### `Dependencies`

```
README.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -236,6 +236,7 @@ In addition, references of tools and data used in this pipeline are as follows:
 * **Bowtie2** Langmead, B. and Salzberg, S. L. 2012 Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), p. 357–359. doi: [10.1038/nmeth.1923](https://dx.doi.org/10.1038/nmeth.1923).
 * **sequenceTools** Stephan Schiffels (Unpublished). Download: [https://github.com/stschiff/sequenceTools](https://github.com/stschiff/sequenceTools)
 * **EigenstratDatabaseTools** Thiseas C. Lamnidis (Unpublished). Download: [https://github.com/TCLamnidis/EigenStratDatabaseTools.git](https://github.com/TCLamnidis/EigenStratDatabaseTools.git)
+* **mapDamage2** Jónsson, H., et al. 2013. mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics, 29(13), 1682–1684. [https://doi.org/10.1093/bioinformatics/btt193](https://doi.org/10.1093/bioinformatics/btt193)

 ## Data References
```
bin/scrape_software_versions.py

Lines changed: 4 additions & 2 deletions

```diff
@@ -35,7 +35,8 @@
     'VCF2genome':['v_vcf2genome.txt', r"VCF2Genome \(v. ([0-9].[0-9]+) "],
     'endorS.py':['v_endorSpy.txt', r"endorS.py (\S+)"],
     'kraken':['v_kraken.txt', r"Kraken version (\S+)"],
-    'eigenstrat_snp_coverage':['v_eigenstrat_snp_coverage.txt',r"(\S+)"]
+    'eigenstrat_snp_coverage':['v_eigenstrat_snp_coverage.txt',r"(\S+)"],
+    'mapDamage2':['v_mapdamage.txt',r"(\S+)"],
 }

 results = OrderedDict()
@@ -55,7 +56,7 @@
 results['Qualimap'] = '<span style="color:#999999;\">N/A</span>'
 results['Preseq'] = '<span style="color:#999999;\">N/A</span>'
 results['GATK HaplotypeCaller'] = '<span style="color:#999999;\">N/A</span>'
-#results['GATK UnifiedGenotyper'] = '<span style="color:#999999;\">N/A</span>'
+results['GATK UnifiedGenotyper'] = '<span style="color:#999999;\">N/A</span>'
 results['freebayes'] = '<span style="color:#999999;\">N/A</span>'
 results['sequenceTools'] = '<span style="color:#999999;\">N/A</span>'
 results['VCF2genome'] = '<span style="color:#999999;\">N/A</span>'
@@ -71,6 +72,7 @@
 results['kraken'] = '<span style="color:#999999;\">N/A</span>'
 results['maltextract'] = '<span style="color:#999999;\">N/A</span>'
 results['eigenstrat_snp_coverage'] = '<span style="color:#999999;\">N/A</span>'
+results['mapDamage2'] = '<span style="color:#999999;\">N/A</span>'

 # Search each file using its regex
 for k, v in regexes.items():
```
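The diff above extends the script's `regexes` table, which maps each tool to the version file it writes and a regex that captures the version string. A minimal, self-contained sketch of how such a table is applied — the file contents below are invented placeholders, not real pipeline output:

```python
import re
from collections import OrderedDict

# Tool -> [version file name, capture regex], mirroring two entries from the diff.
regexes = {
    'kraken': ['v_kraken.txt', r"Kraken version (\S+)"],
    'mapDamage2': ['v_mapdamage.txt', r"(\S+)"],
}

# Placeholder file contents for illustration only (the real script reads
# version files written by the pipeline's processes).
fake_files = {
    'v_kraken.txt': "Kraken version 2.1.1",
    'v_mapdamage.txt': "2.2.0",
}

# Default every tool to N/A, then overwrite when the regex matches --
# the same pattern the script uses for its MultiQC version table.
results = OrderedDict((tool, 'N/A') for tool in regexes)
for tool, (fname, pattern) in regexes.items():
    match = re.search(pattern, fake_files.get(fname, ''))
    if match:
        results[tool] = match.group(1)

print(dict(results))  # {'kraken': '2.1.1', 'mapDamage2': '2.2.0'}
```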

conf/test_resources.config

Lines changed: 4 additions & 0 deletions

```diff
@@ -51,4 +51,8 @@ process {
     time = { check_max( 10.m * task.attempt, 'time' ) }
   }

+  withName:'mapdamage_rescaling'{
+    time = { check_max( 20.m * task.attempt, 'time' ) }
+  }
+
 }
```
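The `check_max( 20.m * task.attempt, 'time' )` closure above scales the time request with each retry attempt, while `check_max` clamps it to a pipeline-wide cap. A rough Python re-creation of that behaviour — the 48 h cap is an assumed default for illustration, not a value taken from this commit:

```python
# Assumed pipeline-wide cap (nf-core pipelines expose this via --max_time);
# expressed in minutes here for simplicity.
MAX_TIME_MIN = 48 * 60

def check_max(requested_min, cap_min=MAX_TIME_MIN):
    """Clamp a per-task resource request to the configured maximum."""
    return min(requested_min, cap_min)

def rescaling_time(attempt):
    """20.m * task.attempt, as in the mapdamage_rescaling block above."""
    return check_max(20 * attempt)

# Each automatic retry asks for more time, never exceeding the cap.
print([rescaling_time(a) for a in (1, 2, 3)])  # [20, 40, 60]
```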

docs/output.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -658,6 +658,7 @@ Each module has its own output directory which sit alongside the `MultiQC/` dir
 - `damageprofiler/` - this contains sample specific directories containing raw statistics and damage plots from DamageProfiler. The `.pdf` files can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files.
 - `pmdtools/` - this contains raw output statistics of pmdtools (estimates of frequencies of substitutions), and BAM files which have been filtered to remove reads that do not have a Post-mortem damage (PMD) score of `--pmdtools_threshold`.
 - `trimmed_bam/` - this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_half_udg_left`, `--bamutils_clip_half_udg_right`, `--bamutils_clip_none_udg_left`, and `--bamutils_clip_none_udg_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read).
+- `damage_rescaling/` - this contains rescaled BAM files from mapDamage2. These BAM files have damage probabilistically removed via a Bayesian model, and can be used for downstream genotyping.
 - `genotyping/` - this contains all the (gzipped) genotyping files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have files corresponding to each of your deduplicated BAM files (except pileupcaller), or any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). If `--gatk_ug_keep_realign_bam` supplied, this may also contain BAM files from InDel realignment when using GATK 3 and UnifiedGenotyping for variant calling. When pileupcaller is used to create eigenstrat genotypes, this directory also contains eigenstrat SNP coverage statistics.
 - `multivcfanalyzer/` - this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files.
 - `sex_determination/` - this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the sample name, the number of autosomal SNPs, number of SNPs on the X/Y chromosome, the number of reads mapping to the autosomes, the number of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per file. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.
```

docs/usage.md

Lines changed: 0 additions & 228 deletions

````diff
@@ -391,203 +391,6 @@ hard drive footprint of the run, so be sure to do this!

 ## Troubleshooting and FAQs

-### My pipeline update doesn't seem to do anything
-
-To download a new version of a pipeline, you can use the following, replacing
-`<VERSION>` with the corresponding version.
-
-```bash
-nextflow pull nf-core/eager -r <VERSION>
-```
-
-However, in very rare cases, minor fixes to a version will be pushed out without
-a version number bump. This can confuse Nextflow slightly, as it thinks you
-already have the 'broken' version from your original pipeline download.
-
-If you don't see any changes from the fixed version when running the pipeline,
-you can try removing your Nextflow eager cache, typically stored in your home
-directory, with
-
-```bash
-rm -r ~/.nextflow/assets/nf-core/eager
-```
-
-and re-pull the pipeline with the command above. This will install a fresh
-copy of the version with the fixes.
-
-### Input files not found
-
-When using the [direct input](#direct-input-method) method: if no file, only one
-input file, or only 'read one' and not 'read two' is picked up, then something is
-likely wrong with your input file declaration ([`--input`](#--input)):
-
-1. The path must be enclosed in quotes (`'` or `"`)
-2. The path must have at least one `*` wildcard character, even if you
-   are only running one paired-end sample.
-3. When using the pipeline with paired-end data, the path must use `{1,2}` or
-   `{R1,R2}` notation to specify read pairs.
-4. If you are running single-end data, make sure to specify `--single_end`
-
-**Important**: The pipeline can't take a list of multiple input files when using
-the direct input method - it takes a 'glob' expression. If your input files are
-scattered in different paths, then we recommend that you generate a directory
-with symlinked files. If running in paired-end mode, please make sure that your
-files are sensibly named so that they can be properly paired. See the previous
-point.
-
-If the pipeline can't find your files, then you will get the following error
-
-```bash
-ERROR ~ Cannot find any reads matching: *{1,2}.fastq.gz
-```
-
-If your sample name is "messy", then you have to be very particular with your
-glob specification. A file name like `L1-1-D-2h_S1_L002_R1_001.fastq.gz` can be
-difficult enough for a human to read. Specifying `*{1,2}*.gz` won't give
-you what you want, whilst `*{R1,R2}*.gz` (i.e. the addition of the `R`s) will.
-
-If using the [TSV input](#tsv-input-method) method, this likely means there is a
-mistake or typo in the path in a given column. Often this is a trailing space at
-the end of the path.
-
-### I am only getting output for a single sample although I specified multiple with wildcards
-
-You must specify paths to files in quotes, otherwise your shell will evaluate
-any wildcards (\*) rather than Nextflow.
-
-For example
-
-```bash
-nextflow run nf-core/eager --input /path/to/sample_*/*.fq.gz
-```
-
-would be evaluated by your shell as
-
-```bash
-nextflow run nf-core/eager --input /path/to/sample_1/sample_1.fq.gz /path/to/sample_2/sample_2.fq.gz /path/to/sample_3/sample_3.fq.gz
-```
-
-and Nextflow will only take the first path after `--input`, ignoring the others.
-
-On the other hand, encapsulating the path in quotes will allow Nextflow to
-evaluate the paths itself.
-
-```bash
-nextflow run nf-core/eager --input "/path/to/sample_*/*.fq.gz"
-```
-
-### The pipeline crashes almost immediately with an early pipeline step
-
-Sometimes a newly downloaded and set up nf-core/eager pipeline will encounter an
-issue where a run almost immediately crashes (e.g. at `fastqc`,
-`output_documentation` etc.), saying the tool could not be found or similar.
-
-#### I am running Docker
-
-You may have an outdated container. This happens more often when running on the
-`dev` branch of nf-core/eager, because Docker will _not_ update the container on
-each new commit, and thus may not have the new tools called within the pipeline code.
-
-To fix this, just re-pull the nf-core/eager Docker container manually with:
-
-```bash
-docker pull nfcore/eager:dev
-```
-
-#### I am running Singularity
-
-If you're running Singularity, it could be that Nextflow cannot access your
-Singularity image properly - often due to missing bind paths.
-
-See
-[here](https://nf-co.re/usage/troubleshooting#cannot-find-input-files-when-using-singularity)
-for more information.
-
-### The pipeline has crashed with an error but Nextflow is still running
-
-If this happens, you can either wait for all other already-running jobs to
-safely finish, or, if Nextflow _still_ does not stop, press `ctrl + c` on your
-keyboard (or equivalent) to stop the Nextflow run.
-
-> :warning: if you do this and do not plan to fix the run, make sure to delete
-> the output folder. Otherwise you may end up with a lot of large intermediate
-> files being left behind! You can clean a Nextflow run of all intermediate files
-> with `nextflow clean -f -k` or delete the `work/` directory.
-
-### I get an exceeded job memory limit error
-
-While Nextflow tries to make your life easier by automatically retrying jobs
-that run out of memory with more resources (until your specified max limit),
-sometimes you may have such large data that you run out even after the default 3
-retries.
-
-To fix this you need to change the default memory requirements for the process
-that is breaking. We can do this by making a custom profile, which we then
-provide to the Nextflow run command.
-
-For example, let's say it's the `markduplicates` process that is running out of
-memory.
-
-First we need to check what default memory value we have. We can do this
-by going to the main [nf-core/eager code](https://github.com/nf-core/) and
-opening the `main.nf` file. We can then use your browser's find functionality
-for: `process markduplicates`.
-
-Once found, we then need to check the line called `label`. In this case the
-label is `mc_small` (for multi-core small).
-
-Next we need to go back to the main GitHub repository, and open
-`conf/base.config`. Again using our find functionality, we search for:
-`withLabel:'mc_small'`.
-
-We see that the `memory` is set to `4.GB` (`memory = { check_max( 4.GB *
-task.attempt, 'memory' )}`).
-
-Now back on your computer, we need to make a new file called
-`custom_resources.conf`. You should save it somewhere central so you can
-reuse it.
-
-> If you think this would be useful for multiple people in your lab/institute,
-> we highly recommend you make an institutional profile at
-> [nf-core/configs](https://github.com/nf-core/configs). This will simplify this
-> process in the future.
-
-Within this file, you will need to add the following:
-
-```txt
-profiles {
-  big_data {
-    process {
-      withName: markduplicates {
-        memory = 16.GB
-      }
-    }
-  }
-}
-```
-
-Here we have increased the default `4.GB` to `16.GB`.
-
-> Note that with this you will _not_ have the automatic retry mechanism. If
-> you want this, re-add the `check_max()` function on the `memory` line, and
-> add to the bottom of the entire file (outside the profiles block) the
-> block starting `def check_max(obj, type) {`, which is at the end of the
-> [nextflow.config file](https://github.com/nf-core/eager/blob/master/nextflow.config)
-
-Once saved, we can then modify your original Nextflow run command:
-
-```bash
-nextflow run nf-core/eager -r 2.2.0 -c /<path>/<to>/custom_resources.conf -profile big_data,<original>,<profiles> <...>
-```
-
-Here we have added `-c` to specify which file to use for the custom profiles,
-and then added the `big_data` profile to the original profiles you were using.
-
-:warning: it's important that `big_data` comes first, to ensure it overwrites any
-parameters set in the subsequent profiles!
-
 ### I get a file name collision error during merging

 When using TSV input, nf-core/eager will attempt to merge all `Lanes` of a
````
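The "messy filename" pitfall in the removed FAQ above can be illustrated with a short sketch. Python's `fnmatch` does not support `{a,b}` alternation, so each expanded alternative is matched in turn; the file name comes from the FAQ itself:

```python
from fnmatch import fnmatch

# A read-1 file with a "messy" sample name, as in the FAQ.
r1 = "L1-1-D-2h_S1_L002_R1_001.fastq.gz"

# '*{1,2}*.gz' expands to the two alternatives below -- and the R1 file
# matches BOTH, because '1' and '2' also occur elsewhere in the name
# (L1, 2h, L002), so read pairs cannot be distinguished.
print(fnmatch(r1, "*1*.gz"), fnmatch(r1, "*2*.gz"))  # True True

# '*{R1,R2}*.gz' expands to the alternatives below -- the R1 file now
# matches exactly one of them, so pairing works as intended.
print(fnmatch(r1, "*R1*.gz"), fnmatch(r1, "*R2*.gz"))  # True False
```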
````diff
@@ -608,37 +411,6 @@ they are unique (e.g. if one library was sequenced on Lane 8 of two HiSeq runs,
 specify lanes as 8 and 16 for each FASTQ file respectively). For library merging
 errors, you must modify your `Library_ID`s accordingly, to make them unique.

-### I specified a module and it didn't produce the expected output
-
-Possible options:
-
-1. Check if you have a typo in the parameter name. Nextflow _does not_
-   check for this
-2. Check that an upstream module was turned on (if a module requires the output
-   of a previous module, it will not be activated unless it receives the output)
-
-### I get an unable to acquire lock error
-
-Errors like the following
-
-```bash
-Unable to acquire lock on session with ID 84333844-66e3-4846-a664-b446d070f775
-```
-
-normally suggest a previous Nextflow run (on the same folder) was not cleanly
-killed by a user (e.g. using `ctrl + z` to hard-kill a crashed run).
-
-To fix this, you must clean the entirety of the output directory (including
-output files), e.g. with `rm -r <output_dir>/* <output_dir>/.*`, and re-run
-from scratch.
-
-`ctrl + z` is **not** a recommended way of killing a Nextflow job. Runs that take
-a long time to fail are often still running because other job submissions are
-still running. Nextflow will normally wait for those processes to complete
-before cleanly shutting down the run (to allow rerunning of a run with
-`-resume`). `ctrl + c` is much safer as it will tell Nextflow to stop earlier
-but cleanly.
-
 ## Tutorials

 ### Tutorial - How to investigate a failed run
````
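The wildcard-quoting pitfall from the removed FAQ can likewise be simulated: unquoted, the shell expands the glob before Nextflow starts, so the argument parser sees many paths but uses only the first one after `--input`. This sketch models the argument list only, not Nextflow itself, and the paths are invented:

```python
import shlex

# What Nextflow receives when the shell has already expanded an unquoted glob:
unquoted = shlex.split(
    "nextflow run nf-core/eager --input "
    "/path/to/sample_1/s1.fq.gz /path/to/sample_2/s2.fq.gz"
)
i = unquoted.index("--input")
print(unquoted[i + 1])  # /path/to/sample_1/s1.fq.gz -- the rest are ignored

# Quoted, the glob survives the shell untouched, so Nextflow can
# expand it itself and see every matching file:
quoted = shlex.split('nextflow run nf-core/eager --input "/path/to/sample_*/*.fq.gz"')
print(quoted[i + 1])  # /path/to/sample_*/*.fq.gz
```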

environment.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -47,3 +47,4 @@ dependencies:
   - conda-forge::xopen=0.9.0
   - bioconda::bowtie2=2.4.1
   - bioconda::eigenstratdatabasetools=1.0.2
+  - bioconda::mapdamage2=2.2.0
```
