Commit b760924

Merge pull request #642 from jfy133/damage-scaling

Add Damage rescaling functionality

2 parents: 7ead9a0 + fff0364

11 files changed: 107 additions & 244 deletions

.github/workflows/ci.yml

Lines changed: 3 additions & 0 deletions

```diff
@@ -186,3 +186,6 @@ jobs:
       - name: MTNUCRATIO Run basic pipeline with bam input profile, but don't convert BAM, skip everything but nmtnucratio
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --skip_preseq --skip_damage_calculation --run_mtnucratio
+      - name: RESCALING Run basic pipeline with mapDamage rescaling of BAM files. Note this will be slow
+        run: |
+          nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_mapdamage_rescaling --run_genotyping --genotyping_tool hc --genotyping_source 'rescaled'
```

CHANGELOG.md

Lines changed: 5 additions & 2 deletions

```diff
@@ -7,11 +7,14 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

 ### `Added`

+- [#583](https://github.com/nf-core/eager/issues/583) - mapDamage2 rescaling of BAM files to remove damage
+
 ### `Fixed`

 - Removed leftover old DockerHub push CI commands.
-- [#627](https://github.com/nf-core/eager/issues/627) Added de Barros Damgaard citation to README
-- [#630](https://github.com/nf-core/eager/pull/630) Better handling of Qualimap memory requirements and error strategy.
+- [#627](https://github.com/nf-core/eager/issues/627) - Added de Barros Damgaard citation to README
+- [#630](https://github.com/nf-core/eager/pull/630) - Better handling of Qualimap memory requirements and error strategy.
+- Fixed some incomplete schema options to ensure users supply valid input values

 ### `Dependencies`

```
README.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -236,6 +236,7 @@ In addition, references of tools and data used in this pipeline are as follows:
 * **Bowtie2** Langmead, B. and Salzberg, S. L. 2012 Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), p. 357–359. doi: [10.1038/nmeth.1923](https://dx.doi.org/10.1038/nmeth.1923).
 * **sequenceTools** Stephan Schiffels (Unpublished). Download: [https://github.com/stschiff/sequenceTools](https://github.com/stschiff/sequenceTools)
 * **EigenstratDatabaseTools** Thiseas C. Lamnidis (Unpublished). Download: [https://github.com/TCLamnidis/EigenStratDatabaseTools.git](https://github.com/TCLamnidis/EigenStratDatabaseTools.git)
+* **mapDamage2** Jónsson, H., et al. 2013. mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics, 29(13), 1682–1684. [https://doi.org/10.1093/bioinformatics/btt193](https://doi.org/10.1093/bioinformatics/btt193)

 ## Data References
```
bin/scrape_software_versions.py

Lines changed: 4 additions & 2 deletions

```diff
@@ -35,7 +35,8 @@
     'VCF2genome':['v_vcf2genome.txt', r"VCF2Genome \(v. ([0-9].[0-9]+) "],
     'endorS.py':['v_endorSpy.txt', r"endorS.py (\S+)"],
     'kraken':['v_kraken.txt', r"Kraken version (\S+)"],
-    'eigenstrat_snp_coverage':['v_eigenstrat_snp_coverage.txt',r"(\S+)"]
+    'eigenstrat_snp_coverage':['v_eigenstrat_snp_coverage.txt',r"(\S+)"],
+    'mapDamage2':['v_mapdamage.txt',r"(\S+)"],
 }

 results = OrderedDict()
@@ -55,7 +56,7 @@
 results['Qualimap'] = '<span style="color:#999999;\">N/A</span>'
 results['Preseq'] = '<span style="color:#999999;\">N/A</span>'
 results['GATK HaplotypeCaller'] = '<span style="color:#999999;\">N/A</span>'
-#results['GATK UnifiedGenotyper'] = '<span style="color:#999999;\">N/A</span>'
+results['GATK UnifiedGenotyper'] = '<span style="color:#999999;\">N/A</span>'
 results['freebayes'] = '<span style="color:#999999;\">N/A</span>'
 results['sequenceTools'] = '<span style="color:#999999;\">N/A</span>'
 results['VCF2genome'] = '<span style="color:#999999;\">N/A</span>'
@@ -71,6 +72,7 @@
 results['kraken'] = '<span style="color:#999999;\">N/A</span>'
 results['maltextract'] = '<span style="color:#999999;\">N/A</span>'
 results['eigenstrat_snp_coverage'] = '<span style="color:#999999;\">N/A</span>'
+results['mapDamage2'] = '<span style="color:#999999;\">N/A</span>'

 # Search each file using its regex
 for k, v in regexes.items():
```
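The diff above extends the script's `regexes` table, which maps each tool to the version file it writes and a regex that captures the version string. A minimal, self-contained sketch of how such a table is applied — the file contents below are invented placeholders, not real pipeline output:

```python
import re
from collections import OrderedDict

# Tool -> [version file name, capture regex], mirroring two entries from the diff.
regexes = {
    'kraken': ['v_kraken.txt', r"Kraken version (\S+)"],
    'mapDamage2': ['v_mapdamage.txt', r"(\S+)"],
}

# Placeholder file contents for illustration only (the real script reads
# version files written by the pipeline's processes).
fake_files = {
    'v_kraken.txt': "Kraken version 2.1.1",
    'v_mapdamage.txt': "2.2.0",
}

# Default every tool to N/A, then overwrite when the regex matches --
# the same pattern the script uses for its MultiQC version table.
results = OrderedDict((tool, 'N/A') for tool in regexes)
for tool, (fname, pattern) in regexes.items():
    match = re.search(pattern, fake_files.get(fname, ''))
    if match:
        results[tool] = match.group(1)

print(dict(results))  # {'kraken': '2.1.1', 'mapDamage2': '2.2.0'}
```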

conf/test_resources.config

Lines changed: 4 additions & 0 deletions

```diff
@@ -51,4 +51,8 @@ process {
     time = { check_max( 10.m * task.attempt, 'time' ) }
   }

+  withName:'mapdamage_rescaling'{
+    time = { check_max( 20.m * task.attempt, 'time' ) }
+  }
+
 }
```
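The `check_max( 20.m * task.attempt, 'time' )` closure above scales the time request with each retry attempt, while `check_max` clamps it to a pipeline-wide cap. A rough Python re-creation of that behaviour — the 48 h cap is an assumed default for illustration, not a value taken from this commit:

```python
# Assumed pipeline-wide cap (nf-core pipelines expose this via --max_time);
# expressed in minutes here for simplicity.
MAX_TIME_MIN = 48 * 60

def check_max(requested_min, cap_min=MAX_TIME_MIN):
    """Clamp a per-task resource request to the configured maximum."""
    return min(requested_min, cap_min)

def rescaling_time(attempt):
    """20.m * task.attempt, as in the mapdamage_rescaling block above."""
    return check_max(20 * attempt)

# Each automatic retry asks for more time, never exceeding the cap.
print([rescaling_time(a) for a in (1, 2, 3)])  # [20, 40, 60]
```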

docs/output.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -658,6 +658,7 @@ Each module has its own output directory which sit alongside the `MultiQC/` dir
 - `damageprofiler/` - this contains sample specific directories containing raw statistics and damage plots from DamageProfiler. The `.pdf` files can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files.
 - `pmdtools/` - this contains raw output statistics of pmdtools (estimates of frequencies of substitutions), and BAM files which have been filtered to remove reads that do not have a Post-mortem damage (PMD) score of `--pmdtools_threshold`.
 - `trimmed_bam/` - this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_half_udg_left`, `--bamutils_clip_half_udg_right`, `--bamutils_clip_none_udg_left`, and `--bamutils_clip_none_udg_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read).
+- `damage_rescaling/` - this contains rescaled BAM files from mapDamage2. These BAM files have damage probabilistically removed via a Bayesian model, and can be used for downstream genotyping.
 - `genotyping/` - this contains all the (gzipped) genotyping files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have files corresponding to each of your deduplicated BAM files (except pileupcaller), or any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). If `--gatk_ug_keep_realign_bam` supplied, this may also contain BAM files from InDel realignment when using GATK 3 and UnifiedGenotyping for variant calling. When pileupcaller is used to create eigenstrat genotypes, this directory also contains eigenstrat SNP coverage statistics.
 - `multivcfanalyzer/` - this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files.
 - `sex_determination/` - this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the sample name, the number of autosomal SNPs, number of SNPs on the X/Y chromosome, the number of reads mapping to the autosomes, the number of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per file. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.
```

docs/usage.md

Lines changed: 0 additions & 228 deletions

````diff
@@ -391,203 +391,6 @@ hard drive footprint of the run, so be sure to do this!

 ## Troubleshooting and FAQs

-### My pipeline update doesn't seem to do anything
-
-To download a new version of a pipeline, you can use the following, replacing
-`<VERSION>` with the corresponding version.
-
-```bash
-nextflow pull nf-core/eager -r <VERSION>
-```
-
-However, in very rare cases, minor fixes to a version will be pushed out without
-a version number bump. This can confuse Nextflow slightly, as it thinks you
-already have the 'broken' version from your original pipeline download.
-
-If you don't see any changes from the fixed version when running the pipeline,
-you can try removing your Nextflow eager cache, typically stored in your home
-directory, with
-
-```bash
-rm -r ~/.nextflow/assets/nf-core/eager
-```
-
-and re-pull the pipeline with the command above. This will install a fresh
-copy of the version with the fixes.
-
-### Input files not found
-
-When using the [direct input](#direct-input-method) method: if no file, only one
-input file, or only 'read one' and not 'read two' is picked up, then something is
-likely wrong with your input file declaration ([`--input`](#--input)):
-
-1. The path must be enclosed in quotes (`'` or `"`)
-2. The path must have at least one `*` wildcard character, even if you
-   are only running one paired-end sample.
-3. When using the pipeline with paired-end data, the path must use `{1,2}` or
-   `{R1,R2}` notation to specify read pairs.
-4. If you are running single-end data, make sure to specify `--single_end`
-
-**Important**: The pipeline can't take a list of multiple input files when using
-the direct input method - it takes a 'glob' expression. If your input files are
-scattered in different paths, then we recommend that you generate a directory
-with symlinked files. If running in paired-end mode, please make sure that your
-files are sensibly named so that they can be properly paired. See the previous
-point.
-
-If the pipeline can't find your files, then you will get the following error
-
-```bash
-ERROR ~ Cannot find any reads matching: *{1,2}.fastq.gz
-```
-
-If your sample name is "messy", then you have to be very particular with your
-glob specification. A file name like `L1-1-D-2h_S1_L002_R1_001.fastq.gz` can be
-difficult enough for a human to read. Specifying `*{1,2}*.gz` won't give
-you what you want, whilst `*{R1,R2}*.gz` (i.e. the addition of the `R`s) will.
-
-If using the [TSV input](#tsv-input-method) method, this likely means there is a
-mistake or typo in the path in a given column. Often this is a trailing space at
-the end of the path.
-
-### I am only getting output for a single sample although I specified multiple with wildcards
-
-You must specify paths to files in quotes, otherwise your shell will evaluate
-any wildcards (\*) rather than Nextflow.
-
-For example
-
-```bash
-nextflow run nf-core/eager --input /path/to/sample_*/*.fq.gz
-```
-
-would be evaluated by your shell as
-
-```bash
-nextflow run nf-core/eager --input /path/to/sample_1/sample_1.fq.gz /path/to/sample_2/sample_2.fq.gz /path/to/sample_3/sample_3.fq.gz
-```
-
-and Nextflow will only take the first path after `--input`, ignoring the others.
-
-On the other hand, encapsulating the path in quotes will allow Nextflow to
-evaluate the paths itself.
-
-```bash
-nextflow run nf-core/eager --input "/path/to/sample_*/*.fq.gz"
-```
-
-### The pipeline crashes almost immediately with an early pipeline step
-
-Sometimes a newly downloaded and set up nf-core/eager pipeline will encounter an
-issue where a run almost immediately crashes (e.g. at `fastqc`,
-`output_documentation` etc.), saying the tool could not be found or similar.
-
-#### I am running Docker
-
-You may have an outdated container. This happens more often when running on the
-`dev` branch of nf-core/eager, because Docker will _not_ update the container on
-each new commit, and thus may not have the new tools called within the pipeline code.
-
-To fix this, just re-pull the nf-core/eager Docker container manually with:
-
-```bash
-docker pull nfcore/eager:dev
-```
-
-#### I am running Singularity
-
-If you're running Singularity, it could be that Nextflow cannot access your
-Singularity image properly - often due to missing bind paths.
-
-See
-[here](https://nf-co.re/usage/troubleshooting#cannot-find-input-files-when-using-singularity)
-for more information.
-
-### The pipeline has crashed with an error but Nextflow is still running
-
-If this happens, you can either wait for all other already-running jobs to
-safely finish, or, if Nextflow _still_ does not stop, press `ctrl + c` on your
-keyboard (or equivalent) to stop the Nextflow run.
-
-> :warning: if you do this and do not plan to fix the run, make sure to delete
-> the output folder. Otherwise you may end up with a lot of large intermediate
-> files being left behind! You can clean a Nextflow run of all intermediate files
-> with `nextflow clean -f -k` or delete the `work/` directory.
-
-### I get an exceeded job memory limit error
-
-While Nextflow tries to make your life easier by automatically retrying jobs
-that run out of memory with more resources (until your specified max limit),
-sometimes you may have such large data that you run out even after the default 3
-retries.
-
-To fix this you need to change the default memory requirements for the process
-that is breaking. We can do this by making a custom profile, which we then
-provide to the Nextflow run command.
-
-For example, let's say it's the `markduplicates` process that is running out of
-memory.
-
-First we need to check what default memory value we have. We can do this
-by going to the main [nf-core/eager code](https://github.com/nf-core/) and
-opening the `main.nf` file. We can then use your browser's find functionality
-for: `process markduplicates`.
-
-Once found, we then need to check the line called `label`. In this case the
-label is `mc_small` (for multi-core small).
-
-Next we need to go back to the main GitHub repository, and open
-`conf/base.config`. Again using our find functionality, we search for:
-`withLabel:'mc_small'`.
-
-We see that the `memory` is set to `4.GB` (`memory = { check_max( 4.GB *
-task.attempt, 'memory' )}`).
-
-Now back on your computer, we need to make a new file called
-`custom_resources.conf`. You should save it somewhere central so you can
-reuse it.
-
-> If you think this would be useful for multiple people in your lab/institute,
-> we highly recommend you make an institutional profile at
-> [nf-core/configs](https://github.com/nf-core/configs). This will simplify this
-> process in the future.
-
-Within this file, you will need to add the following:
-
-```txt
-profiles {
-  big_data {
-    process {
-      withName: markduplicates {
-        memory = 16.GB
-      }
-    }
-  }
-}
-```
-
-Here we have increased the default `4.GB` to `16.GB`.
-
-> Note that with this you will _not_ have the automatic retry mechanism. If
-> you want this, re-add the `check_max()` function on the `memory` line, and
-> add to the bottom of the entire file (outside the profiles block) the
-> block starting `def check_max(obj, type) {`, which is at the end of the
-> [nextflow.config file](https://github.com/nf-core/eager/blob/master/nextflow.config)
-
-Once saved, we can then modify your original Nextflow run command:
-
-```bash
-nextflow run nf-core/eager -r 2.2.0 -c /<path>/<to>/custom_resources.conf -profile big_data,<original>,<profiles> <...>
-```
-
-Here we have added `-c` to specify which file to use for the custom profiles,
-and then added the `big_data` profile to the original profiles you were using.
-
-:warning: it's important that `big_data` comes first, to ensure it overwrites any
-parameters set in the subsequent profiles!
-
 ### I get a file name collision error during merging

 When using TSV input, nf-core/eager will attempt to merge all `Lanes` of a
````
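The "messy filename" pitfall in the removed FAQ above can be illustrated with a short sketch. Python's `fnmatch` does not support `{a,b}` alternation, so each expanded alternative is matched in turn; the file name comes from the FAQ itself:

```python
from fnmatch import fnmatch

# A read-1 file with a "messy" sample name, as in the FAQ.
r1 = "L1-1-D-2h_S1_L002_R1_001.fastq.gz"

# '*{1,2}*.gz' expands to the two alternatives below -- and the R1 file
# matches BOTH, because '1' and '2' also occur elsewhere in the name
# (L1, 2h, L002), so read pairs cannot be distinguished.
print(fnmatch(r1, "*1*.gz"), fnmatch(r1, "*2*.gz"))  # True True

# '*{R1,R2}*.gz' expands to the alternatives below -- the R1 file now
# matches exactly one of them, so pairing works as intended.
print(fnmatch(r1, "*R1*.gz"), fnmatch(r1, "*R2*.gz"))  # True False
```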
````diff
@@ -608,37 +411,6 @@ they are unique (e.g. if one library was sequenced on Lane 8 of two HiSeq runs,
 specify lanes as 8 and 16 for each FASTQ file respectively). For library merging
 errors, you must modify your `Library_ID`s accordingly, to make them unique.

-### I specified a module and it didn't produce the expected output
-
-Possible options:
-
-1. Check if you have a typo in the parameter name. Nextflow _does not_
-   check for this
-2. Check that an upstream module was turned on (if a module requires the output
-   of a previous module, it will not be activated unless it receives the output)
-
-### I get an unable to acquire lock error
-
-Errors like the following
-
-```bash
-Unable to acquire lock on session with ID 84333844-66e3-4846-a664-b446d070f775
-```
-
-normally suggest a previous Nextflow run (on the same folder) was not cleanly
-killed by a user (e.g. using `ctrl + z` to hard-kill a crashed run).
-
-To fix this, you must clean the entirety of the output directory (including
-output files), e.g. with `rm -r <output_dir>/* <output_dir>/.*`, and re-run
-from scratch.
-
-`ctrl + z` is **not** a recommended way of killing a Nextflow job. Runs that take
-a long time to fail are often still running because other job submissions are
-still running. Nextflow will normally wait for those processes to complete
-before cleanly shutting down the run (to allow rerunning of a run with
-`-resume`). `ctrl + c` is much safer as it will tell Nextflow to stop earlier
-but cleanly.
-
 ## Tutorials

 ### Tutorial - How to investigate a failed run
````
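The wildcard-quoting pitfall from the removed FAQ can likewise be simulated: unquoted, the shell expands the glob before Nextflow starts, so the argument parser sees many paths but uses only the first one after `--input`. This sketch models the argument list only, not Nextflow itself, and the paths are invented:

```python
import shlex

# What Nextflow receives when the shell has already expanded an unquoted glob:
unquoted = shlex.split(
    "nextflow run nf-core/eager --input "
    "/path/to/sample_1/s1.fq.gz /path/to/sample_2/s2.fq.gz"
)
i = unquoted.index("--input")
print(unquoted[i + 1])  # /path/to/sample_1/s1.fq.gz -- the rest are ignored

# Quoted, the glob survives the shell untouched, so Nextflow can
# expand it itself and see every matching file:
quoted = shlex.split('nextflow run nf-core/eager --input "/path/to/sample_*/*.fq.gz"')
print(quoted[i + 1])  # /path/to/sample_*/*.fq.gz
```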

environment.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -47,3 +47,4 @@ dependencies:
   - conda-forge::xopen=0.9.0
   - bioconda::bowtie2=2.4.1
   - bioconda::eigenstratdatabasetools=1.0.2
+  - bioconda::mapdamage2=2.2.0
```
