Merge pull request #142 from maxibor/dev

apeltzer · web-flow · commit 291990f6ba06 · 2019-03-04T22:29:51.000+01:00
Add optional merging and trimming
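
The three skip flags touched by this change relate to each other as described in the `docs/usage.md` hunk: `--skip_adapterremoval` is documented as equivalent to setting both `--skip_collapse` and `--skip_trim`. The following is only a minimal sketch of that relationship. The function name `resolve_steps` is hypothetical and is not taken from `main.nf`, whose changes are not shown in this diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nf-core/eager): reports which of the two
# AdapterRemoval sub-steps would still run for a given set of skip flags.
resolve_steps() {
  local collapse=yes
  local trim=yes
  local arg
  for arg in "$@"; do
    case "$arg" in
      --skip_collapse)       collapse=no ;;
      --skip_trim)           trim=no ;;
      # Documented as equivalent to passing both flags above.
      --skip_adapterremoval) collapse=no; trim=no ;;
    esac
  done
  echo "collapse=${collapse} trim=${trim}"
}

resolve_steps --pairedEnd --skip_collapse
resolve_steps --pairedEnd --skip_trim
resolve_steps --pairedEnd --skip_adapterremoval
```

Flags that do not affect these two steps (such as `--pairedEnd`) are ignored by the sketch, so it can be fed the same argument lists used in the `.travis.yml` test runs.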
diff --git a/.travis.yml b/.travis.yml
@@ -40,6 +40,12 @@ script:
   - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --pairedEnd --saveReference
   # Run the basic pipeline with single end data (pretending it's single end actually)
   - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --singleEnd --bwa_index results/reference_genome/bwa_index/bwa_index/
+  # Run the basic pipeline with paired end data without collapsing
+  - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --pairedEnd --skip_collapse --saveReference
+  # Run the basic pipeline with paired end data without trimming
+  - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --pairedEnd --skip_trim --saveReference
+  # Run the basic pipeline with paired end data without adapterRemoval
+  - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --pairedEnd --skip_adapterremoval --saveReference
   # Run the same pipeline testing optional step: fastp, complexity 
   - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --pairedEnd --complexity_filter --bwa_index results/reference_genome/bwa_index/bwa_index/
   # Test BAM Trimming
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,17 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ## [Unpublished / Dev Branch]
 
+### `Added`
+
+* [#152](https://github.com/nf-core/eager/pull/152) - Clarified `--complexity_filter` flag to be specifically for poly G trimming.
+* [#155](https://github.com/nf-core/eager/pull/155) - Added [Dedup log to output folders](https://github.com/nf-core/eager/issues/154)
+
+### `Fixed`
+
+* [#151](https://github.com/nf-core/eager/pull/151) - Fixed [post-deduplication step errors](https://github.com/nf-core/eager/issues/128)
+* [#147](https://github.com/nf-core/eager/pull/147) - Fix Samtools Index for [large references](https://github.com/nf-core/eager/issues/146)
+* [#145](https://github.com/nf-core/eager/pull/145) - Added Picard Memory Handling [fix](https://github.com/nf-core/eager/issues/144)
+
 ## [2.0.5] - 2019-01-28
 
 ### `Added`
diff --git a/README.md b/README.md
@@ -45,20 +45,28 @@ Additional functionality contained by the pipeline currently includes:
 ## Quick Start
 
 1. Install [`nextflow`](docs/installation.md)
+
 2. Install one of [`docker`](https://docs.docker.com/engine/installation/), [`singularity`](https://www.sylabs.io/guides/3.0/user-guide/) or [`conda`](https://conda.io/miniconda.html)
+
 3. Download the EAGER pipeline
 
 ```bash
 nextflow pull nf-core/eager
 ```
 
-4. Set up your job with default parameters
+4. Test the pipeline using the provided test data
 
 ```bash
-nextflow run nf-core -profile <docker/singularity/conda> --reads'*_R{1,2}.fastq.gz' --fasta '<REFERENCE>.fasta'
+nextflow run nf-core/eager -profile <docker/singularity/conda>,test --pairedEnd
 ```
 
-5. See the overview of the run with under `<OUTPUT_DIR>/MultiQC/multiqc_report.html`
+5. Start running your own ancient DNA analysis!
+
+```bash
+nextflow run nf-core/eager -profile <docker/singularity/conda> --reads '*_R{1,2}.fastq.gz' --fasta '<REFERENCE>.fasta'
+```
+
+NB. You can see an overview of the run in the MultiQC report located at `<OUTPUT_DIR>/MultiQC/multiqc_report.html`
 
 Modifications to the default pipeline are easily made using various options
 as described in the documentation.
@@ -84,6 +92,18 @@ James Fellows Yates, Raphael Eisenhofer and Judith Neukamm. If you want to
 contribute, please open an issue and ask to be added to the project - happy to 
 do so and everyone is welcome to contribute here!
 
+## Contributors
+
+- [James A. Fellows-Yates](https://github.com/jfy133)
+- [Stephen Clayton](https://github.com/sc13-bioinf)
+- [Judith Neukamm](https://github.com/JudithNeukamm)
+- [Raphael Eisenhofer](https://github.com/EisenRa)
+- [Maxime Garcia](https://github.com/MaxUlysse)
+- [Luc Venturini](https://github.com/lucventurini)
+- [Hester van Schalkwyk](https://github.com/hesterjvs)
+
+If you've contributed and are missing from this list, please let me know and I'll add you.
+
 ## Tool References
 
 * **EAGER v1**, CircularMapper, DeDup* Peltzer, A., Jäger, G., Herbig, A., Seitz, A., Kniep, C., Krause, J., & Nieselt, K. (2016). EAGER: efficient ancient genome reconstruction. Genome Biology, 17(1), 1–14. [https://doi.org/10.1186/s13059-016-0918-z](https://doi.org/10.1186/s13059-016-0918-z)  Download: [https://github.com/apeltzer/EAGER-GUI](https://github.com/apeltzer/EAGER-GUI) and [https://github.com/apeltzer/EAGER-CLI](https://github.com/apeltzer/EAGER-CLI)
diff --git a/conf/base.config b/conf/base.config
@@ -31,7 +31,10 @@ process {
   withName:convertBam {
     cpus = { check_max(8 * task.attempt, 'cpus') }
   }
-  
+  withName:makeSeqDict {
+    memory = { check_max( 16.GB * task.attempt, 'memory' ) }
+  }
+
   withName:bwa {
     memory = { check_max( 16.GB * task.attempt, 'memory' ) }
     cpus = { check_max(8 * task.attempt, 'cpus') }
diff --git a/conf/multiqc_config.yaml b/conf/multiqc_config.yaml
@@ -9,6 +9,7 @@ top_modules:
             - '*_fastqc.zip'
         path_filters_exclude:
              - '*.combined.prefixed_fastqc.zip'
+     - 'fastp'
      - 'adapterRemoval'
      - 'fastqc':
          name: 'FastQC (post-AdapterRemoval)'
diff --git a/docs/usage.md b/docs/usage.md
@@ -170,6 +170,10 @@ If you prefer, you can specify the full path to your reference genome when you r
 ```
 > If you don't specify appropriate `--bwa_index`, `--fasta_index` parameters, the pipeline will create these indices for you automatically. Note, that saving these for later has to be turned on using `--saveReference`. You may also specify the path to a gzipped (`*.gz` file extension) FastA as reference genome - this will be uncompressed by the pipeline automatically for you. Note that other file extensions such as `.fna`, `.fa` are also supported but will be renamed to `.fasta` automatically by the pipeline.
 
+### `--large_ref`
+
+Set this parameter for large reference genomes. If your reference genome is larger than 3.5GB, the `samtools index` calls in the pipeline need to generate `CSI` indices instead of `BAI` indices to accommodate the size of the reference genome. This parameter is not required for smaller references (including a human `hg19` or `grch37`/`grch38` reference), but `>4GB` genomes have been shown to need `CSI` indices.
+
 ### `--genome` (using iGenomes)
 
 The pipeline config files come bundled with paths to the illumina iGenomes reference index files. If running with docker or AWS, the configuration is set up to use the [AWS-iGenomes](https://ewels.github.io/AWS-iGenomes/) resource.
@@ -237,7 +241,7 @@ Use to set a top-limit for the default time requirement for each process.
 Should be a string in the format integer-unit. eg. `--max_time '2.h'`. If not specified, will be taken from the configuration in the `-profile` flag.
 
 ### `--max_cpus`
-Use to set a top-limit for the default CPU requirement for each process.
+Use to set a top-limit for the default CPU requirement for each **process**. This is not the maximum number of CPUs that can be used for the whole pipeline, but the maximum number of CPUs each program can use per program submission (known as a process). Do not set this higher than what your workstation or computing node can provide. If you're unsure, ask your local IT administrator for details on compute node capabilities!
 Should be an integer, e.g. `--max_cpus 1`. If not specified, will be taken from the configuration in the `-profile` flag.
 
 ### `--email`
@@ -279,12 +283,17 @@ This part of the documentation contains a list of user-adjustable parameters in
 
 ## Step skipping parameters
 
-Some of the steps in the pipeline can be executed optionally. If you specify specific steps to be skipped, there won't be any output related to these modules. 
+Some of the steps in the pipeline can be executed optionally. If you specify specific steps to be skipped, there won't be any output related to these modules.
 
 ### `--skip_preseq`
 
 Turns off the computation of library complexity estimation.  
 
+### `--skip_adapterremoval`
+
+Turns off adaptor trimming and paired-end read merging.
+Equivalent to setting both `--skip_collapse` and `--skip_trim`.
+
 ### `--skip_damage_calculation`
 
 Turns off the DamageProfiler module to compute DNA damage profiles. 
@@ -299,7 +308,7 @@ Turns off duplicate removal methods DeDup and MarkDuplicates respectively. No du
 
 ## Complexity Filtering Options
 
-### `--complexity_filter`
+### `--complexity_filter_poly_g`
 
 Performs a poly-G tail removal step in the beginning of the pipeline, if turned on. This can be useful for trimming poly-G tails from short-fragments sequenced on two-colour Illumina chemistry such as NextSeqs (where no-fluorescence is read as a G on two-colour chemistry), which can inflate reported GC content values.
 
@@ -329,6 +338,24 @@ Defines the minimum read quality per base that is required for a base to be kept
 ### `--clip_min_adap_overlap` 1
 Sets the minimum overlap between two reads when read merging is performed. Default is set to `1` base overlap.
 
+### `--skip_collapse`
+
+Turns off the paired-end read merging. 
+
+For example:
+```bash
+--pairedEnd --skip_collapse --reads '*.fastq'
+```
+
+### `--skip_trim`
+
+Turns off the adaptor and quality trimming.
+
+For example:
+```bash
+--pairedEnd --skip_trim --reads '*.fastq'
+```
+
 ## Read Mapping Parameters
 
 ## BWA (default)
diff --git a/main.nf b/main.nf
diff --git a/nextflow.config b/nextflow.config