diff --git a/.github/.dockstore.yml b/.github/.dockstore.yml
index 030138a0c..191fabd22 100644
--- a/.github/.dockstore.yml
+++ b/.github/.dockstore.yml
@@ -3,3 +3,4 @@ version: 1.2
workflows:
- subclass: nfl
primaryDescriptorPath: /nextflow.config
+ publish: True
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
index 97f223d86..3e4a4cfa2 100644
--- a/.github/CONTRIBUTING.md
+++ b/.github/CONTRIBUTING.md
@@ -69,7 +69,7 @@ If you wish to contribute a new step, please use the following coding standards:
2. Write the process block (see below).
3. Define the output channel if needed (see below).
4. Add any new flags/options to `nextflow.config` with a default (see below).
-5. Add any new flags/options to `nextflow_schema.json` **with help text** (with `nf-core schema build .`)
+5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build .`).
6. Add any new flags/options to the help message (for integer/text parameters, print to help the corresponding `nextflow.config` parameter).
7. Add sanity checks for all relevant parameters.
8. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`.
@@ -87,7 +87,7 @@ Once there, use `nf-core schema build .` to add to `nextflow_schema.json`.
### Default processes resource requirements
-Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/%7B%7Bcookiecutter.name_noslash%7D%7D/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels.
+Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generically with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single-core process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels.
:warning: Note that in nf-core/eager we currently have our own custom process labels, so please check `base.config`!
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
index f00ef2e57..b461caca3 100644
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -57,7 +57,7 @@ Have you provided the following extra information/files:
## Container engine
-- Engine:
+- Engine:
- version:
- Image tag:
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
index eadff09eb..c7ca5c253 100644
--- a/.github/ISSUE_TEMPLATE/feature_request.md
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -1,6 +1,6 @@
---
name: Feature request
-about: Suggest an idea for the nf-core website
+about: Suggest an idea for the nf-core/eager pipeline
labels: enhancement
---
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 57a13ac3e..4d46a3ac7 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -15,9 +15,9 @@ Learn more about contributing: [CONTRIBUTING.md](https://github.com/nf-core/eage
- [ ] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- - [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py`
- - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md)
- - [ ] If necessary, also make a PR on the nf-core/eager _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository.
+ - [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py`
  - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md)?
+ - [ ] If necessary, also make a PR on the nf-core/eager _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository.
- [ ] Make sure your code lints (`nf-core lint .`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
diff --git a/.github/workflows/awsfulltest.yml b/.github/workflows/awsfulltest.yml
index 51475927c..4e03e75be 100644
--- a/.github/workflows/awsfulltest.yml
+++ b/.github/workflows/awsfulltest.yml
@@ -9,6 +9,16 @@ on:
types: [completed]
workflow_dispatch:
+
+env:
+ AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+ AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+ TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }}
+ AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }}
+ AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }}
+ AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
+
+
jobs:
run-awstest:
name: Run AWS full tests
@@ -26,13 +36,6 @@ jobs:
# Add full size test data (but still relatively small datasets for few samples)
# on the `test_full.config` test runs with only one set of parameters
# Then specify `-profile test_full` instead of `-profile test` on the AWS batch command
- env:
- AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
- AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
- TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }}
- AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }}
- AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }}
- AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
run: |
aws batch submit-job \
--region eu-west-1 \
diff --git a/.github/workflows/awstest.yml b/.github/workflows/awstest.yml
index 7ffc9c417..6e0a9538c 100644
--- a/.github/workflows/awstest.yml
+++ b/.github/workflows/awstest.yml
@@ -6,6 +6,16 @@ name: nf-core AWS test
on:
workflow_dispatch:
+
+env:
+ AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+ AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+ TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }}
+ AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }}
+ AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }}
+ AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
+
+
jobs:
run-awstest:
name: Run AWS tests
@@ -22,13 +32,6 @@ jobs:
- name: Start AWS batch job
# For example: adding multiple test runs with different parameters
# Remember that you can parallelise this by using strategy.matrix
- env:
- AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
- AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
- TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }}
- AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }}
- AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }}
- AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
run: |
aws batch submit-job \
--region eu-west-1 \
diff --git a/.github/workflows/branch.yml b/.github/workflows/branch.yml
index a08150144..909b52d6b 100644
--- a/.github/workflows/branch.yml
+++ b/.github/workflows/branch.yml
@@ -13,7 +13,7 @@ jobs:
- name: Check PRs
if: github.repository == 'nf-core/eager'
run: |
- { [[ ${{github.event.pull_request.head.repo.full_name}} == nf-core/eager ]] && [[ $GITHUB_HEAD_REF = "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]]
+ { [[ ${{ github.event.pull_request.head.repo.full_name }} == nf-core/eager ]] && [[ $GITHUB_HEAD_REF == "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]]
# If the above check failed, post a comment on the PR explaining the failure
@@ -23,13 +23,22 @@ jobs:
uses: mshick/add-pr-comment@v1
with:
message: |
+ ## This PR is against the `master` branch :x:
+
+ * Do not close this PR
+ * Click _Edit_ and change the `base` to `dev`
+ * This CI test will remain failed until you push a new commit
+
+ ---
+
Hi @${{ github.event.pull_request.user.login }},
- It looks like this pull-request is has been made against the ${{github.event.pull_request.head.repo.full_name}} `master` branch.
+ It looks like this pull-request has been made against the [${{ github.event.pull_request.head.repo.full_name }}](https://github.com/${{ github.event.pull_request.head.repo.full_name }}) `master` branch.
The `master` branch on nf-core repositories should always contain code from the latest release.
- Because of this, PRs to `master` are only allowed if they come from the ${{github.event.pull_request.head.repo.full_name}} `dev` branch.
+ Because of this, PRs to `master` are only allowed if they come from the [${{ github.event.pull_request.head.repo.full_name }}](https://github.com/${{ github.event.pull_request.head.repo.full_name }}) `dev` branch.
You do not need to close this PR, you can change the target branch to `dev` by clicking the _"Edit"_ button at the top of this page.
+ Note that even after this, the test will continue to show as failing until you push a new commit.
Thanks again for your contribution!
repo-token: ${{ secrets.GITHUB_TOKEN }}
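As a side note on the branch check in `branch.yml` above: the braces are what make the compound conditional work, since they group the `&&` pair so `||` applies to the whole expression. A minimal sketch of that logic (the `check_pr` helper and example repo names are illustrative, not part of the workflow):

```shell
# Sketch of the branch-protection rule: a PR to master is allowed if it
# comes from nf-core/eager's dev branch, OR from any repo's patch branch.
# The { ...; } group ensures && is evaluated before ||.
check_pr() {
  local head_repo=$1 head_ref=$2
  { [[ "$head_repo" == "nf-core/eager" ]] && [[ "$head_ref" == "dev" ]]; } || [[ "$head_ref" == "patch" ]]
}

check_pr "nf-core/eager" "dev" && echo "release PR: allowed"
check_pr "someuser/eager" "patch" && echo "patch PR from fork: allowed"
check_pr "someuser/eager" "dev" || echo "dev PR from fork: blocked"
```

Without the braces, `a && b || c` would still parse left-to-right in bash, but grouping makes the intent explicit and keeps the rule robust if more conditions are added.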
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 946c9caa1..a8bfa6ba1 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -20,7 +20,7 @@ jobs:
strategy:
matrix:
# Nextflow versions: check pipeline minimum and current latest
- nxf_ver: ['20.07.1', '']
+ nxf_ver: ['20.07.1', '21.03.0-edge']
steps:
- name: Check out pipeline code
uses: actions/checkout@v2
@@ -34,13 +34,13 @@ jobs:
- name: Build new docker image
if: env.MATCHED_FILES
- run: docker build --no-cache . -t nfcore/eager:2.3.2
+ run: docker build --no-cache . -t nfcore/eager:2.3.3
- name: Pull docker image
if: ${{ !env.MATCHED_FILES }}
run: |
docker pull nfcore/eager:dev
- docker tag nfcore/eager:dev nfcore/eager:2.3.2
+ docker tag nfcore/eager:dev nfcore/eager:2.3.3
- name: Install Nextflow
env:
@@ -125,7 +125,7 @@ jobs:
nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_bedtools_coverage --anno_file 'https://github.com/nf-core/test-datasets/raw/eager/reference/Mammoth/Mammoth_MT_Krause.gff3'
- name: GENOTYPING_HC Test running GATK HaplotypeCaller
run: |
- nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_fna,docker --run_genotyping --genotyping_tool 'hc' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_hc_emitrefconf 'BP_RESOLUTION'
+ nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_fna,docker --run_genotyping --genotyping_tool 'hc' --gatk_hc_out_mode 'EMIT_ALL_ACTIVE_SITES' --gatk_hc_emitrefconf 'BP_RESOLUTION'
- name: GENOTYPING_FB Test running FreeBayes
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --genotyping_tool 'freebayes'
@@ -146,13 +146,13 @@ jobs:
nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_pmdtools
- name: GENOTYPING_UG AND MULTIVCFANALYZER Test running GATK UnifiedGenotyper and MultiVCFAnalyzer, additional VCFS
run: |
- nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer --additional_vcf_files 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/vcf/JK2772_CATCAGTGAGTAGA_L008_R1_001.fastq.gz.tengrand.fq.combined.fq.mapped_rmdup.bam.unifiedgenotyper.vcf.gz' --write_allele_frequencies
+ nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer --additional_vcf_files 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/vcf/JK2772_CATCAGTGAGTAGA_L008_R1_001.fastq.gz.tengrand.fq.combined.fq.mapped_rmdup.bam.unifiedgenotyper.vcf.gz' --write_allele_frequencies
- name: COMPLEX LANE/LIBRARY MERGING Test running lane and library merging prior to GATK UnifiedGenotyper and running MultiVCFAnalyzer
run: |
- nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --run_genotyping --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer
+ nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --run_genotyping --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer
- name: GENOTYPING_UG ON TRIMMED BAM Test
run: |
- nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --run_trim_bam --genotyping_source 'trimmed' --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP'
+ nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --run_trim_bam --genotyping_source 'trimmed' --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP'
- name: BAM_INPUT Run the basic pipeline with the bam input profile, skip AdapterRemoval as no convertBam
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --skip_adapterremoval
diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml
index d99d4d751..fcde400ce 100644
--- a/.github/workflows/linting.yml
+++ b/.github/workflows/linting.yml
@@ -19,6 +19,34 @@ jobs:
run: npm install -g markdownlint-cli
- name: Run Markdownlint
run: markdownlint ${GITHUB_WORKSPACE} -c ${GITHUB_WORKSPACE}/.github/markdownlint.yml
+
+ # If the above check failed, post a comment on the PR explaining the failure
+ - name: Post PR comment
+ if: failure()
+ uses: mshick/add-pr-comment@v1
+ with:
+ message: |
+ ## Markdown linting is failing
+
+ To keep the code consistent with lots of contributors, we run automated code consistency checks.
+ To fix this CI test, please run:
+
+ * Install `markdownlint-cli`
+ * On Mac: `brew install markdownlint-cli`
+ * Everything else: [Install `npm`](https://www.npmjs.com/get-npm) then [install `markdownlint-cli`](https://www.npmjs.com/package/markdownlint-cli) (`npm install -g markdownlint-cli`)
+ * Fix the markdown errors
+ * Automatically: `markdownlint . --config .github/markdownlint.yml --fix`
+ * Manually resolve anything left from `markdownlint . --config .github/markdownlint.yml`
+
+ Once you push these changes the test should pass, and you can hide this comment :+1:
+
+ We highly recommend setting up markdownlint in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!
+
+ Thanks again for your contribution!
+ repo-token: ${{ secrets.GITHUB_TOKEN }}
+ allow-repeats: false
+
+
YAML:
runs-on: ubuntu-latest
steps:
@@ -29,7 +57,34 @@ jobs:
- name: Install yaml-lint
run: npm install -g yaml-lint
- name: Run yaml-lint
- run: yamllint $(find ${GITHUB_WORKSPACE} -type f -name "*.yml")
+ run: yamllint $(find ${GITHUB_WORKSPACE} -type f -name "*.yml" -o -name "*.yaml")
+
+ # If the above check failed, post a comment on the PR explaining the failure
+ - name: Post PR comment
+ if: failure()
+ uses: mshick/add-pr-comment@v1
+ with:
+ message: |
+ ## YAML linting is failing
+
+ To keep the code consistent with lots of contributors, we run automated code consistency checks.
+ To fix this CI test, please run:
+
+ * Install `yaml-lint`
+ * [Install `npm`](https://www.npmjs.com/get-npm) then [install `yaml-lint`](https://www.npmjs.com/package/yaml-lint) (`npm install -g yaml-lint`)
+ * Fix the YAML errors
+ * Run the test locally: `yamllint $(find . -type f -name "*.yml" -o -name "*.yaml")`
+ * Fix any reported errors in your YAML files
+
+ Once you push these changes the test should pass, and you can hide this comment :+1:
+
+ We highly recommend setting up yaml-lint in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!
+
+ Thanks again for your contribution!
+ repo-token: ${{ secrets.GITHUB_TOKEN }}
+ allow-repeats: false
+
+
nf-core:
runs-on: ubuntu-latest
steps:
@@ -48,6 +103,7 @@ jobs:
with:
python-version: '3.6'
architecture: 'x64'
+
- name: Install dependencies
run: |
python -m pip install --upgrade pip
@@ -68,7 +124,7 @@ jobs:
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
- name: linting-log-file
+ name: linting-logs
path: |
lint_log.txt
lint_results.md
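One subtlety worth noting about the extended `find` command in the YAML lint step (this is a standalone sketch, not pipeline code): in `find`, the implicit AND binds tighter than `-o`, so `-type f -name "*.yml" -o -name "*.yaml"` is parsed as `(-type f AND -name "*.yml") OR -name "*.yaml"`, meaning the `.yaml` branch is not restricted to regular files. Grouping with `\( \)` applies `-type f` to both patterns:

```shell
# Demonstration with throwaway files (names are illustrative):
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/dir.yaml"               # a *directory* ending in .yaml
touch "$demo_dir/a.yml" "$demo_dir/b.yaml"  # two regular files

# Grouped form: -type f applies to both -name tests,
# so only the two regular files match, not the dir.yaml directory.
find "$demo_dir" -type f \( -name "*.yml" -o -name "*.yaml" \)

# Ungrouped form: (-type f AND *.yml) OR *.yaml,
# so the dir.yaml directory also matches.
find "$demo_dir" -type f -name "*.yml" -o -name "*.yaml"
```

In practice the difference rarely bites for linting, but the grouped form is the safer pattern to copy.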
diff --git a/.nf-core-lint.yml b/.nf-core-lint.yml
new file mode 100644
index 000000000..496fea360
--- /dev/null
+++ b/.nf-core-lint.yml
@@ -0,0 +1,6 @@
+files_unchanged:
+ - assets/multiqc_config.yaml
+ - .github/CONTRIBUTING.md
+ - .github/ISSUE_TEMPLATE/bug_report.md
+ - docs/README.md
+
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 013927858..e289edece 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -3,6 +3,25 @@
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
+## v2.3.3 - 2021-01-06
+
+### `Added`
+
+- [#349](https://github.com/nf-core/eager/issues/349) - Added option enabling platypus formatted output of pmdtools misincorporation frequencies.
+
+### `Fixed`
+
+- [#719](https://github.com/nf-core/eager/pull/719) - Fix filename for bam output of `mapdamage_rescaling`
+- [#707](https://github.com/nf-core/eager/pull/707) - Fix typo in UnifiedGenotyper IndelRealigner command
+- Fixed some Java tools not following process memory specifications
+- Updated template to nf-core/tools 1.13.2
+- [#711](https://github.com/nf-core/eager/pull/711) - Fix conditional execution preventing MultiVCFAnalyzer from running
+- [#714](https://github.com/nf-core/eager/issues/714) - Fix bug in nuclear contamination estimation by upgrading to the latest MultiQC v1.10.1 bugfix release
+
+### `Dependencies`
+
+### `Deprecated`
+
## [2.3.2] - 2021-03-16
### `Added`
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
index 405fb1bfd..f4fd052f1 100644
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -1,46 +1,111 @@
-# Contributor Covenant Code of Conduct
+# Code of Conduct at nf-core (v1.0)
## Our Pledge
-In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
+In the interest of fostering an open, collaborative, and welcoming environment, we as contributors and maintainers of nf-core, pledge to making participation in our projects and community a harassment-free experience for everyone, regardless of:
-## Our Standards
+- Age
+- Body size
+- Familial status
+- Gender identity and expression
+- Geographical location
+- Level of experience
+- Nationality and national origins
+- Native language
+- Physical and neurological ability
+- Race or ethnicity
+- Religion
+- Sexual identity and orientation
+- Socioeconomic status
-Examples of behavior that contributes to creating a positive environment include:
+Please note that the list above is alphabetised and is therefore not ranked in any order of preference or importance.
-* Using welcoming and inclusive language
-* Being respectful of differing viewpoints and experiences
-* Gracefully accepting constructive criticism
-* Focusing on what is best for the community
-* Showing empathy towards other community members
+## Preamble
-Examples of unacceptable behavior by participants include:
+> Note: This Code of Conduct (CoC) has been drafted by the nf-core Safety Officer and edited after input from members of the nf-core team and others. "We", in this document, refers to the Safety Officer and members of the nf-core core team, both of whom are deemed to be members of the nf-core community and are therefore required to abide by this Code of Conduct. This document will be amended periodically to keep it up-to-date, and in case of any dispute, the most current version will apply.
-* The use of sexualized language or imagery and unwelcome sexual attention or advances
-* Trolling, insulting/derogatory comments, and personal or political attacks
-* Public or private harassment
-* Publishing others' private information, such as a physical or electronic address, without explicit permission
-* Other conduct which could reasonably be considered inappropriate in a professional setting
+An up-to-date list of members of the nf-core core team can be found [here](https://nf-co.re/about). Our current safety officer is Renuka Kudva.
+
+nf-core is a young and growing community that welcomes contributions from anyone with a shared vision for [Open Science Policies](https://www.fosteropenscience.eu/taxonomy/term/8). Open science policies encompass inclusive behaviours and we strive to build and maintain a safe and inclusive environment for all individuals.
+
+We have therefore adopted this code of conduct (CoC), which we require all members of our community and attendees in nf-core events to adhere to in all our workspaces at all times. Workspaces include but are not limited to Slack, meetings on Zoom, Jitsi, YouTube live etc.
+
+Our CoC will be strictly enforced and the nf-core team reserve the right to exclude participants who do not comply with our guidelines from our workspaces and future nf-core activities.
+
+We ask all members of our community to help maintain a supportive and productive workspace and to avoid behaviours that can make individuals feel unsafe or unwelcome. Please help us maintain and uphold this CoC.
+
+Questions, concerns or ideas on what we can include? Contact safety [at] nf-co [dot] re
## Our Responsibilities
-Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
+The safety officer is responsible for clarifying the standards of acceptable behaviour and is expected to take appropriate and fair corrective action in response to any instances of unacceptable behaviour.
+
+The safety officer, in consultation with the nf-core core team, has the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviours that they deem inappropriate, threatening, offensive, or harmful.
+
+Members of the core team or the safety officer who violate the CoC will be required to recuse themselves pending investigation. They will not have access to any reports of the violations and will be subject to the same actions as others in violation of the CoC.
+
+## When and where does this Code of Conduct apply?
+
+Participation in the nf-core community is contingent on following these guidelines in all our workspaces and events. This includes but is not limited to the following listed alphabetically and therefore in no order of preference:
+
+- Communicating with an official project email address.
+- Communicating with community members within the nf-core Slack channel.
+- Participating in hackathons organised by nf-core (both online and in-person events).
+- Participating in collaborative work on GitHub, Google Suite, community calls, mentorship meetings, email correspondence.
+- Participating in workshops, training, and seminar series organised by nf-core (both online and in-person events). This applies to events hosted on web-based platforms such as Zoom, Jitsi, YouTube live etc.
+- Representing nf-core on social media. This includes both official and personal accounts.
+
+## nf-core cares 😊
+
+nf-core's CoC and expectations of respectful behaviours for all participants (including organisers and the nf-core team) include but are not limited to the following (listed in alphabetical order):
+
+- Ask for consent before sharing another community member’s personal information (including photographs) on social media.
+- Be respectful of differing viewpoints and experiences. We are all here to learn from one another and a difference in opinion can present a good learning opportunity.
+- Celebrate your accomplishments at events! (Get creative with your use of emojis 🎉 🥳 💯 🙌 !)
+- Demonstrate empathy towards other community members. (We don’t all have the same amount of time to dedicate to nf-core. If tasks are pending, don’t hesitate to gently remind members of your team. If you are leading a task, ask for help if you feel overwhelmed.)
+- Engage with and enquire after others. (This is especially important given the geographically remote nature of the nf-core community, so let’s do this the best we can)
+- Focus on what is best for the team and the community. (When in doubt, ask)
+- Graciously accept constructive criticism, yet be unafraid to question, deliberate, and learn.
+- Introduce yourself to members of the community. (We’ve all been outsiders and we know that talking to strangers can be hard for some, but remember we’re interested in getting to know you and your visions for open science!)
+- Show appreciation and **provide clear feedback**. (This is especially important because we don’t see each other in person and it can be harder to interpret subtleties. Also remember that not everyone understands a certain language to the same extent as you do, so **be clear in your communications to be kind.**)
+- Take breaks when you feel like you need them.
+- Use welcoming and inclusive language. (Participants are encouraged to display their chosen pronouns on Zoom or in communication on Slack.)
+
+## nf-core frowns on 😕
+
+The following behaviours from any participants within the nf-core community (including the organisers) will be considered unacceptable under this code of conduct. Engaging or advocating for any of the following could result in expulsion from nf-core workspaces.
+
+- Deliberate intimidation, stalking or following and sustained disruption of communication among participants of the community. This includes hijacking shared screens through actions such as using the annotate tool in conferencing software such as Zoom.
+- “Doxing” i.e. posting (or threatening to post) another person’s personal identifying information online.
+- Spamming or trolling of individuals on social media.
+- Use of sexual or discriminatory imagery, comments, or jokes and unwelcome sexual attention.
+- Verbal and text comments that reinforce social structures of domination related to gender, gender identity and expression, sexual orientation, ability, physical appearance, body size, race, age, religion or work experience.
+
+### Online Trolling
+
+The majority of nf-core interactions and events are held online. Unfortunately, holding events online comes with the added issue of online trolling. This is unacceptable; reports of such behaviour will be taken very seriously, and perpetrators will be excluded from activities immediately.
+
+All community members are required to ask members of the group they are working within for explicit consent prior to taking screenshots of individuals during video calls.
+
+## Procedures for Reporting CoC violations
-Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
+If someone makes you feel uncomfortable through their behaviours or actions, report it as soon as possible.
-## Scope
+You can reach out to members of the [nf-core core team](https://nf-co.re/about) and they will forward your concerns to the safety officer(s).
-This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
+Issues directly concerning members of the core team will be dealt with by other members of the core team and the safety officer, and possible conflicts of interest will be taken into account. nf-core is also in discussions about having an ombudsperson, and details will be shared in due course.
-## Enforcement
+All reports will be handled with the utmost discretion and confidentiality.
-Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team on [Slack](https://nf-co.re/join/slack). The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
+## Attribution and Acknowledgements
-Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.
+- The [Contributor Covenant, version 1.4](http://contributor-covenant.org/version/1/4)
+- The [OpenCon 2017 Code of Conduct](http://www.opencon2017.org/code_of_conduct) (CC BY 4.0 OpenCon organisers, SPARC and Right to Research Coalition)
+- The [eLife innovation sprint 2020 Code of Conduct](https://sprint.elifesciences.org/code-of-conduct/)
+- The [Mozilla Community Participation Guidelines v3.1](https://www.mozilla.org/en-US/about/governance/policies/participation/) (version 3.1, CC BY-SA 3.0 Mozilla)
-## Attribution
+## Changelog
-This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct/][version]
+### v1.0 - March 12th, 2021
-[homepage]: https://contributor-covenant.org
-[version]: https://www.contributor-covenant.org/version/1/4/code-of-conduct/
+- Complete rewrite from original [Contributor Covenant](http://contributor-covenant.org/) CoC.
diff --git a/Dockerfile b/Dockerfile
index 773a11a32..88e0429a8 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,4 +1,4 @@
-FROM nfcore/base:1.12.1
+FROM nfcore/base:1.13.3
LABEL authors="The nf-core/eager community" \
description="Docker image containing all software requirements for the nf-core/eager pipeline"
@@ -7,10 +7,10 @@ COPY environment.yml /
RUN conda env create --quiet -f /environment.yml && conda clean -a
# Add conda installation dir to PATH (instead of doing 'conda activate')
-ENV PATH /opt/conda/envs/nf-core-eager-2.3.2/bin:$PATH
+ENV PATH /opt/conda/envs/nf-core-eager-2.3.3/bin:$PATH
# Dump the details of the installed packages to a file for posterity
-RUN conda env export --name nf-core-eager-2.3.2 > nf-core-eager-2.3.2.yml
+RUN conda env export --name nf-core-eager-2.3.3 > nf-core-eager-2.3.3.yml
# Instruct R processes to use these empty files instead of clashing with a local version
RUN touch .Rprofile
diff --git a/README.md b/README.md
index 43eec0138..ac9e19a4e 100644
--- a/README.md
+++ b/README.md
@@ -29,12 +29,12 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
1. Install [`nextflow`](https://nf-co.re/usage/installation) (version >= 20.04.0)
-2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/) or [`Podman`](https://podman.io/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_
+2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/), [`Podman`](https://podman.io/), [`Shifter`](https://nersc.gitlab.io/development/shifter/how-to-use/) or [`Charliecloud`](https://hpc.github.io/charliecloud/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_
3. Download the pipeline and test it on a minimal dataset with a single command:
```bash
- nextflow run nf-core/eager -profile test_tsv,
@@ -307,11 +303,11 @@ If you see high numbers of discarded or truncated reads, you should check your F
The length distribution plots show the number of reads at each read-length. You can change the plot to display different categories.
-- All represent the overall distribution of reads. In the case of paired-end sequencing You may see a peak at the turn around from forward to reverse cycles.
-- **Mate 1** and **Mate 2** represents the length of the forward and reverse read respectively prior collapsing
-- **Singleton** represent those reads that had a one member of a pair discarded
-- **Collapsed** and **Collapsed Truncated** represent reads that overlapped and able to merge into a single read, with the latter including base-quality trimming off ends of reads. These plots will start with a vertical rise representing where you are above the minimum-read threshold you set.
-- **Discarded** here represents the number of reads that did not each the read length filter. You will likely see a vertical drop at what your threshold was set to.
+* **All** represents the overall distribution of reads. In the case of paired-end sequencing, you may see a peak at the turnaround from forward to reverse cycles.
+* **Mate 1** and **Mate 2** represent the lengths of the forward and reverse reads respectively, prior to collapsing.
+* **Singleton** represents those reads that had one member of a pair discarded.
+* **Collapsed** and **Collapsed Truncated** represent reads that overlapped and could be merged into a single read, with the latter additionally having base-quality trimming applied to the ends of reads. These plots will start with a vertical rise at the minimum read-length threshold you set.
+* **Discarded** here represents the number of reads that did not reach the read-length filter. You will likely see a vertical drop at the threshold you set.
@@ -357,7 +353,7 @@ Due to low 'endogenous' content of aDNA, and the high biodiversity of modern or
@@ -428,15 +424,15 @@ DeDup is a duplicate removal tool which searches for PCR duplicates and removes
This stacked bar plot shows as a whole the total number of reads in the BAM file going into DeDup. The different sections of a given bar represents the following:
-- **Not Removed** - the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
-- **Reverse Removed** - the number of reads that found to be a duplicate of another and removed that were un-collapsed reverse reads (from the earlier read merging step).
-- **Forward Removed** - the number of reads that found to be a duplicate of another and removed that were an un-collapsed forward reads (from the earlier read merging step).
-- **Merged Removed** - the number of reads that were found to be a duplicate and removed that were a collapsed read (from the earlier read merging step).
+* **Not Removed** - the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
+* **Reverse Removed** - the number of reads that were found to be duplicates of another read and removed, and that were un-collapsed reverse reads (from the earlier read merging step).
+* **Forward Removed** - the number of reads that were found to be duplicates of another read and removed, and that were un-collapsed forward reads (from the earlier read merging step).
+* **Merged Removed** - the number of reads that were found to be duplicates and removed, and that were collapsed reads (from the earlier read merging step).
Exceptions to the above:
-- If you do not have paired end data, you will not have sections for 'Merged removed' or 'Reverse removed'.
-- If you use the `--dedup_all_merged` flag, you will not have the 'Forward removed' or 'Reverse removed' sections.
+* If you do not have paired-end data, you will not have sections for 'Merged removed' or 'Reverse removed'.
+* If you use the `--dedup_all_merged` flag, you will not have the 'Forward removed' or 'Reverse removed' sections.
@@ -444,8 +440,8 @@ Exceptions to the above:
Things to look out for:
-- The smaller the number of the duplicates removed the better. If you have a small number of duplicates, and wish to sequence deeper, you can use the preseq module (see below) to make an estimate on how much deeper to sequence.
-- If you have a very large number of duplicates that were removed this may suggest you have an over amplified library, or a lot of left-over adapters that were able to map to your genome.
+* The smaller the number of duplicates removed, the better. If you have a small number of duplicates and wish to sequence deeper, you can use the preseq module (see below) to estimate how much deeper to sequence.
+* If a very large number of duplicates were removed, this may suggest you have an over-amplified library, or a lot of left-over adapters that were able to map to your genome.
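As a rough worked example of reading this plot, the duplication rate is simply the removed categories over the total; a minimal sketch with entirely hypothetical counts (the category names follow the stacked bar plot described above):

```python
# Hypothetical per-category read counts for one library, as shown in the
# DeDup stacked bar plot; all numbers are made up for illustration.
dedup_counts = {
    "Not Removed": 1_200_000,
    "Forward Removed": 40_000,
    "Reverse Removed": 35_000,
    "Merged Removed": 225_000,
}

total_reads = sum(dedup_counts.values())           # reads going into DeDup
removed = total_reads - dedup_counts["Not Removed"]  # all duplicates removed
duplication_rate = removed / total_reads

print(f"Total mapped reads going into DeDup: {total_reads}")
print(f"Duplicates removed: {removed} ({duplication_rate:.1%})")
```

A rate around 20% as in this toy example would be unremarkable; a much higher rate is the over-amplification warning sign mentioned above.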
### Picard
@@ -455,7 +451,7 @@ Picard is a toolkit for general BAM file manipulation with many different functi
#### Mark Duplicates
-The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well- preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.
+The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well-preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For more accurate estimates, see the [Preseq](#preseq) module below.
@@ -465,8 +461,8 @@ The amount of unmapped reads will depend on whether you have filtered out unmapp
Things to look out for:
-- The smaller the number of the duplicates removed the better. If you have a smaller number of duplicates, and wish to sequence deeper, you can use the preseq module (see below) to make an estimate on how much deeper to sequence.
-- If you have a very large number of duplicates that were removed this may suggest you have an over amplified library, a badly preserved sample with a very low yield, or a lot of left-over adapters that were able to map to your genome.
+* The smaller the number of duplicates removed, the better. If you have a small number of duplicates and wish to sequence deeper, you can use the preseq module (see below) to estimate how much deeper to sequence.
+* If a very large number of duplicates were removed, this may suggest you have an over-amplified library, a badly preserved sample with a very low yield, or a lot of left-over adapters that were able to map to your genome.
### Preseq
@@ -492,9 +488,9 @@ The dashed line represents a 'perfect' library containing only unique molecules
Plateauing can be caused by a number of reasons:
-- You have simply sequenced your library to exhaustion
-- You have an over-amplified library with many PCR duplicates. You should consider rebuilding the library to maximise data to cost ratio
-- You have a low quality library made up of mappable sequencing artefacts that were able to pass filtering (e.g. adapters)
+* You have simply sequenced your library to exhaustion
+* You have an over-amplified library with many PCR duplicates. You should consider rebuilding the library to maximise the data-to-cost ratio.
+* You have a low-quality library made up of mappable sequencing artefacts that were able to pass filtering (e.g. adapters).
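The plateau can also be read off numerically from the complexity table preseq writes: when the marginal number of new unique reads gained per additional sequenced read approaches zero, further sequencing buys little. A minimal sketch, assuming preseq's two-column `TOTAL_READS`/`EXPECTED_DISTINCT` layout and using made-up numbers:

```python
# Parse an inlined preseq-style complexity table (normally the '.ccurve' file
# from the preseq/ results directory); all numbers here are hypothetical.
ccurve = """\
TOTAL_READS\tEXPECTED_DISTINCT
0\t0
1000000\t950000
2000000\t1700000
4000000\t2600000
8000000\t3200000
"""

rows = [tuple(map(float, line.split("\t"))) for line in ccurve.splitlines()[1:]]

# Marginal yield: new unique reads gained per extra read sequenced.
marginals = []
for (t0, d0), (t1, d1) in zip(rows, rows[1:]):
    marginals.append((d1 - d0) / (t1 - t0))
    print(f"{int(t0):>8} -> {int(t1):>8} reads: {marginals[-1]:.2f} unique/read")
```

In this invented example the yield falls from 0.95 to 0.15 unique reads per read sequenced, i.e. the curve is flattening towards a plateau.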
### DamageProfiler
@@ -504,9 +500,9 @@ DamageProfiler is a tool which calculates a variety of standard 'aDNA' metrics f
Therefore, three main characteristics of ancient DNA are:
-- Short DNA fragments
-- Elevated G and As (purines) just before strand breaks
-- Increased C and Ts at ends of fragments
+* Short DNA fragments
+* Elevated Gs and As (purines) just before strand breaks
+* Increased Cs and Ts at the ends of fragments
You will receive output for each deduplicated *library*. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes of the library in one value.
@@ -516,12 +512,12 @@ The MultiQC DamageProfiler module misincorporation plots shows the percent frequ
When looking at the misincorporation plots, keep the following in mind:
-- As few-base single-stranded overhangs are more likely to occur than long overhangs, we expect to see a gradual decrease in the frequency of the modifications from position 1 to the inside of the reads.
-- If your library has been **partially-UDG treated**, only the first one or two bases will display the misincorporation frequency.
-- If your library has been **UDG treated** you will expect to see extremely-low to no misincorporations at read ends.
-- If your library is **single-stranded**, you will expect to see only C to T misincorporations at both 5' and 3' ends of the fragments.
-- We generally expect that the older the sample, or the less-ideal preservational environment (hot/wet) the greater the frequency of C to T/G to A.
-- The curve will be not smooth then you have few reads informing the frequency calculation. Read counts of less than 500 are likely not reliable.
+* As few-base single-stranded overhangs are more likely to occur than long overhangs, we expect to see a gradual decrease in the frequency of the modifications from position 1 to the inside of the reads.
+* If your library has been **partially-UDG treated**, only the first one or two bases will display the misincorporation frequency.
+* If your library has been **UDG treated** you will expect to see extremely-low to no misincorporations at read ends.
+* If your library is **single-stranded**, you will expect to see only C to T misincorporations at both 5' and 3' ends of the fragments.
+* We generally expect that the older the sample, or the less ideal the preservation environment (hot/wet), the greater the frequency of C to T/G to A misincorporations.
+* The curve will not be smooth when few reads inform the frequency calculation. Read counts of less than 500 are likely not reliable.
@@ -535,9 +531,9 @@ The MultiQC DamageProfiler module length distribution plots show the frequency o
When looking at the length distribution plots, keep in mind the following:
-- Your curves will likely not start at 0, and will start wherever your minimum read-length setting was when removing adapters.
-- You should typically see the bulk of the distribution falling between 40-120bp, which is normal for aDNA
-- You may see large peaks at paired-end turn-arounds, due to very-long reads that could not overlap for merging being present, however this reads are normally from modern contamination.
+* Your curves will likely not start at 0, and will start wherever your minimum read-length setting was when removing adapters.
+* You should typically see the bulk of the distribution falling between 40-120 bp, which is normal for aDNA.
+* You may see large peaks at paired-end turnarounds, due to the presence of very long reads that could not overlap for merging; however, these reads are normally from modern contamination.
### QualiMap
@@ -565,14 +561,14 @@ The greater the number of bases covered at as high as possible fold coverage, th
Things to watch out for:
-- You will typically see a direct decay from the lowest coverage to higher. A large range of coverages along the X axis is potentially suspicious.
-- If you have stacking of reads i.e. a small region with an abnormally large amount of reads despite the rest of the reference being quite shallowly covered, this will artificially increase your coverage. This would be represented by a small peak that is a much further along the X axis away from the main distribution of reads.
+* You will typically see a direct decay from the lowest coverage to higher. A large range of coverages along the X axis is potentially suspicious.
+* If you have stacking of reads, i.e. a small region with an abnormally large number of reads despite the rest of the reference being quite shallowly covered, this will artificially increase your coverage. This would be represented by a small peak much further along the X axis, away from the main distribution of reads.
#### Cumulative Genome Coverage
This plot shows how much of the genome in percentage (X axis) is covered by a given fold depth coverage (Y axis).
-An ideal plot for this is to see an increasing curve, representing larger greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage - something particular true for large genome such has for humans.
+An ideal plot would show an increasing curve, representing greater fractions of the genome being covered at higher depth. However, for low-coverage ancient DNA data, you are more likely to see decreasing curves starting at a large percentage of the genome being covered at 0-fold coverage, something particularly true for large genomes such as the human genome.
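The quantity behind this plot can be computed directly from a per-depth histogram of the genome; a short sketch with a toy 1,000 bp 'genome' (all numbers invented) that shows the decreasing curve typical of low-coverage aDNA:

```python
# Hypothetical histogram: fold depth -> number of bases at exactly that depth.
genome_depth_hist = {0: 700, 1: 150, 2: 80, 3: 40, 4: 20, 5: 10}
genome_size = sum(genome_depth_hist.values())  # 1,000 bases in this toy genome

# For each depth d, the cumulative coverage is the % of the genome covered
# at >= d fold -- the Y axis of the cumulative genome coverage plot.
cumulative = {}
for d in sorted(genome_depth_hist):
    covered = sum(n for depth, n in genome_depth_hist.items() if depth >= d)
    cumulative[d] = 100 * covered / genome_size

for d, pct in cumulative.items():
    print(f">= {d}x: {pct:.1f}% of genome")
```

Here 70% of the toy genome sits at 0-fold coverage, so only 30% is covered at 1-fold or more and the curve drops steeply, as described above for low-coverage data.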
@@ -588,9 +584,9 @@ This plot shows the distribution of the frequency of reads at different GC conte
Things to watch out for:
-- This plot should normally show a normal distribution around the average GC content of your reference genome.
-- Bimodal peaks may represent lab-based artefacts that should be further investigated.
-- Skews of the peak to a higher GC content that the reference in Illumina dual-colour chemistry data (e.g. NextSeq or NovaSeq), may suggest long poly-G tails that are mapping to poly-G stretches of your genome. The nf-core/eager trimming option `--complexity_filter_poly_g` can be used to remove these tails by utilising the tool FastP for detection and trimming.
+* This plot should normally show a normal distribution around the average GC content of your reference genome.
+* Bimodal peaks may represent lab-based artefacts that should be further investigated.
+* A skew of the peak towards a higher GC content than the reference, in Illumina dual-colour chemistry data (e.g. NextSeq or NovaSeq), may suggest long poly-G tails that are mapping to poly-G stretches of your genome. The nf-core/eager trimming option `--complexity_filter_poly_g` can be used to remove these tails by utilising the tool fastp for detection and trimming.
### Sex.DetERRmine
@@ -636,7 +632,7 @@ This table shows the contents of the `snpStatistics.tsv` file produced by MultiV
You can get different variants of the call statistics bar plot, depending on how you configured the MultiVCFAnalyzer options.
-If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains - particularly when dealing with low coverage data. Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).
+If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains — particularly when dealing with low coverage data. Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).
@@ -650,32 +646,31 @@ This section gives a brief summary of where to look for what files for downstrea
Each module has its own output directory which sits alongside the `MultiQC/` directory from which you opened the report.
-- `reference_genome/` - this directory contains the indexing files of your input reference genome (i.e. the various `bwa` indices, a `samtools`' `.fai` file, and a picard `.dict`), if you used the `--saveReference` flag.
-- `fastqc/` - this contains the original per-FASTQ FastQC reports that are summarised with MultiQC. These occur in both `html` (the report) and `.zip` format (raw data). The `after_clipping` folder contains the same but for after AdapterRemoval.
-- `adapterremoval/` - this contains the log files (ending with `.settings`) with raw trimming (and merging) statistics after AdapterRemoval. In the `output` sub-directory, are the output trimmed (and merged) FASTQ files. These you can use for downstream applications such as taxonomic binning for metagenomic studies.
-- `mapping/` - this contains a sub-directory corresponding to the mapping tool you used, inside of which will be the initial BAM files containing the reads that mapped to your reference genome with no modification (see below). You will also find a corresponding BAM index file (ending in `.csi` or `.bam`), and if running the `bowtie2` mapper - a log ending in `_bt2.log`. You can use these for downstream applications e.g. if you wish to use a different de-duplication tool not included in nf-core/eager (although please feel free to add a new module request on the Github repository's [issue page](https://github.com/nf-core/eager/issues)!).
-- `samtools/` - this contains two sub-directories. `stats/` contain the raw mapping statistics files (ending in `.stats`) from directly after mapping. `filter/` contains BAM files that have had a mapping quality filter applied (set by the `--bam_mapping_quality_threshold` flag) and a corresponding index file. Furthermore, if you selected `--bam_discard_unmapped`, you will find your separate file with only unmapped reads in the format you selected. Note unmapped read BAM files will _not_ have an index file.
-- `deduplication/` - this contains a sub-directory called `dedup/`, inside here are sample specific directories. Each directory contains a BAM file containing mapped reads but with PCR duplicates removed, a corresponding index file and two stats file. `.hist.` contains raw data for a deduplication histogram used for tools like preseq (see below), and the `.log` contains overall summary deduplication statistics.
-- `endorSpy/` - this contains all JSON files exported from the endorSpy endogenous DNA calculation tool. The JSON files are generated specifically for display in the MultiQC general statistics table and is otherwise very likely not useful for you.
-- `preseq/` - this contains a `.ccurve` file for every BAM file that had enough deduplication statistics to generate a complexity curve for estimating the amount unique reads that will be yield if the library is re-sequenced. You can use this file for plotting e.g. in `R` to find your sequencing target depth.
-- `qualimap/` - this contains a sub-directory for every sample, which includes a qualimap report and associated raw statistic files. You can open the `.html` file in your internet browser to see the in-depth report (this will be more detailed than in MultiQC). This includes stuff like percent coverage, depth coverage, GC content and so on of your mapped reads.
-- `damageprofiler/` - this contains sample specific directories containing raw statistics and damage plots from DamageProfiler. The `.pdf` files can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files.
-- `pmdtools/` - this contains raw output statistics of pmdtools (estimates of frequencies of substitutions), and BAM files which have been filtered to remove reads that do not have a Post-mortem damage (PMD) score of `--pmdtools_threshold`.
-- `trimmed_bam/` - this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_half_udg_left`, `--bamutils_clip_half_udg_right`, `--bamutils_clip_none_udg_left`, and `--bamutils_clip_none_udg_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read).
-- `damage_rescaling/` - this contains rescaled BAM files from mapDamage2. These BAM files have damage probabilistically removed via a bayesian model, and can be used for downstream genotyping.
-- `genotyping/` - this contains all the (gzipped) genotyping files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have files corresponding to each of your deduplicated BAM files (except pileupcaller), or any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). If `--gatk_ug_keep_realign_bam` supplied, this may also contain BAM files from InDel realignment when using GATK 3 and UnifiedGenotyping for variant calling. When pileupcaller is used to create eigenstrat genotypes, this directory also contains eigenstrat SNP coverage statistics.
-- `multivcfanalyzer/` - this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files.
-- `sex_determination/` - this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the sample name, the number of autosomal SNPs, number of SNPs on the X/Y chromosome, the number of reads mapping to the autosomes, the number of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per file. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.
-- `nuclear_contamination/` - this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt` which is a summary table of the results for all individual. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimate (with their respective standard errors) for both Method1 and Method2.
-- `bedtools/` - this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`), contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position).
-- `metagenomic_complexity_filter` - this contains the output from filtering of input reads to metagenomic classification of low-sequence complexity reads as performed by `bbduk`. This will include the filtered FASTQ files (`*_lowcomplexityremoved.fq.gz`) and also the run-time log (`_bbduk.stats`) for each sample. **Note:** there are no sections in the MultiQC report for this module, therefore you must check the `._bbduk.stats` files to get summary statistics of the filtering.
-- `metagenomic_classification/` - this contains the output for a given metagenomic classifier.
- - Running MALT will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additional a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc. This will also include gzip SAM files if requested.
- - Running kraken will contain the Kraken output and report files, as well as a merged Taxon count table. You will also get a Kraken kmer duplication table, in a [KrakenUniq](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0) fashion. This is very useful to check for breadth of coverage and detect read stacking. A small number of aligned reads (low coverage) and a kmer duplication >1 is usually a sign of read stacking, usually indicative of a false positive hit (e.g. from over-amplified libraries). *Kmer duplication is defined as: number of kmers / number of unique kmers*. You will find two kraken reports formats available:
- - the `*.kreport` which is the old report format, without distinct minimizer count information, used by some tools such as [Pavian](https://github.com/fbreitwieser/pavian)
- - the `*.kraken2_report` which is the new kraken report format, with the distinct minimizer count information.
-
- Finally, the `*.kraken.out` file are the direct output of Kraken2
-- `maltextract/` - this contains a `results` directory in which contains the output from MaltExtract - typically one folder for each filter type, an error and a log file. The characteristics of each node (e.g. damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA)
-- `consensus_sequence/` - this contains three FASTA files from VCF2Genome of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAS (`_refmod.fasta.gz`) and (`_uncertainity.fasta.gz`) are IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively.
-- `librarymerged_bams/` - these contain the final BAM files that would go into genotyping (if genotyping is turned on). This means the files will contain all libraries of a given sample (including trimmed non-UDG or half-UDG treated libraries, if BAM trimming turned on)
+* `reference_genome/`: this directory contains the indexing files of your input reference genome (i.e. the various `bwa` indices, a `samtools` `.fai` file, and a Picard `.dict`), if you used the `--saveReference` flag.
+* `fastqc/`: this contains the original per-FASTQ FastQC reports that are summarised with MultiQC. These occur in both `.html` (the report) and `.zip` (raw data) formats. The `after_clipping` folder contains the same but for after AdapterRemoval.
+* `adapterremoval/`: this contains the log files (ending with `.settings`) with raw trimming (and merging) statistics from AdapterRemoval. In the `output` sub-directory are the trimmed (and merged) FASTQ files. These you can use for downstream applications such as taxonomic binning for metagenomic studies.
+* `mapping/`: this contains a sub-directory corresponding to the mapping tool you used, inside of which will be the initial BAM files containing the reads that mapped to your reference genome with no modification (see below). You will also find a corresponding BAM index file (ending in `.csi` or `.bai`), and if running the `bowtie2` mapper, a log ending in `_bt2.log`. You can use these for downstream applications, e.g. if you wish to use a different de-duplication tool not included in nf-core/eager (although please feel free to add a new module request on the GitHub repository's [issue page](https://github.com/nf-core/eager/issues)!).
+* `samtools/`: this contains two sub-directories. `stats/` contains the raw mapping statistics files (ending in `.stats`) from directly after mapping. `filter/` contains BAM files that have had a mapping quality filter applied (set by the `--bam_mapping_quality_threshold` flag) and a corresponding index file. Furthermore, if you selected `--bam_discard_unmapped`, you will find your separate file with only unmapped reads in the format you selected. Note that unmapped-read BAM files will _not_ have an index file.
+* `deduplication/`: this contains a sub-directory called `dedup/`, inside which are sample-specific directories. Each directory contains a BAM file of mapped reads with PCR duplicates removed, a corresponding index file and two stats files: the `.hist` file contains raw data for a deduplication histogram used by tools like preseq (see below), and the `.log` file contains overall summary deduplication statistics.
+* `endorSpy/`: this contains all JSON files exported from the endorSpy endogenous DNA calculation tool. The JSON files are generated specifically for display in the MultiQC general statistics table and are otherwise very likely not useful for you.
+* `preseq/`: this contains a `.ccurve` file for every BAM file that had enough deduplication statistics to generate a complexity curve, for estimating the number of unique reads that would be yielded if the library were re-sequenced. You can use this file for plotting, e.g. in `R`, to find your sequencing target depth.
+* `qualimap/`: this contains a sub-directory for every sample, which includes a Qualimap report and associated raw statistics files. You can open the `.html` file in your internet browser to see the in-depth report (this will be more detailed than in MultiQC). This includes statistics such as percent coverage, depth of coverage, GC content and so on of your mapped reads.
+* `damageprofiler/`: this contains sample-specific directories with raw statistics and damage plots from DamageProfiler. The `.pdf` files can be used to visualise C to T miscoding lesions or read-length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files.
+* `pmdtools/`: this contains raw output statistics of pmdtools (estimates of substitution frequencies), and BAM files which have been filtered to remove reads that do not reach the post-mortem damage (PMD) score set by `--pmdtools_threshold`.
+* `trimmed_bam/`: this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_half_udg_left`, `--bamutils_clip_half_udg_right`, `--bamutils_clip_none_udg_left`, and `--bamutils_clip_none_udg_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read).
+* `damage_rescaling/`: this contains rescaled BAM files from mapDamage2. These BAM files have damage probabilistically removed via a Bayesian model, and can be used for downstream genotyping.
+* `genotyping/`: this contains all the (gzipped) genotyping files produced by your genotyping module. The file suffix will contain the genotyping tool name. You will have files corresponding to each of your deduplicated BAM files (except for pileupCaller), or any turned-on downstream processes that create BAMs (e.g. trimmed BAMs or PMDtools). If `--gatk_ug_keep_realign_bam` is supplied, this may also contain BAM files from InDel realignment when using GATK 3 and UnifiedGenotyper for variant calling. When pileupCaller is used to create EIGENSTRAT genotypes, this directory also contains EIGENSTRAT SNP coverage statistics.
+* `multivcfanalyzer/`: this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files.
+* `sex_determination/`: this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the sample name, the number of autosomal SNPs, the number of SNPs on the X/Y chromosome, the number of reads mapping to the autosomes, the number of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each BAM file, one row per file. If the `--sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.
+* `nuclear_contamination/`: this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt`, which is a summary table of the results for all individuals. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprising the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimates (with their respective standard errors) for both Method1 and Method2.
+* `bedtools/`: this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), followed by columns for: the number of reads on the feature, the number of bases at depth, the length of the feature, and the percentage of the feature covered. The second file (`*.depth.gz`) contains the contents of your annotation file (e.g. BED/GFF) and an additional column which is the mean depth of coverage (i.e. the average number of reads covering each position).
+* `metagenomic_complexity_filter/`: this contains the output from the removal of low-sequence-complexity reads (performed with `bbduk`) prior to metagenomic classification. This will include the filtered FASTQ files (`*_lowcomplexityremoved.fq.gz`) and the run-time log (`_bbduk.stats`) for each sample. **Note:** there are no sections in the MultiQC report for this module, therefore you must check the `_bbduk.stats` files to get summary statistics of the filtering.
+* `metagenomic_classification/`: this contains the output for a given metagenomic classifier.
+ * Running MALT will produce RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additionally, a `malt.log` file is provided which gives information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc. This will also include gzipped SAM files if requested.
+ * Running Kraken will produce the Kraken output and report files, as well as a merged taxon count table. You will also get a Kraken kmer duplication table, in a [KrakenUniq](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0) fashion. This is very useful to check breadth of coverage and detect read stacking. A small number of aligned reads (low coverage) and a kmer duplication >1 is usually a sign of read stacking, usually indicative of a false-positive hit (e.g. from over-amplified libraries). *Kmer duplication is defined as: number of kmers / number of unique kmers*. You will find two Kraken report formats available:
+ * the `*.kreport` which is the old report format, without distinct minimizer count information, used by some tools such as [Pavian](https://github.com/fbreitwieser/pavian)
+ * the `*.kraken2_report` which is the new kraken report format, with the distinct minimizer count information.
+ * finally, the `*.kraken.out` files are the direct output of Kraken2
+* `maltextract/`: this contains a `results` directory which contains the output from MaltExtract - typically one folder for each filter type, an error file and a log file. The characteristics of each node (e.g. damage, read lengths, edit distances - each in different `.txt` formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA).
+* `consensus_sequence/`: this contains three FASTA files from VCF2Genome of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds set to Ns. The two other FASTAs (`_refmod.fasta.gz` and `_uncertainity.fasta.gz`) use IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used by other downstream tools, respectively.
+* `librarymerged_bams/`: this contains the final BAM files that would go into genotyping (if genotyping is turned on). This means the files will contain all libraries of a given sample (including trimmed non-UDG or half-UDG treated libraries, if BAM trimming is turned on).
diff --git a/docs/usage.md b/docs/usage.md
index 7d13ff440..82ce25cc0 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -81,7 +81,7 @@ twice the amount of CPU and memory. This will occur two times before failing.
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
-Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Conda) - see below.
+Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below.
> We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.
@@ -92,22 +92,28 @@ They are loaded in sequence, so later profiles can overwrite earlier profiles.
If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended.
-- `docker`
- - A generic configuration profile to be used with [Docker](https://docker.com/)
- - Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
-- `singularity`
- - A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/)
- - Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
-- `podman`
- - A generic configuration profile to be used with [Podman](https://podman.io/)
- - Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
-- `conda`
- - Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity or Podman.
- - A generic configuration profile to be used with [Conda](https://conda.io/docs/)
- - Pulls most software from [Bioconda](https://bioconda.github.io/)
-- `test_tsv`
- - A profile with a complete configuration for automated testing
- - Includes links to test data so needs no other parameters
+* `docker`
+ * A generic configuration profile to be used with [Docker](https://docker.com/)
+ * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
+* `singularity`
+ * A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/)
+ * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
+* `podman`
+ * A generic configuration profile to be used with [Podman](https://podman.io/)
+ * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
+* `shifter`
+ * A generic configuration profile to be used with [Shifter](https://nersc.gitlab.io/development/shifter/how-to-use/)
+ * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
+* `charliecloud`
+ * A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/)
+ * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/)
+* `conda`
+ * Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
+ * A generic configuration profile to be used with [Conda](https://conda.io/docs/)
+ * Pulls most software from [Bioconda](https://bioconda.github.io/)
+* `test_tsv`
+ * A profile with a complete configuration for automated testing
+ * Includes links to test data so needs no other parameters
> *Important*: If running nf-core/eager on a cluster - ask your system
> administrator what profile to use.
@@ -118,17 +124,17 @@ clusters**, and are centrally maintained at
regular users of nf-core/eager, if you don't see your own institution here check
the [nf-core/configs](https://github.com/nf-core/configs) repository.
-- `uzh`
- - A profile for the University of Zurich Research Cloud
- - Loads Singularity and defines appropriate resources for running the
+* `uzh`
+ * A profile for the University of Zurich Research Cloud
+ * Loads Singularity and defines appropriate resources for running the
pipeline.
-- `binac`
- - A profile for the BinAC cluster at the University of Tuebingen 0 Loads
+* `binac`
+ * A profile for the BinAC cluster at the University of Tuebingen
+ * Loads
Singularity and defines appropriate resources for running the pipeline
-- `shh`
- - A profile for the S/CDAG cluster at the Department of Archaeogenetics of
+* `shh`
+ * A profile for the S/CDAG cluster at the Department of Archaeogenetics of
the Max Planck Institute for the Science of Human History
- - Loads Singularity and defines appropriate resources for running the pipeline
+ * Loads Singularity and defines appropriate resources for running the pipeline
**Pipeline Specific Institution Profiles** There are also pipeline-specific
institution profiles. I.e., we can also offer a profile which sets special
@@ -139,10 +145,10 @@ pipelines. This can be seen at
We currently offer a nf-core/eager specific profile for
-- `shh`
- - A profiler for the S/CDAG cluster at the Department of Archaeogenetics of
+* `shh`
+ * A profile for the S/CDAG cluster at the Department of Archaeogenetics of
the Max Planck Institute for the Science of Human History
- - In addition to the nf-core wide profile, this also sets the MALT resources
+ * In addition to the nf-core wide profile, this also sets the MALT resources
to match our commonly used databases
Further institutions can be added at
@@ -181,6 +187,8 @@ process {
}
```
+To find the exact name of the process whose compute resources you wish to modify, check the live status of a Nextflow run displayed in your terminal, or check the Nextflow error for a line such as: `Error executing process > 'bwa'`. In this case the name to specify in the custom config file is `bwa`.
+
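+For example, a minimal custom config (passed to the run with `-c custom.config`) overriding the resources of the `bwa` process could look like the following sketch. The values here are illustrative, not recommendations:
+
+```nextflow
+// custom.config: override resources for the process named 'bwa'
+process {
+    withName: bwa {
+        cpus   = 8
+        memory = 32.GB
+        time   = 12.h
+    }
+}
+```
+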
See the main [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for more information.
If you are likely to be running `nf-core` pipelines regularly it may be a good
@@ -277,7 +285,7 @@ If you have multiple files in different directories, you can use additional wild
4. When using the pipeline with **paired end data**, the path must use `{1,2}`
notation to specify read pairs.
5. Files names must be unique, having files with the same name, but in different directories is _not_ sufficient
- - This can happen when a library has been sequenced across two sequencers on the same lane. Either rename the file, try a symlink with a unique name, or merge the two FASTQ files prior input.
+ * This can happen when a library has been sequenced across two sequencers on the same lane. Either rename the file, try a symlink with a unique name, or merge the two FASTQ files prior to input.
6. Due to limitations of downstream tools (e.g. FastQC), sample IDs may be truncated after the first `.` in the name. Ensure file names are unique prior to this!
7. For input BAM files you should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses.
@@ -309,17 +317,17 @@ When using TSV_input, nf-core/eager will merge FASTQ files of libraries with the
Column descriptions are as follows:
-- **Sample_Name:** A text string containing the name of a given sample of which there can be multiple libraries. All libraries with the same sample name and same SeqType will be merged after deduplication.
-- **Library_ID:** A text string containing a given library, which there can be multiple sequencing lanes (with the same SeqType).
-- **Lane:** A number indicating which lane the library was sequenced on. Files from the libraries sequenced on different lanes (and different SeqType) will be concatenated after read clipping and merging.
-- **Colour Chemistry** A number indicating whether the Illumina sequencer the library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour chemistry machine. This informs whether poly-G trimming (if turned on) should be performed.
-- **SeqType:** A text string of either 'PE' or 'SE', specifying paired end (with both an R1 [or forward] and R2 [or reverse]) and single end data (only R1 [forward], or BAM). This will affect lane merging if different per library.
-- **Organism:** A text string of the organism name of the sample or 'NA'. This currently has no functionality and can be set to 'NA', but will affect lane/library merging if different per library
-- **Strandedness:** A text string indicating whether the library type is'single' or 'double'. This will affect lane/library merging if different per library.
-- **UDG_Treatment:** A text string indicating whether the library was generated with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library merging if different per library.
-- **R1:** A text string of a file path pointing to a forward or R1 FASTQ file. This can be used with the R2 column. File names **must be unique**, even if they are in different directories.
-- **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file, or 'NA' when single end data. This can be used with the R1 column. File names **must be unique**, even if they are in different directories.
-- **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot be specified at the same time as R1 or R2, both of which should be set to 'NA'
+* **Sample_Name:** A text string containing the name of a given sample of which there can be multiple libraries. All libraries with the same sample name and same SeqType will be merged after deduplication.
+* **Library_ID:** A text string containing a given library, which there can be multiple sequencing lanes (with the same SeqType).
+* **Lane:** A number indicating which lane the library was sequenced on. Files from the libraries sequenced on different lanes (and different SeqType) will be concatenated after read clipping and merging.
+* **Colour Chemistry:** A number indicating whether the Illumina sequencer the library was sequenced on uses 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour chemistry. This informs whether poly-G trimming (if turned on) should be performed.
+* **SeqType:** A text string of either 'PE' or 'SE', specifying paired-end (with both an R1 [or forward] and R2 [or reverse]) or single-end data (only R1 [forward], or BAM). This will affect lane merging if different per library.
+* **Organism:** A text string of the organism name of the sample, or 'NA'. This currently has no functionality and can be set to 'NA', but will affect lane/library merging if different per library.
+* **Strandedness:** A text string indicating whether the library type is 'single' or 'double'. This will affect lane/library merging if different per library.
+* **UDG_Treatment:** A text string indicating whether the library was generated with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library merging if different per library.
+* **R1:** A text string of a file path pointing to a forward or R1 FASTQ file. This can be used with the R2 column. File names **must be unique**, even if they are in different directories.
+* **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file, or 'NA' when single end data. This can be used with the R1 column. File names **must be unique**, even if they are in different directories.
+* **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot be specified at the same time as R1 or R2, both of which should be set to 'NA'
For example, the following TSV table:
@@ -332,32 +340,32 @@ For example, the following TSV table:
will have the following effects:
-- After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8 _with the same `SeqType`_ (and all other _metadata_ columns) will be concatenated together for each **Library**.
-- After mapping, and prior BAM filtering, BAM files with different `SeqType` (but with all other metadata columns the same) will be merged together for each **Library**.
-- After duplicate removal, BAM files with different `Library_ID`s but with the same `Sample_Name` and the same `UDG_Treatment` will be merged together.
-- If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and half-UDG ) will be merged with UDG-treated (untreated) BAMs, if they have the same `Sample_Name`.
+* After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8 _with the same `SeqType`_ (and all other _metadata_ columns) will be concatenated together for each **Library**.
+* After mapping, and prior to BAM filtering, BAM files with different `SeqType` (but with all other metadata columns the same) will be merged together for each **Library**.
+* After duplicate removal, BAM files with different `Library_ID`s but with the same `Sample_Name` and the same `UDG_Treatment` will be merged together.
+* If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and half-UDG) will be merged with UDG-treated (untreated) BAMs, if they have the same `Sample_Name`.
Note the following important points and limitations for setting up:
-- The TSV must use actual tabs (not spaces) between cells.
-- *File* names must be unique regardless of file path, due to risk of over-writing (see: [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)).
- - If it is 'too late' and you already have duplicate file names, a workaround is to concatenate the FASTQ files together and supply this to a nf-core/eager run. The only downside is that you will not get independent FASTQC results for each file.
-- Lane IDs must be unique for each sequencing of each library.
- - If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will still be processed correctly.
- - This also applies to the SeqType column, i.e. with the example above, if one run is PE and one run is SE, you need to give fake lane IDs to one of the runs as well.
-- All _BAM_ files must be specified as `SE` under `SeqType`.
- - You should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses.
-- nf-core/eager will only merge multiple _lanes_ of sequencing runs with the same single-end or paired-end configuration
-- Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files (unless you use `--run_convertbam`), as only FASTQ files are lane-merged together.
-- Same libraries that are sequenced on different sequencing configurations (i.e single- and paired-end data), will be merged after mapping and will _always_ be considered 'paired-end' during downstream processes
- - **Important** running DeDup in this context is _not_ recommended, as PE and SE data at the same position will _not_ be evaluated as duplicates. Therefore not all duplicates will be removed.
- - When you wish to run PE/SE data together `-dedupper markduplicates` is therefore preferred.
- - An error will be thrown if you try to merge both PE and SE and also supply `--skip_merging`.
- - If you truly want to mix SE data and PE data but using mate-pair info for PE mapping, please run FASTQ preprocessing mapping manually and supply BAM files for downstream processing by nf-core/eager
- - If you _regularly_ want to run the situation above, please leave a feature request on github.
-- DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on each unique library separately after deduplication (but prior same-treated library merging).
-- nf-core/eager functionality such as `--run_trim_bam` will be applied to only non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries. - Qualimap is run on each sample, after merging of libraries (i.e. your values will reflect the values of all libraries combined - after being damage trimmed etc.).
-- Genotyping will be typically performed on each `sample` independently, as normally all libraries will have been merged together. However, if you have a mixture of single-stranded and double-stranded libraries, you will normally need to genotype separately. In this case you **must** give each the SS and DS libraries _distinct_ `Sample_IDs`; otherwise you will receive a `file collision` error in steps such as `sexdeterrmine`, and then you will need to merge these yourself. We will consider changing this behaviour in the future if there is enough interest.
+* The TSV must use actual tabs (not spaces) between cells.
+* *File* names must be unique regardless of file path, due to risk of over-writing (see: [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)).
+ * If it is 'too late' and you already have duplicate file names, a workaround is to concatenate the FASTQ files together and supply this to a nf-core/eager run. The only downside is that you will not get independent FASTQC results for each file.
+* Lane IDs must be unique for each sequencing of each library.
+ * If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will still be processed correctly.
+ * This also applies to the SeqType column, i.e. with the example above, if one run is PE and one run is SE, you need to give fake lane IDs to one of the runs as well.
+* All _BAM_ files must be specified as `SE` under `SeqType`.
+ * You should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses.
+* nf-core/eager will only merge multiple _lanes_ of sequencing runs with the same single-end or paired-end configuration
+* Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files (unless you use `--run_convertbam`), as only FASTQ files are lane-merged together.
+* The same libraries sequenced on different configurations (i.e. single- and paired-end data) will be merged after mapping, and will _always_ be considered 'paired-end' during downstream processes
+ * **Important:** running DeDup in this context is _not_ recommended, as PE and SE data at the same position will _not_ be evaluated as duplicates. Therefore not all duplicates will be removed.
+ * When you wish to run PE/SE data together, `--dedupper markduplicates` is therefore preferred.
+ * An error will be thrown if you try to merge both PE and SE and also supply `--skip_merging`.
+ * If you truly want to mix SE data and PE data while using mate-pair info for PE mapping, please run FASTQ preprocessing and mapping manually, and supply BAM files for downstream processing by nf-core/eager
+ * If you _regularly_ want to run the situation above, please leave a feature request on GitHub.
+* DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on each unique library separately after deduplication (but prior same-treated library merging).
+* nf-core/eager functionality such as `--run_trim_bam` will be applied only to non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries.
+* Qualimap is run on each sample, after merging of libraries (i.e. your values will reflect the values of all libraries combined - after being damage trimmed etc.).
+* Genotyping will typically be performed on each `sample` independently, as normally all libraries will have been merged together. However, if you have a mixture of single-stranded and double-stranded libraries, you will normally need to genotype them separately. In this case you **must** give the SS and DS libraries _distinct_ `Sample_IDs`; otherwise you will receive a `file collision` error in steps such as `sexdeterrmine`, and you will then need to merge these yourself. We will consider changing this behaviour in the future if there is enough interest.
## Clean up
@@ -419,7 +427,7 @@ In some cases it maybe no output log is produced by a particular tool for MultiQ
Known cases include:
-- Qualimap: there will be no MultiQC output if the BAM file is empty. An empty BAM file is produced when no reads map to the reference and causes Qualimap to crash - this is crash is ignored by nf-core/eager (to allow the rest of the pipeline to continue) and will therefore have no log file for that particular sample/library
+* Qualimap: there will be no MultiQC output if the BAM file is empty. An empty BAM file is produced when no reads map to the reference, causing Qualimap to crash - this crash is ignored by nf-core/eager (to allow the rest of the pipeline to continue), and there will therefore be no log file for that particular sample/library
## Tutorials
@@ -536,10 +544,10 @@ If you change into this with `cd` and run `ls -la` you should see a collection
of normal files, symbolic links (symlinks) and hidden files (indicated with `.`
at the beginning of the file name).
-- Symbolic links: are typically input files from previous processes.
-- Normal files: are typically successfully completed output files from some of
+* Symbolic links: are typically input files from previous processes.
+* Normal files: are typically successfully completed output files from
some of the commands in the process
-- Hidden files are Nextflow generated files and include the submission commands
+* Hidden files are Nextflow generated files and include the submission commands
as well as log files
When you have an error run, you can firstly check the contents of the output
@@ -596,9 +604,9 @@ DNA to map and cause false positive SNP calls.
Within nf-core, there are two main levels of configs
-- Institutional-level profiles: these normally define things like paths to
+* Institutional-level profiles: these normally define things like paths to
common storage, resource maximums, scheduling system
-- Pipeline-level profiles: these normally define parameters specifically for a
+* Pipeline-level profiles: these normally define parameters specifically for a
pipeline (such as mapping parameters, turning specific modules on or off)
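+
+As a sketch of the distinction, an institutional profile typically pins infrastructure settings only, while a pipeline-level profile sets tool parameters. The exact values below are hypothetical examples, not recommendations:
+
+```nextflow
+// Institutional-level profile: infrastructure only
+singularity.enabled = true
+params.max_memory   = 256.GB
+
+// Pipeline-level profile: analysis parameters
+params.bwaalnn           = 0.01
+params.run_bam_filtering = true
+```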
As well as allowing more efficiency and control at cluster or Institutional
@@ -656,11 +664,11 @@ This would be translated as follows.
If your parameters looked like the following
-Parameter       | Resolved Parameters    | institution | cluster  | my_paper
-----------------|------------------------|-------------|----------|----------
---executor      | singularity            | singularity |          |
-            for (param in group_params.keySet()) {
-                summary_section += "        <dt>$param</dt><dd><samp>${group_params.get(param) ?: '<span style=\"color:#999999;\">N/A</a>'}</samp></dd>\n"
-            }
-        }
-
-    String yaml_file_text  = "id: '${workflow.manifest.name.replace('/','-')}-summary'\n"
-    yaml_file_text        += "description: ' - this information is collected when the pipeline is started.'\n"
-    yaml_file_text        += "section_name: '${workflow.manifest.name} Workflow Summary'\n"
-    yaml_file_text        += "section_href: 'https://github.com/${workflow.manifest.name}'\n"
-    yaml_file_text        += "plot_type: 'html'\n"
-    yaml_file_text        += "data: |\n"
-    yaml_file_text        += "${summary_section}"
-    return yaml_file_text
-    }
-
}
diff --git a/main.nf b/main.nf
index f3ad4b2a0..6033028c6 100644
--- a/main.nf
+++ b/main.nf
@@ -11,128 +11,23 @@
------------------------------------------------------------------------------------------------------------
*/
+log.info Headers.nf_core(workflow, params.monochrome_logs)
-// Show help message
-params.help = false
+////////////////////////////////////////////////////
+/* -- PRINT HELP -- */
+////////////////////////////////////////////////////
def json_schema = "$projectDir/nextflow_schema.json"
if (params.help) {
- def command = "nextflow run nf-core/eager -profile
+    .map { x -> """
+    $x
+
+    """.stripIndent() }
+    .set { ch_workflow_summary }
-// Check AWS batch settings
-Checks.aws_batch(workflow, params)
// Check the hostnames against configured profiles
-Checks.hostname(workflow, params, log)
+checkHostname()
log.info "Schaffa, Schaffa, Genome Baua!"
@@ -1298,7 +1226,7 @@ process bwamem {
// CircularMapper reference preparation and mapping for circular genomes e.g. mtDNA
process circulargenerator{
- label 'sc_tiny'
+ label 'sc_medium'
tag "$prefix"
publishDir "${params.outdir}/reference_genome/circularmapper_index", mode: params.publish_dir_mode, saveAs: { filename ->
if (params.save_reference) filename
@@ -1320,7 +1248,7 @@ process circulargenerator{
script:
prefix = "${fasta.baseName}_${params.circularextension}.fasta"
"""
- circulargenerator -e ${params.circularextension} -i $fasta -s ${params.circulartarget}
+ circulargenerator -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i $fasta -s ${params.circulartarget}
bwa index $prefix
"""
@@ -1353,7 +1281,7 @@ process circularmapper{
bwa aln -t ${task.cpus} $elongated_root $r1 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.r1.sai
bwa aln -t ${task.cpus} $elongated_root $r2 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.r2.sai
bwa sampe -r "@RG\\tID:ILLUMINA-${libraryid}\\tSM:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $elongated_root ${libraryid}.r1.sai ${libraryid}.r2.sai $r1 $r2 > tmp.out
- realignsamfile -e ${params.circularextension} -i tmp.out -r $fasta $filter
+ realignsamfile -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i tmp.out -r $fasta $filter
samtools sort -@ ${task.cpus} -O bam tmp_realigned.bam > ${libraryid}_"${seqtype}".mapped.bam
samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size}
"""
@@ -1361,7 +1289,7 @@ process circularmapper{
"""
bwa aln -t ${task.cpus} $elongated_root $r1 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.sai
bwa samse -r "@RG\\tID:ILLUMINA-${libraryid}\\tSM:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $elongated_root ${libraryid}.sai $r1 > tmp.out
- realignsamfile -e ${params.circularextension} -i tmp.out -r $fasta $filter
+ realignsamfile -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i tmp.out -r $fasta $filter
samtools sort -@ ${task.cpus} -O bam tmp_realigned.bam > "${libraryid}"_"${seqtype}".mapped.bam
samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size}
"""
@@ -1567,7 +1495,7 @@ ch_branched_for_seqtypemerge = ch_mapping_for_seqtype_merging
"""
samtools merge ${libraryid}_seqtypemerged.bam ${bam}
## Have to set validation as lenient because of BWA issue: "I see a read stands out the end of a chromosome and is flagged as unmapped (flag 0x4). [...]" http://bio-bwa.sourceforge.net/
- picard AddOrReplaceReadGroups I=${libraryid}_seqtypemerged.bam O=${libraryid}_seqtypemerged_rg.bam RGID=1 RGLB="${libraryid}_seqtypemerged" RGPL=illumina RGPU=4410 RGSM="${libraryid}_seqtypemerged" VALIDATION_STRINGENCY=LENIENT
+ picard -Xmx${task.memory.toGiga()}g AddOrReplaceReadGroups I=${libraryid}_seqtypemerged.bam O=${libraryid}_seqtypemerged_rg.bam RGID=1 RGLB="${libraryid}_seqtypemerged" RGPL=illumina RGPU=4410 RGSM="${libraryid}_seqtypemerged" VALIDATION_STRINGENCY=LENIENT
samtools index ${libraryid}_seqtypemerged_rg.bam ${size}
"""
@@ -1938,7 +1866,7 @@ process library_merge {
"""
samtools merge ${samplename}_libmerged_rmdup.bam ${bam}
## Have to set validation as lenient because of BWA issue: "I see a read stands out the end of a chromosome and is flagged as unmapped (flag 0x4). [...]" http://bio-bwa.sourceforge.net/
- picard AddOrReplaceReadGroups I=${samplename}_libmerged_rmdup.bam O=${samplename}_libmerged_rg_rmdup.bam RGID=1 RGLB="${samplename}_merged" RGPL=illumina RGPU=4410 RGSM="${samplename}_merged" VALIDATION_STRINGENCY=LENIENT
+ picard -Xmx${task.memory.toGiga()}g AddOrReplaceReadGroups I=${samplename}_libmerged_rmdup.bam O=${samplename}_libmerged_rg_rmdup.bam RGID=1 RGLB="${samplename}_merged" RGPL=illumina RGPU=4410 RGSM="${samplename}_merged" VALIDATION_STRINGENCY=LENIENT
samtools index ${samplename}_libmerged_rg_rmdup.bam ${size}
"""
}
@@ -2081,8 +2009,8 @@ process mapdamage_rescaling {
def singlestranded = strandedness == "single" ? '--single-stranded' : ''
def size = params.large_ref ? '-c' : ''
"""
- mapDamage -i ${bam} -r ${fasta} --rescale --rescale-out ${bam}_rescaled.bam --rescale-length-5p ${params.rescale_length_5p} --rescale-length-3p=${params.rescale_length_3p} ${singlestranded}
- samtools index ${bam}_rescaled.bam ${size}
+ mapDamage -i ${bam} -r ${fasta} --rescale --rescale-out ${base}_rescaled.bam --rescale-length-5p ${params.rescale_length_5p} --rescale-length-3p=${params.rescale_length_3p} ${singlestranded}
+ samtools index ${base}_rescaled.bam ${size}
"""
}
@@ -2114,14 +2042,15 @@ process pmdtools {
snpcap = ''
}
def size = params.large_ref ? '-c' : ''
+ def platypus = params.pmdtools_platypus ? '--platypus' : ''
"""
#Run Filtering step
- samtools calmd -b $bam $fasta | samtools view -h - | pmdtools --threshold ${params.pmdtools_threshold} $treatment $snpcap --header | samtools view -@ ${task.cpus} -Sb - > "${libraryid}".pmd.bam
+ samtools calmd -b ${bam} ${fasta} | samtools view -h - | pmdtools --threshold ${params.pmdtools_threshold} ${treatment} ${snpcap} --header | samtools view -@ ${task.cpus} -Sb - > "${libraryid}".pmd.bam
#Run Calc Range step
## To allow early shut off of pipe: https://github.com/nextflow-io/nextflow/issues/1564
trap 'if [[ \$? == 141 ]]; then echo "Shutting samtools early due to -n parameter" && samtools index ${libraryid}.pmd.bam ${size}; exit 0; fi' EXIT
- samtools calmd -b $bam $fasta | samtools view -h - | pmdtools --deamination --range ${params.pmdtools_range} $treatment $snpcap -n ${params.pmdtools_max_reads} > "${libraryid}".cpg.range."${params.pmdtools_range}".txt
+ samtools calmd -b ${bam} ${fasta} | samtools view -h - | pmdtools --deamination ${platypus} --range ${params.pmdtools_range} ${treatment} ${snpcap} -n ${params.pmdtools_max_reads} > "${libraryid}".cpg.range."${params.pmdtools_range}".txt
echo "Running indexing"
samtools index ${libraryid}.pmd.bam ${size}
@@ -2219,7 +2148,7 @@ process additional_library_merge {
def size = params.large_ref ? '-c' : ''
"""
samtools merge ${samplename}_libmerged_add.bam ${bam}
- picard AddOrReplaceReadGroups I=${samplename}_libmerged_add.bam O=${samplename}_libmerged_rg_add.bam RGID=1 RGLB="${samplename}_additionalmerged" RGPL=illumina RGPU=4410 RGSM="${samplename}_additionalmerged" VALIDATION_STRINGENCY=LENIENT
+ picard -Xmx${task.memory.toGiga()}g AddOrReplaceReadGroups I=${samplename}_libmerged_add.bam O=${samplename}_libmerged_rg_add.bam RGID=1 RGLB="${samplename}_additionalmerged" RGPL=illumina RGPU=4410 RGSM="${samplename}_additionalmerged" VALIDATION_STRINGENCY=LENIENT
samtools index ${samplename}_libmerged_rg_add.bam ${size}
"""
}
@@ -2557,7 +2486,7 @@ process vcf2genome {
def fasta_head = "${params.vcf2genome_header}" == '' ? "${samplename}" : "${params.vcf2genome_header}"
"""
pigz -f -d -p ${task.cpus} *.vcf.gz
- vcf2genome -draft ${out}.fasta -draftname "${fasta_head}" -in ${vcf.baseName} -minc ${params.vcf2genome_minc} -minfreq ${params.vcf2genome_minfreq} -minq ${params.vcf2genome_minq} -ref ${fasta} -refMod ${out}_refmod.fasta -uncertain ${out}_uncertainy.fasta
+ vcf2genome -Xmx${task.memory.toGiga()}g -draft ${out}.fasta -draftname "${fasta_head}" -in ${vcf.baseName} -minc ${params.vcf2genome_minc} -minfreq ${params.vcf2genome_minfreq} -minq ${params.vcf2genome_minq} -ref ${fasta} -refMod ${out}_refmod.fasta -uncertain ${out}_uncertainy.fasta
pigz -p ${task.cpus} *.fasta
pigz -p ${task.cpus} *.vcf
"""
@@ -2566,10 +2495,10 @@ process vcf2genome {
// More complex consensus caller with additional filtering functionality (e.g. for heterozygous calls) to generate SNP tables and other things sometimes used in aDNA bacteria studies
// Create input channel for MultiVCFAnalyzer, possibly mixing with pre-made VCFs.
-if (params.additional_vcf_files == '') {
- ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it[7] }.collect()
+if (!params.additional_vcf_files) {
+ ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it[-1] }.collect()
} else {
- ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it [7] }.collect().mix(ch_extravcfs_for_multivcfanalyzer)
+ ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it [-1] }.collect().mix(ch_extravcfs_for_multivcfanalyzer)
}
process multivcfanalyzer {
@@ -2577,11 +2506,11 @@ process multivcfanalyzer {
publishDir "${params.outdir}/multivcfanalyzer", mode: params.publish_dir_mode
when:
- params.genotyping_tool == 'ug' && params.run_multivcfanalyzer && params.gatk_ploidy == '2'
+ params.genotyping_tool == 'ug' && params.run_multivcfanalyzer && params.gatk_ploidy.toString() == '2'
input:
- file vcf from ch_vcfs_for_multivcfanalyzer.collect()
- file fasta from ch_fasta_for_multivcfanalyzer.collect()
+ file vcf from ch_vcfs_for_multivcfanalyzer
+ file fasta from ch_fasta_for_multivcfanalyzer
output:
file('fullAlignment.fasta.gz')
@@ -2600,7 +2529,7 @@ process multivcfanalyzer {
def write_freqs = params.write_allele_frequencies ? "T" : "F"
"""
gunzip -f *.vcf.gz
- multivcfanalyzer ${params.snp_eff_results} ${fasta} ${params.reference_gff_annotations} . ${write_freqs} ${params.min_genotype_quality} ${params.min_base_coverage} ${params.min_allele_freq_hom} ${params.min_allele_freq_het} ${params.reference_gff_exclude} *.vcf
+ multivcfanalyzer -Xmx${task.memory.toGiga()}g ${params.snp_eff_results} ${fasta} ${params.reference_gff_annotations} . ${write_freqs} ${params.min_genotype_quality} ${params.min_base_coverage} ${params.min_allele_freq_hom} ${params.min_allele_freq_het} ${params.reference_gff_exclude} *.vcf
pigz -p ${task.cpus} *.tsv *.txt snpAlignment.fasta snpAlignmentIncludingRefGenome.fasta fullAlignment.fasta
"""
}
@@ -2627,7 +2556,7 @@ process multivcfanalyzer {
script:
"""
- mtnucratio ${bam} "${params.mtnucratio_header}"
+ mtnucratio -Xmx${task.memory.toGiga()}g ${bam} "${params.mtnucratio_header}"
"""
}
@@ -2986,7 +2915,9 @@ process output_documentation {
"""
}
-// Collect all software versions for inclusion in MultiQC report
+/*
+ * Parse software version numbers
+ */
process get_software_versions {
label 'sc_tiny'
@@ -3043,8 +2974,9 @@ process get_software_versions {
}
// MultiQC file generation for pipeline report
-def workflow_summary = NfcoreSchema.params_summary_multiqc(workflow, summary_params)
-ch_workflow_summary = Channel.value(workflow_summary)
+//def workflow_summary = NfcoreSchema.params_summary_multiqc(workflow, summary_params)
+
+//ch_workflow_summary = Channel.value(workflow_summary)
process multiqc {
label 'sc_medium'
@@ -3101,17 +3033,126 @@ process multiqc {
// Send completion emails if requested, so user knows data is ready
workflow.onComplete {
- Completion.email(workflow, params, summary_params, projectDir, log, multiqc_report)
- Completion.summary(workflow, params, log, fail_percent_mapped, pass_percent_mapped)
+
+ // Set up the e-mail variables
+ def subject = "[nf-core/eager] Successful: $workflow.runName"
+ if (!workflow.success) {
+ subject = "[nf-core/eager] FAILED: $workflow.runName"
+ }
+ def email_fields = [:]
+ email_fields['version'] = workflow.manifest.version
+ email_fields['runName'] = workflow.runName
+ email_fields['success'] = workflow.success
+ email_fields['dateComplete'] = workflow.complete
+ email_fields['duration'] = workflow.duration
+ email_fields['exitStatus'] = workflow.exitStatus
+ email_fields['errorMessage'] = (workflow.errorMessage ?: 'None')
+ email_fields['errorReport'] = (workflow.errorReport ?: 'None')
+ email_fields['commandLine'] = workflow.commandLine
+ email_fields['projectDir'] = workflow.projectDir
+ email_fields['summary'] = summary
+ email_fields['summary']['Date Started'] = workflow.start
+ email_fields['summary']['Date Completed'] = workflow.complete
+ email_fields['summary']['Pipeline script file path'] = workflow.scriptFile
+ email_fields['summary']['Pipeline script hash ID'] = workflow.scriptId
+ if (workflow.repository) email_fields['summary']['Pipeline repository Git URL'] = workflow.repository
+ if (workflow.commitId) email_fields['summary']['Pipeline repository Git Commit'] = workflow.commitId
+ if (workflow.revision) email_fields['summary']['Pipeline Git branch/tag'] = workflow.revision
+ email_fields['summary']['Nextflow Version'] = workflow.nextflow.version
+ email_fields['summary']['Nextflow Build'] = workflow.nextflow.build
+ email_fields['summary']['Nextflow Compile Timestamp'] = workflow.nextflow.timestamp
+
+ // On success, try to attach the MultiQC report
+ def mqc_report = null
+ try {
+ if (workflow.success) {
+ mqc_report = ch_multiqc_report.getVal()
+ if (mqc_report.getClass() == ArrayList) {
+ log.warn "[nf-core/eager] Found multiple reports from process 'multiqc', will use only one"
+ mqc_report = mqc_report[0]
+ }
+ }
+ } catch (all) {
+ log.warn "[nf-core/eager] Could not attach MultiQC report to summary email"
+ }
+
+ // Check if we are only sending emails on failure
+ email_address = params.email
+ if (!params.email && params.email_on_fail && !workflow.success) {
+ email_address = params.email_on_fail
+ }
+
+ // Render the TXT template
+ def engine = new groovy.text.GStringTemplateEngine()
+ def tf = new File("$projectDir/assets/email_template.txt")
+ def txt_template = engine.createTemplate(tf).make(email_fields)
+ def email_txt = txt_template.toString()
+
+ // Render the HTML template
+ def hf = new File("$projectDir/assets/email_template.html")
+ def html_template = engine.createTemplate(hf).make(email_fields)
+ def email_html = html_template.toString()
+
+ // Render the sendmail template
+ def smail_fields = [ email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, projectDir: "$projectDir", mqcFile: mqc_report, mqcMaxSize: params.max_multiqc_email_size.toBytes() ]
+ def sf = new File("$projectDir/assets/sendmail_template.txt")
+ def sendmail_template = engine.createTemplate(sf).make(smail_fields)
+ def sendmail_html = sendmail_template.toString()
+
+ // Send the HTML e-mail
+ if (email_address) {
+ try {
+ if (params.plaintext_email) { throw new RuntimeException('Send plaintext e-mail, not HTML') }
+ // Try to send HTML e-mail using sendmail
+ [ 'sendmail', '-t' ].execute() << sendmail_html
+ log.info "[nf-core/eager] Sent summary e-mail to $email_address (sendmail)"
+ } catch (all) {
+ // Catch failures and try with plaintext
+ def mail_cmd = [ 'mail', '-s', subject, '--content-type=text/html', email_address ]
+ if ( mqc_report != null && mqc_report.size() <= params.max_multiqc_email_size.toBytes() ) {
+ mail_cmd += [ '-A', mqc_report ]
+ }
+ mail_cmd.execute() << email_html
+ log.info "[nf-core/eager] Sent summary e-mail to $email_address (mail)"
+ }
+ }
+
+ // Write summary e-mail HTML to a file
+ def output_d = new File("${params.outdir}/pipeline_info/")
+ if (!output_d.exists()) {
+ output_d.mkdirs()
+ }
+ def output_hf = new File(output_d, "pipeline_report.html")
+ output_hf.withWriter { w -> w << email_html }
+ def output_tf = new File(output_d, "pipeline_report.txt")
+ output_tf.withWriter { w -> w << email_txt }
+
+ c_green = params.monochrome_logs ? '' : "\033[0;32m";
+ c_purple = params.monochrome_logs ? '' : "\033[0;35m";
+ c_red = params.monochrome_logs ? '' : "\033[0;31m";
+ c_reset = params.monochrome_logs ? '' : "\033[0m";
+
+ if (workflow.stats.ignoredCount > 0 && workflow.success) {
+ log.info "-${c_purple}Warning, pipeline completed, but with errored process(es) ${c_reset}-"
+ log.info "-${c_red}Number of ignored errored process(es) : ${workflow.stats.ignoredCount} ${c_reset}-"
+ log.info "-${c_green}Number of successfully run process(es) : ${workflow.stats.succeedCount} ${c_reset}-"
+ }
+
+ if (workflow.success) {
+ log.info "-${c_purple}[nf-core/eager]${c_green} Pipeline completed successfully${c_reset}-"
+ } else {
+ checkHostname()
+ log.info "-${c_purple}[nf-core/eager]${c_red} Pipeline completed with errors${c_reset}-"
+ }
+
}
workflow.onError {
- // Print unexpected parameters
- for (p in unexpectedParams) {
- log.warn "Unexpected parameter: ${p}"
- }
+ // Print unexpected parameters - easiest is to just rerun validation
+ NfcoreSchema.validateParameters(params, json_schema, log)
}
+
/////////////////////////////////////
/* -- AUXILARY FUNCTIONS -- */
/////////////////////////////////////
@@ -3279,3 +3320,24 @@ ch_reads_for_faketsv
def validate_size(collection, size){
if ( collection.size() != size ) { return false } else { return true }
}
+
+def checkHostname() {
+ def c_reset = params.monochrome_logs ? '' : "\033[0m"
+ def c_white = params.monochrome_logs ? '' : "\033[0;37m"
+ def c_red = params.monochrome_logs ? '' : "\033[1;91m"
+ def c_yellow_bold = params.monochrome_logs ? '' : "\033[1;93m"
+ if (params.hostnames) {
+ def hostname = 'hostname'.execute().text.trim()
+ params.hostnames.each { prof, hnames ->
+ hnames.each { hname ->
+ if (hostname.contains(hname) && !workflow.profile.contains(prof)) {
+ log.error '====================================================\n' +
+ " ${c_red}WARNING!${c_reset} You are running with `-profile $workflow.profile`\n" +
+ " but your machine hostname is ${c_white}'$hostname'${c_reset}\n" +
+ " ${c_yellow_bold}It's highly recommended that you use `-profile $prof`${c_reset}\n" +
+ '===================================================='
+ }
+ }
+ }
+ }
+}
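A recurring change in the main.nf hunks above is forwarding the task's scheduled memory to each Java-based tool via `-Xmx${task.memory.toGiga()}g`, so the JVM heap is capped by what the executor actually granted rather than the JVM's own default. A minimal sketch of the pattern — the process name, label, and tool name here are illustrative, not taken from the pipeline:

```nextflow
// Illustrative only: forwarding task.memory to a Java wrapper tool so
// its heap never exceeds the allocation requested for the task.
process example_java_tool {
    label 'mc_small'   // hypothetical resource label from conf/base.config

    input:
    path bam

    script:
    """
    # task.memory.toGiga() converts the MemoryUnit (e.g. 8.GB) to 8
    some_java_tool -Xmx${task.memory.toGiga()}g -i ${bam}
    """
}
```

Without the explicit `-Xmx`, a tool like Picard may try to claim the JVM default heap and be killed by the scheduler despite the process requesting enough memory.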
diff --git a/nextflow.config b/nextflow.config
index 5a87732e0..2ac079dad 100644
--- a/nextflow.config
+++ b/nextflow.config
@@ -9,6 +9,8 @@ params {
// Workflow flags
genome = false
+ input = null
+ input_paths = null
single_end = false
outdir = './results'
publish_dir_mode = 'copy'
@@ -22,11 +24,10 @@ params {
//Pipeline options
enable_conda = false
validate_params = true
- schema_ignore_params = 'genomes'
+ schema_ignore_params = 'genomes'
show_hidden_params = false
//Input reads
- input = null
udg_type = 'none'
single_stranded = false
single_end = false
@@ -45,6 +46,10 @@ params {
seq_dict = ''
large_ref = false
save_reference = false
+
+ // Set a default here only to suppress the iGenomes WARN (false by default); overwritten by the optional iGenomes config load below.
+ genomes = false
+
//Skipping parts of the pipeline for impatient users
skip_fastqc = false
@@ -113,6 +118,7 @@ params {
pmdtools_threshold = 3
pmdtools_reference_mask = ''
pmdtools_max_reads = 10000
+ pmdtools_platypus = false
// mapDamage
run_mapdamage_rescaling = false
@@ -244,6 +250,9 @@ params {
config_profile_description = false
config_profile_contact = false
config_profile_url = false
+ validate_params = true
+ show_hidden_params = false
+ schema_ignore_params = 'genomes,input_paths'
// Defaults only, expecting to be overwritten
max_memory = 128.GB
@@ -254,7 +263,7 @@ params {
// Container slug. Stable releases should specify release tag!
// Developmental code should specify :dev
-process.container = 'nfcore/eager:2.3.2'
+process.container = 'nfcore/eager:2.3.3'
// Load base.config by default for all pipelines
includeConfig 'conf/base.config'
@@ -274,13 +283,21 @@ try {
}
profiles {
- conda {
+ conda {
+ docker.enabled = false
+ singularity.enabled = false
+ podman.enabled = false
+ shifter.enabled = false
+ charliecloud.enabled = false
process.conda = "$projectDir/environment.yml"
- params.enable_conda = true
}
debug { process.beforeScript = 'echo $HOSTNAME' }
docker {
docker.enabled = true
+ singularity.enabled = false
+ podman.enabled = false
+ shifter.enabled = false
+ charliecloud.enabled = false
// Avoid this error:
// WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
// Testing this in nf-core after discussion here https://github.com/nf-core/tools/pull/351
@@ -288,11 +305,33 @@ profiles {
docker.runOptions = '-u \$(id -u):\$(id -g)'
}
singularity {
+ docker.enabled = false
singularity.enabled = true
+ podman.enabled = false
+ shifter.enabled = false
+ charliecloud.enabled = false
singularity.autoMounts = true
}
podman {
+ singularity.enabled = false
+ docker.enabled = false
podman.enabled = true
+ shifter.enabled = false
+ charliecloud.enabled = false
+ }
+ shifter {
+ singularity.enabled = false
+ docker.enabled = false
+ podman.enabled = false
+ shifter.enabled = true
+ charliecloud.enabled = false
+ }
+ charliecloud {
+ singularity.enabled = false
+ docker.enabled = false
+ podman.enabled = false
+ shifter.enabled = false
+ charliecloud.enabled = true
}
test { includeConfig 'conf/test.config' }
test_full { includeConfig 'conf/test_full.config' }
@@ -312,6 +351,8 @@ profiles {
benchmarking_human { includeConfig 'conf/benchmarking_human.config' }
benchmarking_vikingfish { includeConfig 'conf/benchmarking_vikingfish.config' }
}
+
+
// Load igenomes.config if required
if (!params.igenomes_ignore) {
includeConfig 'conf/igenomes.config'
@@ -351,7 +392,7 @@ manifest {
description = 'A fully reproducible and state-of-the-art ancient DNA analysis pipeline'
mainScript = 'main.nf'
nextflowVersion = '!>=20.07.1'
- version = '2.3.2'
+ version = '2.3.3'
}
// Function to ensure that resource requirements don't go beyond
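The expanded profile blocks in this nextflow.config hunk all follow one rule: each container-engine profile explicitly disables every other engine. Profiles are additive when combined on the command line or layered with institutional configs, so without the explicit `false` settings a run such as `-profile singularity` on top of a config that enables Docker could leave two engines active at once. A sketch of the idea, trimmed to two engines for brevity:

```nextflow
// Illustrative two-engine excerpt: each profile turns the other engine
// off so layered or combined configs cannot leave both enabled.
profiles {
    docker {
        docker.enabled      = true
        singularity.enabled = false
    }
    singularity {
        docker.enabled         = false
        singularity.enabled    = true
        singularity.autoMounts = true
    }
}
```

With this layout, `nextflow run . -profile test,docker` resolves to exactly one engine regardless of profile order.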
diff --git a/nextflow_schema.json b/nextflow_schema.json
index 292a5fdd7..0e7a9e623 100644
--- a/nextflow_schema.json
+++ b/nextflow_schema.json
@@ -195,6 +195,13 @@
"hidden": true,
"fa_icon": "fas fa-question-circle"
},
+ "validate_params": {
+ "type": "boolean",
+ "description": "Whether to validate parameters against the schema at runtime.",
+ "default": true,
+ "fa_icon": "fas fa-check-square",
+ "hidden": true
+ },
"email": {
"type": "string",
"description": "Email address for completion summary.",
@@ -257,25 +264,12 @@
"hidden": true,
"description": "Parameter used for checking conda channels to be set correctly."
},
- "validate_params": {
- "type": "boolean",
- "default": "true",
- "description": "Boolean whether to validate parameters against the schema at runtime",
- "fa_icon": "fab fa-angellist",
- "hidden": true
- },
"schema_ignore_params": {
"type": "string",
"fa_icon": "fas fa-not-equal",
"description": "String to specify ignored parameters for parameter validation",
"hidden": true,
"default": "genomes"
- },
- "config_profile_name": {
- "type": "string",
- "description": "String to describe the config profile that is run.",
- "fa_icon": "fas fa-id-badge",
- "hidden": true
}
},
"fa_icon": "fas fa-file-import",
@@ -302,6 +296,7 @@
"description": "Maximum amount of memory that can be requested for any single job.",
"default": "128.GB",
"fa_icon": "fas fa-memory",
+ "pattern": "^[\\d\\.]+\\s*\\.?(K|M|G|T)?B$",
"hidden": true,
"help_text": "Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. `--max_memory '8.GB'`"
},
@@ -310,6 +305,7 @@
"description": "Maximum amount of time that can be requested for any single job.",
"default": "240.h",
"fa_icon": "far fa-clock",
+ "pattern": "^(\\d+(\\.\\d+)?(?:\\s*|\\.?)(s|m|h|d)\\s*)+$",
"hidden": true,
"help_text": "Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. `--max_time '2.h'`"
}
@@ -344,6 +340,12 @@
"hidden": true,
"fa_icon": "fas fa-users-cog"
},
+ "config_profile_name": {
+ "type": "string",
+ "description": "Institutional config name.",
+ "hidden": true,
+ "fa_icon": "fas fa-users-cog"
+ },
"config_profile_description": {
"type": "string",
"description": "Institutional config description.",
@@ -607,7 +609,6 @@
},
"bt2n": {
"type": "integer",
- "default": 0,
"description": "Specify the -N parameter for bowtie2 (mismatches in seed). This will override defaults from alignmode/sensitivity.",
"fa_icon": "fas fa-sort-numeric-down",
"help_text": "The number of mismatches allowed in the seed during seed-and-extend procedure of Bowtie2. This will override any values set with `--bt2_sensitivity`. Can either be 0 or 1. Default: 0 (i.e. use`--bt2_sensitivity` defaults).\n\n> Modifies Bowtie2 parameters: `-N`",
@@ -618,21 +619,18 @@
},
"bt2l": {
"type": "integer",
- "default": 0,
"description": "Specify the -L parameter for bowtie2 (length of seed substrings). This will override defaults from alignmode/sensitivity.",
"fa_icon": "fas fa-ruler-horizontal",
"help_text": "The length of the seed sub-string to use during seeding. This will override any values set with `--bt2_sensitivity`. Default: 0 (i.e. use`--bt2_sensitivity` defaults: [20 for local and 22 for end-to-end](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line).\n\n> Modifies Bowtie2 parameters: `-L`"
},
"bt2_trim5": {
"type": "integer",
- "default": 0,
"description": "Specify number of bases to trim off from 5' (left) end of read before alignment.",
"fa_icon": "fas fa-cut",
"help_text": "Number of bases to trim at the 5' (left) end of read prior alignment. Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0\n\n> Modifies Bowtie2 parameters: `-bt2_trim5`"
},
"bt2_trim3": {
"type": "integer",
- "default": 0,
"description": "Specify number of bases to trim off from 3' (right) end of read before alignment.",
"fa_icon": "fas fa-cut",
"help_text": "Number of bases to trim at the 3' (right) end of read prior alignment. Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0.\n\n> Modifies Bowtie2 parameters: `-bt2_trim3`"
@@ -683,14 +681,12 @@
},
"bam_mapping_quality_threshold": {
"type": "integer",
- "default": 0,
"description": "Minimum mapping quality for reads filter.",
"fa_icon": "fas fa-greater-than-equal",
"help_text": "Specify a mapping quality threshold for mapped reads to be kept for downstream analysis. By default keeps all reads and is therefore set to `0` (basically doesn't filter anything).\n\n> Modifies samtools view parameter: `-q`"
},
"bam_filter_minreadlength": {
"type": "integer",
- "default": 0,
"fa_icon": "fas fa-ruler-horizontal",
"description": "Specify minimum read length to be kept after mapping.",
"help_text": "Specify minimum length of mapped reads. This filtering will apply at the same time as mapping quality filtering.\n\nIf used _instead_ of minimum length read filtering at AdapterRemoval, this can be useful to get more realistic endogenous DNA percentages, when most of your reads are very short (e.g. in single-stranded libraries) and would otherwise be discarded by AdapterRemoval (thus making an artificially small denominator for a typical endogenous DNA calculation). Note in this context you should not perform mapping quality filtering nor discarding of unmapped reads to ensure a correct denominator of all reads, for the endogenous DNA calculation.\n\n> Modifies filter_bam_fragment_length.py parameter: `-l`"
@@ -817,6 +813,12 @@
"fa_icon": "fas fa-greater-than-equal",
"help_text": "The maximum number of reads used for damage assessment in PMDtools. Can be used to significantly reduce the amount of time required for damage assessment in PMDTools. Note that a too low value can also obtain incorrect results.\n\n> Modifies PMDTools parameter: `-n`"
},
+ "pmdtools_platypus": {
+ "type": "boolean",
+ "description": "Append the extended platypus base-frequency table to the PMDtools output.",
+ "fa_icon": "fas fa-power-off",
+ "help_text": "Print the wider table of base frequencies used by platypus in addition to the standard base misincorporation frequency table. Turned off by default.\n"
+ },
"run_mapdamage_rescaling": {
"type": "boolean",
"fa_icon": "fas fa-map",
@@ -1049,7 +1051,6 @@
},
"freebayes_g": {
"type": "integer",
- "default": 0,
"description": "Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than specified in --freebayes_C.",
"fa_icon": "fab fa-think-peaks",
"help_text": "Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than specified C. Not set by default.\n\n> Modifies freebayes parameter: `-g`"
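The schema hunks add regex `pattern` constraints for `max_memory` and `max_time`. The duration pattern can be sanity-checked outside the pipeline; the snippet below copies the `max_time` pattern verbatim from the diff and exercises it with a few Nextflow-style values (this standalone check is illustrative, not part of the pipeline):

```python
import re

# The `max_time` pattern added in nextflow_schema.json, copied verbatim.
MAX_TIME_PATTERN = r"^(\d+(\.\d+)?(?:\s*|\.?)(s|m|h|d)\s*)+$"

def is_valid_duration(value: str) -> bool:
    """Return True if value matches the schema's duration pattern."""
    return re.match(MAX_TIME_PATTERN, value) is not None

# Accepted: Nextflow-style durations such as the `240.h` default.
for ok in ["240.h", "2.h", "8h", "1d 12h", "30m"]:
    assert is_valid_duration(ok), ok

# Rejected: missing or unknown units.
for bad in ["240", "2 weeks", "h"]:
    assert not is_valid_duration(bad), bad
```

Note the `(?:\s*|\.?)` group is what lets both `240.h` and `1d 12h` pass: the dot or whitespace between the number and the unit is optional either way.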