diff --git a/.github/.dockstore.yml b/.github/.dockstore.yml index 030138a0c..191fabd22 100644 --- a/.github/.dockstore.yml +++ b/.github/.dockstore.yml @@ -3,3 +3,4 @@ version: 1.2 workflows: - subclass: nfl primaryDescriptorPath: /nextflow.config + publish: True diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index 97f223d86..3e4a4cfa2 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -69,7 +69,7 @@ If you wish to contribute a new step, please use the following coding standards: 2. Write the process block (see below). 3. Define the output channel if needed (see below). 4. Add any new flags/options to `nextflow.config` with a default (see below). -5. Add any new flags/options to `nextflow_schema.json` **with help text** (with `nf-core schema build .`) +5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build .`). 6. Add any new flags/options to the help message (for integer/text parameters, print to help the corresponding `nextflow.config` parameter). 7. Add sanity checks for all relevant parameters. 8. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`. @@ -87,7 +87,7 @@ Once there, use `nf-core schema build .` to add to `nextflow_schema.json`. ### Default processes resource requirements -Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. 
A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/%7B%7Bcookiecutter.name_noslash%7D%7D/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels. +Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generically with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. An nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single-core process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels. :warning: Note that in nf-core/eager we currently have our own custom process labels, so please check `base.config`!
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md index f00ef2e57..b461caca3 100644 --- a/.github/ISSUE_TEMPLATE/bug_report.md +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -57,7 +57,7 @@ Have you provided the following extra information/files: ## Container engine -- Engine: +- Engine: - version: - Image tag: diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md index eadff09eb..c7ca5c253 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -1,6 +1,6 @@ --- name: Feature request -about: Suggest an idea for the nf-core website +about: Suggest an idea for the nf-core/eager pipeline labels: enhancement --- diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 57a13ac3e..4d46a3ac7 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -15,9 +15,9 @@ Learn more about contributing: [CONTRIBUTING.md](https://github.com/nf-core/eage - [ ] This comment contains a description of changes (with reason). - [ ] If you've fixed a bug or added code that should be tested, add tests! - - [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py` - - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md) - - [ ] If necessary, also make a PR on the nf-core/eager _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository. 
+ - [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py` + - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md) + - [ ] If necessary, also make a PR on the nf-core/eager _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository. - [ ] Make sure your code lints (`nf-core lint .`). - [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`). - [ ] Usage Documentation in `docs/usage.md` is updated. diff --git a/.github/workflows/awsfulltest.yml b/.github/workflows/awsfulltest.yml index 51475927c..4e03e75be 100644 --- a/.github/workflows/awsfulltest.yml +++ b/.github/workflows/awsfulltest.yml @@ -9,6 +9,16 @@ on: types: [completed] workflow_dispatch: + +env: + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} + TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }} + AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }} + AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }} + AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }} + + jobs: run-awstest: name: Run AWS full tests @@ -26,13 +36,6 @@ jobs: # Add full size test data (but still relatively small datasets for few samples) # on the `test_full.config` test runs with only one set of parameters # Then specify `-profile test_full` instead of `-profile test` on the AWS batch command - env: - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} - TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }} - AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }} - AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }} - AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }} run: | aws batch submit-job \ --region eu-west-1 \ diff --git a/.github/workflows/awstest.yml b/.github/workflows/awstest.yml index 
7ffc9c417..6e0a9538c 100644 --- a/.github/workflows/awstest.yml +++ b/.github/workflows/awstest.yml @@ -6,6 +6,16 @@ name: nf-core AWS test on: workflow_dispatch: + +env: + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} + TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }} + AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }} + AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }} + AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }} + + jobs: run-awstest: name: Run AWS tests @@ -22,13 +32,6 @@ jobs: - name: Start AWS batch job # For example: adding multiple test runs with different parameters # Remember that you can parallelise this by using strategy.matrix - env: - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} - TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }} - AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }} - AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }} - AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }} run: | aws batch submit-job \ --region eu-west-1 \ diff --git a/.github/workflows/branch.yml b/.github/workflows/branch.yml index a08150144..909b52d6b 100644 --- a/.github/workflows/branch.yml +++ b/.github/workflows/branch.yml @@ -13,7 +13,7 @@ jobs: - name: Check PRs if: github.repository == 'nf-core/eager' run: | - { [[ ${{github.event.pull_request.head.repo.full_name}} == nf-core/eager ]] && [[ $GITHUB_HEAD_REF = "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]] + { [[ ${{github.event.pull_request.head.repo.full_name }} == nf-core/eager ]] && [[ $GITHUB_HEAD_REF = "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]] # If the above check failed, post a comment on the PR explaining the failure @@ -23,13 +23,22 @@ jobs: uses: mshick/add-pr-comment@v1 with: message: | + ## This PR is against the `master` branch :x: + + * Do not close this PR + * Click _Edit_ and change the `base` to `dev` + * This CI test will remain failed until you push a new 
commit + + --- + Hi @${{ github.event.pull_request.user.login }}, - It looks like this pull-request is has been made against the ${{github.event.pull_request.head.repo.full_name}} `master` branch. + It looks like this pull-request has been made against the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `master` branch. The `master` branch on nf-core repositories should always contain code from the latest release. - Because of this, PRs to `master` are only allowed if they come from the ${{github.event.pull_request.head.repo.full_name}} `dev` branch. + Because of this, PRs to `master` are only allowed if they come from the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `dev` branch. You do not need to close this PR, you can change the target branch to `dev` by clicking the _"Edit"_ button at the top of this page. + Note that even after this, the test will continue to show as failing until you push a new commit. Thanks again for your contribution! repo-token: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 946c9caa1..a8bfa6ba1 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -20,7 +20,7 @@ jobs: strategy: matrix: # Nextflow versions: check pipeline minimum and current latest - nxf_ver: ['20.07.1', ''] + nxf_ver: ['20.07.1', '21.03.0-edge'] steps: - name: Check out pipeline code uses: actions/checkout@v2 @@ -34,13 +34,13 @@ jobs: - name: Build new docker image if: env.MATCHED_FILES - run: docker build --no-cache . -t nfcore/eager:2.3.2 + run: docker build --no-cache .
-t nfcore/eager:2.3.3 - name: Pull docker image if: ${{ !env.MATCHED_FILES }} run: | docker pull nfcore/eager:dev - docker tag nfcore/eager:dev nfcore/eager:2.3.2 + docker tag nfcore/eager:dev nfcore/eager:2.3.3 - name: Install Nextflow env: @@ -125,7 +125,7 @@ jobs: nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_bedtools_coverage --anno_file 'https://github.com/nf-core/test-datasets/raw/eager/reference/Mammoth/Mammoth_MT_Krause.gff3' - name: GENOTYPING_HC Test running GATK HaplotypeCaller run: | - nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_fna,docker --run_genotyping --genotyping_tool 'hc' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_hc_emitrefconf 'BP_RESOLUTION' + nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_fna,docker --run_genotyping --genotyping_tool 'hc' --gatk_hc_out_mode 'EMIT_ALL_ACTIVE_SITES' --gatk_hc_emitrefconf 'BP_RESOLUTION' - name: GENOTYPING_FB Test running FreeBayes run: | nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --genotyping_tool 'freebayes' @@ -146,13 +146,13 @@ jobs: nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_pmdtools - name: GENOTYPING_UG AND MULTIVCFANALYZER Test running GATK UnifiedGenotyper and MultiVCFAnalyzer, additional VCFS run: | - nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer --additional_vcf_files 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/vcf/JK2772_CATCAGTGAGTAGA_L008_R1_001.fastq.gz.tengrand.fq.combined.fq.mapped_rmdup.bam.unifiedgenotyper.vcf.gz' --write_allele_frequencies + nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer --additional_vcf_files 
'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/vcf/JK2772_CATCAGTGAGTAGA_L008_R1_001.fastq.gz.tengrand.fq.combined.fq.mapped_rmdup.bam.unifiedgenotyper.vcf.gz' --write_allele_frequencies - name: COMPLEX LANE/LIBRARY MERGING Test running lane and library merging prior to GATK UnifiedGenotyper and running MultiVCFAnalyzer run: | - nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --run_genotyping --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer + nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_complex,docker --run_genotyping --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' --run_multivcfanalyzer - name: GENOTYPING_UG ON TRIMMED BAM Test run: | - nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --run_trim_bam --genotyping_source 'trimmed' --genotyping_tool 'ug' --gatk_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' + nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_genotyping --run_trim_bam --genotyping_source 'trimmed' --genotyping_tool 'ug' --gatk_ug_out_mode 'EMIT_ALL_SITES' --gatk_ug_genotype_model 'SNP' - name: BAM_INPUT Run the basic pipeline with the bam input profile, skip AdapterRemoval as no convertBam run: | nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_bam,docker --skip_adapterremoval diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml index d99d4d751..fcde400ce 100644 --- a/.github/workflows/linting.yml +++ b/.github/workflows/linting.yml @@ -19,6 +19,34 @@ jobs: run: npm install -g markdownlint-cli - name: Run Markdownlint run: markdownlint ${GITHUB_WORKSPACE} -c ${GITHUB_WORKSPACE}/.github/markdownlint.yml + + # If the above check failed, post a comment on the PR explaining the failure + - name: Post PR comment + if: failure() + uses: mshick/add-pr-comment@v1 + with: + message: | + ## Markdown linting is failing + + To keep 
the code consistent with lots of contributors, we run automated code consistency checks. + To fix this CI test, please run: + + * Install `markdownlint-cli` + * On Mac: `brew install markdownlint-cli` + * Everything else: [Install `npm`](https://www.npmjs.com/get-npm) then [install `markdownlint-cli`](https://www.npmjs.com/package/markdownlint-cli) (`npm install -g markdownlint-cli`) + * Fix the markdown errors + * Automatically: `markdownlint . --config .github/markdownlint.yml --fix` + * Manually resolve anything left from `markdownlint . --config .github/markdownlint.yml` + + Once you push these changes the test should pass, and you can hide this comment :+1: + + We highly recommend setting up markdownlint in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help! + + Thanks again for your contribution! + repo-token: ${{ secrets.GITHUB_TOKEN }} + allow-repeats: false + + YAML: runs-on: ubuntu-latest steps: @@ -29,7 +57,34 @@ jobs: - name: Install yaml-lint run: npm install -g yaml-lint - name: Run yaml-lint - run: yamllint $(find ${GITHUB_WORKSPACE} -type f -name "*.yml") + run: yamllint $(find ${GITHUB_WORKSPACE} -type f -name "*.yml" -o -name "*.yaml") + + # If the above check failed, post a comment on the PR explaining the failure + - name: Post PR comment + if: failure() + uses: mshick/add-pr-comment@v1 + with: + message: | + ## YAML linting is failing + + To keep the code consistent with lots of contributors, we run automated code consistency checks. + To fix this CI test, please run: + + * Install `yaml-lint` + * [Install `npm`](https://www.npmjs.com/get-npm) then [install `yaml-lint`](https://www.npmjs.com/package/yaml-lint) (`npm install -g yaml-lint`) + * Fix the YAML errors + * Run the test locally: `yamllint $(find .
-type f -name "*.yml" -o -name "*.yaml")` + * Fix any reported errors in your YAML files + + Once you push these changes the test should pass, and you can hide this comment :+1: + + We highly recommend setting up yaml-lint in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help! + + Thanks again for your contribution! + repo-token: ${{ secrets.GITHUB_TOKEN }} + allow-repeats: false + + nf-core: runs-on: ubuntu-latest steps: @@ -48,6 +103,7 @@ jobs: with: python-version: '3.6' architecture: 'x64' + - name: Install dependencies run: | python -m pip install --upgrade pip @@ -68,7 +124,7 @@ jobs: if: ${{ always() }} uses: actions/upload-artifact@v2 with: - name: linting-log-file + name: linting-logs path: | lint_log.txt lint_results.md diff --git a/.nf-core-lint.yml b/.nf-core-lint.yml new file mode 100644 index 000000000..496fea360 --- /dev/null +++ b/.nf-core-lint.yml @@ -0,0 +1,6 @@ +files_unchanged: + - assets/multiqc_config.yaml + - .github/CONTRIBUTING.md + - .github/ISSUE_TEMPLATE/bug_report.md + - docs/README.md + diff --git a/CHANGELOG.md b/CHANGELOG.md index 013927858..e289edece 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,25 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html). +## v2.3.3 - 2021-01-06 + +### `Added` + +- [#349](https://github.com/nf-core/eager/issues/349) - Added option enabling platypus formatted output of pmdtools misincorporation frequencies. 
+ +### `Fixed` + +- [#719](https://github.com/nf-core/eager/pull/719) - Fix filename for BAM output of `mapdamage_rescaling` +- [#707](https://github.com/nf-core/eager/pull/707) - Fix typo in UnifiedGenotyper IndelRealigner command +- Fixed some Java tools not following process memory specifications +- Updated template to nf-core/tools 1.13.2 +- [#711](https://github.com/nf-core/eager/pull/711) - Fix conditional execution preventing MultiVCFAnalyzer from running +- [#714](https://github.com/nf-core/eager/issues/714) - Fix bug in nuclear contamination by upgrading to the latest MultiQC v1.10.1 bugfix release + +### `Dependencies` + +### `Deprecated` + ## [2.3.2] - 2021-03-16 ### `Added` diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md index 405fb1bfd..f4fd052f1 100644 --- a/CODE_OF_CONDUCT.md +++ b/CODE_OF_CONDUCT.md @@ -1,46 +1,111 @@ -# Contributor Covenant Code of Conduct +# Code of Conduct at nf-core (v1.0) ## Our Pledge -In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
+In the interest of fostering an open, collaborative, and welcoming environment, we as contributors and maintainers of nf-core pledge to make participation in our projects and community a harassment-free experience for everyone, regardless of: -## Our Standards +- Age +- Body size +- Familial status +- Gender identity and expression +- Geographical location +- Level of experience +- Nationality and national origins +- Native language +- Physical and neurological ability +- Race or ethnicity +- Religion +- Sexual identity and orientation +- Socioeconomic status -Examples of behavior that contributes to creating a positive environment include: +Please note that the list above is alphabetised and is therefore not ranked in any order of preference or importance. -* Using welcoming and inclusive language -* Being respectful of differing viewpoints and experiences -* Gracefully accepting constructive criticism -* Focusing on what is best for the community -* Showing empathy towards other community members +## Preamble -Examples of unacceptable behavior by participants include: +> Note: This Code of Conduct (CoC) has been drafted by the nf-core Safety Officer and has been edited after input from members of the nf-core team and others. "We", in this document, refers to the Safety Officer and members of the nf-core core team, both of whom are deemed to be members of the nf-core community and are therefore required to abide by this Code of Conduct. This document will be amended periodically to keep it up-to-date, and in case of any dispute, the most current version will apply.
-* The use of sexualized language or imagery and unwelcome sexual attention or advances -* Trolling, insulting/derogatory comments, and personal or political attacks -* Public or private harassment -* Publishing others' private information, such as a physical or electronic address, without explicit permission -* Other conduct which could reasonably be considered inappropriate in a professional setting +An up-to-date list of members of the nf-core core team can be found [here](https://nf-co.re/about). Our current safety officer is Renuka Kudva. + +nf-core is a young and growing community that welcomes contributions from anyone with a shared vision for [Open Science Policies](https://www.fosteropenscience.eu/taxonomy/term/8). Open science policies encompass inclusive behaviours and we strive to build and maintain a safe and inclusive environment for all individuals. + +We have therefore adopted this code of conduct (CoC), which we require all members of our community and attendees in nf-core events to adhere to in all our workspaces at all times. Workspaces include but are not limited to Slack, meetings on Zoom, Jitsi, YouTube live etc. + +Our CoC will be strictly enforced and the nf-core team reserve the right to exclude participants who do not comply with our guidelines from our workspaces and future nf-core activities. + +We ask all members of our community to help maintain a supportive and productive workspace and to avoid behaviours that can make individuals feel unsafe or unwelcome. Please help us maintain and uphold this CoC. + +Questions, concerns or ideas on what we can include? Contact safety [at] nf-co [dot] re ## Our Responsibilities -Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior. 
+The safety officer is responsible for clarifying the standards of acceptable behavior and is expected to take appropriate and fair corrective action in response to any instances of unacceptable behaviour. + +The safety officer, in consultation with the nf-core core team, has the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. + +Members of the core team or the safety officer who violate the CoC will be required to recuse themselves pending investigation. They will not have access to any reports of the violations and will be subject to the same actions as others in violation of the CoC. + +## When and where does this Code of Conduct apply? + +Participation in the nf-core community is contingent on following these guidelines in all our workspaces and events. This includes but is not limited to the following, listed alphabetically and therefore in no order of preference: + +- Communicating with an official project email address. +- Communicating with community members within the nf-core Slack channel. +- Participating in hackathons organised by nf-core (both online and in-person events). +- Participating in collaborative work on GitHub, Google Suite, community calls, mentorship meetings, email correspondence. +- Participating in workshops, training, and seminar series organised by nf-core (both online and in-person events). This applies to events hosted on web-based platforms such as Zoom, Jitsi, YouTube live etc. +- Representing nf-core on social media. This includes both official and personal accounts.
+ +## nf-core cares 😊 + +nf-core's CoC and expectations of respectful behaviours for all participants (including organisers and the nf-core team) include but are not limited to the following (listed in alphabetical order): + +- Ask for consent before sharing another community member’s personal information (including photographs) on social media. +- Be respectful of differing viewpoints and experiences. We are all here to learn from one another and a difference in opinion can present a good learning opportunity. +- Celebrate your accomplishments at events! (Get creative with your use of emojis 🎉 🥳 💯 🙌 !) +- Demonstrate empathy towards other community members. (We don’t all have the same amount of time to dedicate to nf-core. If tasks are pending, don’t hesitate to gently remind members of your team. If you are leading a task, ask for help if you feel overwhelmed.) +- Engage with and enquire after others. (This is especially important given the geographically remote nature of the nf-core community, so let’s do this the best we can) +- Focus on what is best for the team and the community. (When in doubt, ask) +- Graciously accept constructive criticism, yet be unafraid to question, deliberate, and learn. +- Introduce yourself to members of the community. (We’ve all been outsiders and we know that talking to strangers can be hard for some, but remember we’re interested in getting to know you and your visions for open science!) +- Show appreciation and **provide clear feedback**. (This is especially important because we don’t see each other in person and it can be harder to interpret subtleties. Also remember that not everyone understands a certain language to the same extent as you do, so **be clear in your communications to be kind.**) +- Take breaks when you feel like you need them. +- Use welcoming and inclusive language. (Participants are encouraged to display their chosen pronouns on Zoom or in communication on Slack.)
+ +## nf-core frowns on 😕 + +The following behaviours from any participants within the nf-core community (including the organisers) will be considered unacceptable under this code of conduct. Engaging or advocating for any of the following could result in expulsion from nf-core workspaces. + +- Deliberate intimidation, stalking or following and sustained disruption of communication among participants of the community. This includes hijacking shared screens through actions such as using the annotate tool in conferencing software such as Zoom. +- “Doxing” i.e. posting (or threatening to post) another person’s personal identifying information online. +- Spamming or trolling of individuals on social media. +- Use of sexual or discriminatory imagery, comments, or jokes and unwelcome sexual attention. +- Verbal and text comments that reinforce social structures of domination related to gender, gender identity and expression, sexual orientation, ability, physical appearance, body size, race, age, religion or work experience. + +### Online Trolling + +The majority of nf-core interactions and events are held online. Unfortunately, holding events online comes with the added issue of online trolling. This is unacceptable, reports of such behaviour will be taken very seriously, and perpetrators will be excluded from activities immediately. + +All community members are required to ask members of the group they are working within for explicit consent prior to taking screenshots of individuals during video calls. + +## Procedures for Reporting CoC violations -Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. 
+If someone makes you feel uncomfortable through their behaviours or actions, report it as soon as possible. -## Scope +You can reach out to members of the [nf-core core team](https://nf-co.re/about) and they will forward your concerns to the safety officer(s). -This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers. +Issues directly concerning members of the core team will be dealt with by other members of the core team and the safety manager, and possible conflicts of interest will be taken into account. nf-core is also in discussions about having an ombudsperson, and details will be shared in due course. -## Enforcement +All reports will be handled with utmost discretion and confidentiality. -Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team on [Slack](https://nf-co.re/join/slack). The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately. +## Attribution and Acknowledgements
+- The [Contributor Covenant, version 1.4](http://contributor-covenant.org/version/1/4) +- The [OpenCon 2017 Code of Conduct](http://www.opencon2017.org/code_of_conduct) (CC BY 4.0 OpenCon organisers, SPARC and Right to Research Coalition) +- The [eLife innovation sprint 2020 Code of Conduct](https://sprint.elifesciences.org/code-of-conduct/) +- The [Mozilla Community Participation Guidelines v3.1](https://www.mozilla.org/en-US/about/governance/policies/participation/) (version 3.1, CC BY-SA 3.0 Mozilla) -## Attribution +## Changelog -This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct/][version] +### v1.0 - March 12th, 2021 -[homepage]: https://contributor-covenant.org -[version]: https://www.contributor-covenant.org/version/1/4/code-of-conduct/ +- Complete rewrite from original [Contributor Covenant](http://contributor-covenant.org/) CoC. diff --git a/Dockerfile b/Dockerfile index 773a11a32..88e0429a8 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,4 +1,4 @@ -FROM nfcore/base:1.12.1 +FROM nfcore/base:1.13.3 LABEL authors="The nf-core/eager community" \ description="Docker image containing all software requirements for the nf-core/eager pipeline" @@ -7,10 +7,10 @@ COPY environment.yml / RUN conda env create --quiet -f /environment.yml && conda clean -a # Add conda installation dir to PATH (instead of doing 'conda activate') -ENV PATH /opt/conda/envs/nf-core-eager-2.3.2/bin:$PATH +ENV PATH /opt/conda/envs/nf-core-eager-2.3.3/bin:$PATH # Dump the details of the installed packages to a file for posterity -RUN conda env export --name nf-core-eager-2.3.2 > nf-core-eager-2.3.2.yml +RUN conda env export --name nf-core-eager-2.3.3 > nf-core-eager-2.3.3.yml # Instruct R processes to use these empty files instead of clashing with a local version RUN touch .Rprofile diff --git a/README.md b/README.md index 43eec0138..ac9e19a4e 100644 --- a/README.md +++ 
b/README.md @@ -29,12 +29,12 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool 1. Install [`nextflow`](https://nf-co.re/usage/installation) (version >= 20.04.0) -2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/) or [`Podman`](https://podman.io/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_ +2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/), [`Podman`](https://podman.io/), [`Shifter`](https://nersc.gitlab.io/development/shifter/how-to-use/) or [`Charliecloud`](https://hpc.github.io/charliecloud/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_ 3. Download the pipeline and test it on a minimal dataset with a single command: ```bash - nextflow run nf-core/eager -profile test_tsv, + nextflow run nf-core/eager -profile test_tsv, ``` > Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile ` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment. @@ -199,7 +199,6 @@ You can cite the `nf-core` publication as follows: > Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. > > _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x). 
-> ReadCube: [Full Access Link](https://rdcu.be/b1GjZ) In addition, references of tools and data used in this pipeline are as follows: diff --git a/assets/email_template.html b/assets/email_template.html index b1f8792e0..36bfc9c8d 100644 --- a/assets/email_template.html +++ b/assets/email_template.html @@ -1,6 +1,5 @@ - diff --git a/assets/multiqc_config.yaml b/assets/multiqc_config.yaml index c105fcb4e..0d8c7c28a 100644 --- a/assets/multiqc_config.yaml +++ b/assets/multiqc_config.yaml @@ -269,4 +269,4 @@ report_section_order: nf-core-eager-summary: order: -1001 -export_plots: true \ No newline at end of file +export_plots: true diff --git a/assets/nf-core-eager_logo.png b/assets/nf-core-eager_logo.png index 4d301d806..d12a4ca65 100644 Binary files a/assets/nf-core-eager_logo.png and b/assets/nf-core-eager_logo.png differ diff --git a/conf/benchmarking_vikingfish.config b/conf/benchmarking_vikingfish.config index b0c456c61..765cf1f4d 100644 --- a/conf/benchmarking_vikingfish.config +++ b/conf/benchmarking_vikingfish.config @@ -14,7 +14,7 @@ params { //Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_vikingfish.tsv' // Genome reference - fasta = 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Gadus_morhua/representative/GCF_902167405.1_gadMor3.0/GCF_902167405.1_gadMor3.0_genomic.fna.gz' + fasta = 's3://nf-core-awsmegatests/eager/ENA_Data_Fish/GCF_902167405.1_gadMor3.0_genomic.fna.gz' bwaalnn = 0.04 bwaalnl = 1024 diff --git a/conf/test.config b/conf/test.config index d3a5fea2d..9cffc92f5 100644 --- a/conf/test.config +++ b/conf/test.config @@ -22,4 +22,6 @@ params { single_end = false // Genome references fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta' + // Ignore `--input` as otherwise the parameter validation will throw an error + schema_ignore_params = 'genomes,input_paths,input' } diff --git a/conf/test_full.config 
b/conf/test_full.config index 175e593ae..da2827e77 100644 --- a/conf/test_full.config +++ b/conf/test_full.config @@ -12,7 +12,8 @@ params { config_profile_description = 'Full test dataset to check nf-core/eager function' // Input data for full size test - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_vikingfish.tsv' + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Benchmarking/benchmarking_vikingfish.tsv' + // Genome reference fasta = 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Gadus_morhua/representative/GCF_902167405.1_gadMor3.0/GCF_902167405.1_gadMor3.0_genomic.fna.gz' @@ -31,7 +32,6 @@ params { } process { - withName:'adapter_removal'{ cpus = { check_max( 8, 'cpus' ) } memory = { check_max( 16.GB * task.attempt, 'memory' ) } @@ -52,5 +52,7 @@ process { memory = { check_max( 16.GB * task.attempt, 'memory' ) } time = { check_max( 8.h * task.attempt, 'time' ) } } - + + // Ignore `--input` as otherwise the parameter validation will throw an error + schema_ignore_params = 'genomes,input_paths,input' } diff --git a/docs/images/nf-core-eager_logo.png b/docs/images/nf-core-eager_logo.png index 4d301d806..0cc5e6531 100644 Binary files a/docs/images/nf-core-eager_logo.png and b/docs/images/nf-core-eager_logo.png differ diff --git a/docs/output.md b/docs/output.md index 0433aed11..cc07d9a69 100644 --- a/docs/output.md +++ b/docs/output.md @@ -1,9 +1,5 @@ # nf-core/eager: Output -## :warning: Please read this documentation on the nf-core website: [https://nf-co.re/eager/output](https://nf-co.re/eager/output) - -> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._ - ## Introduction The output of nf-core/eager primarily consists of the following main components: output alignment files (e.g. 
VCF, BAM or FASTQ files), and summary statistics of the whole run presented in a [`MultiQC`](https://multiqc.info) report. Intermediate files and module-specific statistics files are also retained depending on your particular run configuration. @@ -23,25 +19,25 @@ results/ work/ ``` -- The parent directory `` is the parent directory of the run, either the directory the pipeline was run from or as specified by the `--outdir` flag. The default name of the output directory (unless otherwise specified) will be `./results/`. +* The parent directory `` is the parent directory of the run, either the directory the pipeline was run from or as specified by the `--outdir` flag. The default name of the output directory (unless otherwise specified) will be `./results/`. ### Primary Output Directories These directories are the ones you will use on a day-to-day basis and are those which you should familiarise yourself with. -- The `MultiQC` directory is the most important directory and contains the main summary report of the run in HTML format, which can be viewed in a web-browser of your choice. The sub-directory contains the MultiQC collected data used to build the HTML report. The Report allows you to get an overview of the sequencing and mapping quality as well as aDNA metrics (see the [MultiQC Report](#multiqc-report) section for more detail). -- A `` directory contains the (cleaned-up) output from a particular software module. This is the second most important set of directories. This contains output files such as FASTQ, BAM, statistics, and/or plot files of a specific module (see the [Output Files](#output-files) section for more detail). The latter two are only needed when you need finer detail about that particular part of the pipeline. +* The `MultiQC` directory is the most important directory and contains the main summary report of the run in HTML format, which can be viewed in a web-browser of your choice. 
The sub-directory contains the MultiQC collected data used to build the HTML report. The Report allows you to get an overview of the sequencing and mapping quality as well as aDNA metrics (see the [MultiQC Report](#multiqc-report) section for more detail). +* A `` directory contains the (cleaned-up) output from a particular software module. This is the second most important set of directories. This contains output files such as FASTQ, BAM, statistics, and/or plot files of a specific module (see the [Output Files](#output-files) section for more detail). The latter two are only needed when you need finer detail about that particular part of the pipeline. ### Secondary Output Directories These are less important directories which are used less often, normally in the context of bug-reporting. -- `pipeline_info/`: [Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. - - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. - - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.csv`. - - Documentation for interpretation of results in HTML format: `results_description.html`. -- `reference_genome/` contains either text files describing the location of specified reference genomes, and if not already supplied when running the pipeline, auxiliary indexing files. This is often useful when re-running other samples using the same reference genome, but is otherwise often not important. -- The `work/` directory contains all the `nextflow` processing directories. 
This is where `nextflow` actually does all the work, but in an efficient programmatic procedure that is not intuitive to human-readers. Due to this, the directory is often not important to a user as all the useful output files are linked to the module directories (see above). Otherwise, this directory maybe useful when a bug-reporting. +* `pipeline_info/`: [Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. + * Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. + * Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.csv`. + * Documentation for interpretation of results in HTML format: `results_description.html`. +* `reference_genome/` contains text files describing the location of specified reference genomes and, if not already supplied when running the pipeline, auxiliary indexing files. This is often useful when re-running other samples using the same reference genome, but is otherwise often not important. +* The `work/` directory contains all the `nextflow` processing directories. This is where `nextflow` actually does all the work, but in an efficient programmatic procedure that is not intuitive to human-readers. Due to this, the directory is often not important to a user as all the useful output files are linked to the module directories (see above). Otherwise, this directory may be useful when bug-reporting. > :warning: Note that `work/` will be created wherever you are running the `nextflow run` command from, unless you specify the location with `-w`, i.e. it will not by default be in `outdir`!
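The MultiQC data sub-directory mentioned above also holds the tables behind the HTML report as plain text, which can be convenient for downstream scripting. Here is a minimal sketch of loading the general statistics table; the file name `multiqc_general_stats.txt` and its tab-separated layout are assumptions based on typical MultiQC output, so verify against your own `multiqc_data/` directory:

```python
import csv
from pathlib import Path

def load_general_stats(multiqc_data_dir):
    """Parse MultiQC's general statistics table (tab-separated) into a
    list of per-sample dicts mapping column name -> value (as strings)."""
    # File name assumed from typical MultiQC output; verify in your run.
    path = Path(multiqc_data_dir) / "multiqc_general_stats.txt"
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))

# e.g. rows = load_general_stats("results/MultiQC/multiqc_data")
```

From there each row can be filtered or re-exported however you like, without touching the HTML report itself.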
@@ -55,7 +51,7 @@ For more information about how to use MultiQC reports, see [http://multiqc.info] #### Background -This is the main summary table produced by MultiQC that the report begins with. This section of the report is generated by MultiQC itself rather than stats produced by a specific module. It shows whatever each module considers to be as the 'most important' values to be displayed - however the nf-core/eager version has been somewhat customised to make it as close to the EAGER (v1) ReportTable format as possible, with some opinionated tweaks. +This is the main summary table produced by MultiQC that the report begins with. This section of the report is generated by MultiQC itself rather than stats produced by a specific module. It shows whatever each module considers the 'most important' values to display — however the nf-core/eager version has been somewhat customised to make it as close to the EAGER (v1) ReportTable format as possible, with some opinionated tweaks. #### Table @@ -65,40 +61,40 @@ Each column name is supplied by the module, so you may see similar column names. The possible columns displayed by default are as follows: -- **Sample Name** This is the log file name without file suffix(s). This will depend on the module outputs. -- **Seqs** This is from Pre-AdapterRemoval FastQC. Represents the number of raw reads in your untrimmed and (paired end) unmerged FASTQ file. Each row should be approximately equal to the number of reads you requested to be sequenced, divided by the number of FASTQ files you received for that library. -- **Length** This is from Pre-AdapterRemoval FastQC. This is the average read length in your untrimmed and (paired end) unmerged FASTQ file and should represent the number of cycles of your sequencing chemistry. -- **%GC** This is from Pre-AdapterRemoval FastQC. This is the average GC content in percent of all the reads in your untrimmed and (paired end) unmerged FASTQ file.
-- **GC content** This is from FastP. This is the average GC of all reads in your untrimmed and unmerged FASTSQ file after poly-G tail trimming. If you have lots of tails, this value should drop from the pre-AdapterRemoval FastQC %GC column. -- **% Trimmed** This is from AdapterRemoval. It is the percentage of reads which had an adapter sequence removed from the end of the read. -- **Seqs** This is from Post-AdapterRemoval FastQC. Represents the number of preprocessed reads in your adapter trimmed (paired end) merged FASTQ file. The loss between this number and the Pre-AdapterRemoval FastQC can give you an idea of the quality of trimming and merging. -- **%GC** This is from Post-AdapterRemoval FastQC. Represents the average GC of all preprocessed reads in your adapter trimmed (paired end) merged FASTQ file. -- **Length** This is from post-AdapterRemoval FastQC. This is the average read length in your trimmed and (paired end) merged FASTQ file and should represent the 'realistic' average lengths of your DNA molecules -- **% Aligned** This is from bowtie2. It reports the percentage of input reads that mapped to your reference genome. This number will be likely similar to Endogenous DNA % (see below). -- **Mappability** This is from MALT. It reports the percentage of the off-target reads (from mapping), that could map to your MALT metagenomic database. This can often be low for aDNA due to short reads and database bias. -- **% Unclassified** This is from Kraken. It reports the percentage of reads that could not be aligned and taxonomically assigned against your Kraken metagenomic database. This can often be high for aDNA due to short reads and database bias. -- **Reads Mapped** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _prior_ map quality filtering. -- **Endogenous DNA (%)** This is from the endorS.py tool. It displays a percentage of mapped reads over total reads that went into mapped (i.e. 
the percentage DNA content of the library that matches the reference). Assuming a perfect ancient sample with no modern contamination, this would be the amount of true ancient DNA in the sample. However this value _most likely_ include contamination and will not entirely be the true 'endogenous' content. -- **Reads Mapped** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _after_ map quality filtering (note the column name does not distinguish itself from prior-map quality filtering, but the post-filter column is always second) -- **Endogenous DNA Post (%)** This is from the endorS.py tool. It displays a percentage of mapped reads _after_ BAM filtering (e.g. for mapping quality) over total reads that went into mapped (i.e. the percentage DNA content of the library that matches the reference). This column will only be displayed if BAM filtering is turned on and is based on the original mapping for total reads, and mapped reads as calculated from the post-filtering BAM. -- **ClusterFactor** This is from DeDup. This is a value representing the how many duplicates in the library exist for each unique read. A cluster factor close to one replicates a highly complex library and could be sequenced further. Generally with a value of more than 2 you will not be gaining much more information by sequencing deeper. -- **Dups** This is from Picard's markDuplicates. It represents the percentage of reads in your library that were exact duplicates of other reads in your database. The lower the better, as high duplication rate means lots of sequencing of the same information (and therefore is not time or cost effective). -- **X Prime Y>Z N base** These columns are from DamageProfiler. The prime numbers represent which end of the reads the damage is referring to. The Y>Z is the type of substitution (C>T is the true damage, G>A is the complementary). 
You should see for no- and half- UDG treatment a decrease in frequency from the 1st to 2nd base. -- **Mean Read Length** This is from DamageProfiler. This is the mean length of all de-duplicated mapped reads. Ancient DNA normally will have a mean between 30-75, however this can vary. -- **Median Read Length** This is from DamageProfiler. This is the median length of all de-duplicated mapped reads. Ancient DNA normally will have a mean between 30-75, however this can vary. -- **Aligned** This is from Qualimap. This is the total number of _deduplicated_ reads that mapped to your reference genome. This is the **best** number to report for final mapped reads in final publications. -- **Mean/Median Coverage** This is from Qualimap. This is the mean/median number of times a base on your reference genome was covered by a read (i.e. depth coverage). This average includes bases with 0 reads covering that position. -- **>= 1X** to **>= 5X** These are from Qualimap. This is the percentage of the genome covered at that particular depth coverage. -- **% GC** This is the mean GC content in percent of all mapped reads post-deduplication. This should normally be close to the GC content of your reference genome. -- **MT to Nuclear Ratio** This from MTtoNucRatio. This reports the number of reads aligned to a mitochondrial entry in your reference FASTA to all other entries. This will typically be high but will vary depending on tissue type. -- **XRate** This is from Sex.DetERRmine. This is the relative depth of coverage on the X-chromosome. -- **YRate** This is from Sex.DetERRmine. This is the relative depth of coverage on the Y-chromosome. -- **#SNPs Covered** This is from eigenstrat\_snp\_coverage. The number of called SNPs after genotyping with pileupcaller. -- **#SNPs Total** This is from eigenstrat\_snp\_coverage. The maximum number of covered SNPs, i.e. the number of SNPs in the .snp file provided to pileupcaller with `--pileupcaller_snpfile`. 
-- **Number of SNPs** This is from ANGSD. The number of SNPs left after removing sites with no data in a 5 base pair surrounding region. -- **Contamination Estimate (Method1_ML)** This is from the nuclear contamination function of ANGSD. The Maximum Likelihood contamination estimate according to Method 1. The estimates using Method of Moments and/or those based on Method 2 can be unhidden through the "Configure Columns" button. -- **Estimate Error (Method1_ML)** This is from ANGSD. The standard error of the Method1 Maximum likelihood estimate. The errors associated with Method of Moments and/or Method2 estimates can be unhidden through the "Configure Columns" button. -- **% Hets** This is from MultiVCFAnalyzer. This reports the number of SNPs on an assumed haploid organism that have two possible alleles. A high percentage may indicate cross-mapping from a related species. +* **Sample Name** This is the log file name without file suffix(s). This will depend on the module outputs. +* **Seqs** This is from Pre-AdapterRemoval FastQC. Represents the number of raw reads in your untrimmed and (paired end) unmerged FASTQ file. Each row should be approximately equal to the number of reads you requested to be sequenced, divided by the number of FASTQ files you received for that library. +* **Length** This is from Pre-AdapterRemoval FastQC. This is the average read length in your untrimmed and (paired end) unmerged FASTQ file and should represent the number of cycles of your sequencing chemistry. +* **%GC** This is from Pre-AdapterRemoval FastQC. This is the average GC content in percent of all the reads in your untrimmed and (paired end) unmerged FASTQ file. +* **GC content** This is from FastP. This is the average GC of all reads in your untrimmed and unmerged FASTQ file after poly-G tail trimming. If you have lots of tails, this value should drop relative to the pre-AdapterRemoval FastQC %GC column. +* **% Trimmed** This is from AdapterRemoval.
It is the percentage of reads which had an adapter sequence removed from the end of the read. +* **Seqs** This is from Post-AdapterRemoval FastQC. Represents the number of preprocessed reads in your adapter trimmed (paired end) merged FASTQ file. The loss between this number and the Pre-AdapterRemoval FastQC count can give you an idea of the quality of trimming and merging. +* **%GC** This is from Post-AdapterRemoval FastQC. Represents the average GC of all preprocessed reads in your adapter trimmed (paired end) merged FASTQ file. +* **Length** This is from post-AdapterRemoval FastQC. This is the average read length in your trimmed and (paired end) merged FASTQ file and should represent the 'realistic' average lengths of your DNA molecules. +* **% Aligned** This is from bowtie2. It reports the percentage of input reads that mapped to your reference genome. This number will likely be similar to Endogenous DNA % (see below). +* **Mappability** This is from MALT. It reports the percentage of the off-target reads (from mapping) that could map to your MALT metagenomic database. This can often be low for aDNA due to short reads and database bias. +* **% Unclassified** This is from Kraken. It reports the percentage of reads that could not be aligned and taxonomically assigned against your Kraken metagenomic database. This can often be high for aDNA due to short reads and database bias. +* **Reads Mapped** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _prior to_ map quality filtering. +* **Endogenous DNA (%)** This is from the endorS.py tool. It displays a percentage of mapped reads over total reads that went into mapping (i.e. the percentage DNA content of the library that matches the reference). Assuming a perfect ancient sample with no modern contamination, this would be the amount of true ancient DNA in the sample. However this value _most likely_ includes contamination and will not entirely be the true 'endogenous' content.
+* **Reads Mapped** This is from Samtools. This is the raw number of preprocessed reads mapped to your reference genome _after_ map quality filtering (note the column name does not distinguish itself from prior-map quality filtering, but the post-filter column is always second). +* **Endogenous DNA Post (%)** This is from the endorS.py tool. It displays a percentage of mapped reads _after_ BAM filtering (e.g. for mapping quality) over total reads that went into mapping (i.e. the percentage DNA content of the library that matches the reference). This column will only be displayed if BAM filtering is turned on and is based on the original mapping for total reads, and mapped reads as calculated from the post-filtering BAM. +* **ClusterFactor** This is from DeDup. This is a value representing how many duplicates in the library exist for each unique read. A cluster factor close to one indicates a highly complex library that could be sequenced further. Generally with a value of more than 2 you will not be gaining much more information by sequencing deeper. +* **Dups** This is from Picard's markDuplicates. It represents the percentage of reads in your library that were exact duplicates of other reads in your library. The lower the better, as a high duplication rate means lots of sequencing of the same information (and therefore is not time or cost effective). +* **X Prime Y>Z N base** These columns are from DamageProfiler. The prime numbers represent which end of the reads the damage is referring to. The Y>Z is the type of substitution (C>T is the true damage, G>A is the complementary). For no- and half-UDG treatment you should see a decrease in frequency from the 1st to 2nd base. +* **Mean Read Length** This is from DamageProfiler. This is the mean length of all de-duplicated mapped reads. Ancient DNA normally will have a mean between 30-75, however this can vary. +* **Median Read Length** This is from DamageProfiler.
This is the median length of all de-duplicated mapped reads. Ancient DNA normally will have a median between 30-75, however this can vary. +* **Aligned** This is from Qualimap. This is the total number of _deduplicated_ reads that mapped to your reference genome. This is the **best** number to report for final mapped reads in final publications. +* **Mean/Median Coverage** This is from Qualimap. This is the mean/median number of times a base on your reference genome was covered by a read (i.e. depth coverage). This average includes bases with 0 reads covering that position. +* **>= 1X** to **>= 5X** These are from Qualimap. This is the percentage of the genome covered at that particular depth coverage. +* **% GC** This is the mean GC content in percent of all mapped reads post-deduplication. This should normally be close to the GC content of your reference genome. +* **MT to Nuclear Ratio** This is from MTtoNucRatio. This reports the ratio of reads aligned to a mitochondrial entry in your reference FASTA versus all other entries. This will typically be high but will vary depending on tissue type. +* **XRate** This is from Sex.DetERRmine. This is the relative depth of coverage on the X-chromosome. +* **YRate** This is from Sex.DetERRmine. This is the relative depth of coverage on the Y-chromosome. +* **#SNPs Covered** This is from eigenstrat\_snp\_coverage. The number of called SNPs after genotyping with pileupcaller. +* **#SNPs Total** This is from eigenstrat\_snp\_coverage. The maximum number of covered SNPs, i.e. the number of SNPs in the .snp file provided to pileupcaller with `--pileupcaller_snpfile`. +* **Number of SNPs** This is from ANGSD. The number of SNPs left after removing sites with no data in a 5 base pair surrounding region. +* **Contamination Estimate (Method1_ML)** This is from the nuclear contamination function of ANGSD. The Maximum Likelihood contamination estimate according to Method 1.
The estimates using Method of Moments and/or those based on Method 2 can be unhidden through the "Configure Columns" button. +* **Estimate Error (Method1_ML)** This is from ANGSD. The standard error of the Method1 Maximum likelihood estimate. The errors associated with Method of Moments and/or Method2 estimates can be unhidden through the "Configure Columns" button. +* **% Hets** This is from MultiVCFAnalyzer. This reports the number of SNPs on an assumed haploid organism that have two possible alleles. A high percentage may indicate cross-mapping from a related species. For other non-default columns (activated under 'Configure Columns'), hover over the column name for further descriptions. @@ -120,7 +116,7 @@ For further reading and documentation see the [FastQC help pages](http://www.bio #### Sequence Counts -This shows a barplot with the overall number of sequences (x axis) in your raw library after demultiplexing, **per file** (y-axis). If you have paired end data, you will have one bar for Read 1 (or forward), and a second bar for Read 2 (or reverse). Each entire bar should represent approximately what you requested from the sequencer itself - unless you have your library sequenced over multiple lanes, where it should be what you request divided by the number of lanes it was split over. +This shows a barplot with the overall number of sequences (x axis) in your raw library after demultiplexing, **per file** (y-axis). If you have paired end data, you will have one bar for Read 1 (or forward), and a second bar for Read 2 (or reverse). Each entire bar should represent approximately what you requested from the sequencer itself — unless you have your library sequenced over multiple lanes, where it should be what you request divided by the number of lanes it was split over. A section of the bar will also show an approximate estimation of the fraction of the total number of reads that are duplicates of another. 
This can derive from over-amplification of the library, or lots of single adapters. This can be later checked with the Deduplication check. A good library and sequencing run should have very low amounts of duplicate reads. @@ -140,9 +136,9 @@ You will often see that the first 5 or so bases have slightly lower quality than Things to watch out for: -- all positions having Phred scores less than 27 -- a sharp drop-off of quality early in the read -- for paired-end data, if either R1 or R2 is significantly lower quality across the whole read compared to the complementary read. +* all positions having Phred scores less than 27 +* a sharp drop-off of quality early in the read +* for paired-end data, if either R1 or R2 is significantly lower quality across the whole read compared to the complementary read. #### Per Sequence Quality Scores @@ -154,8 +150,8 @@ This is a further summary of the previous plot. This is a histogram of the _overall_ Things to watch out for: -- bi-modal peaks which suggests artefacts in some of the sequencing cycles -- all peaks being in orange or red sections which suggests an overall bad sequencing run (possibly due to a faulty flow-cell). +* bi-modal peaks which suggest artefacts in some of the sequencing cycles +* all peaks being in orange or red sections which suggests an overall bad sequencing run (possibly due to a faulty flow-cell). #### Per Base Sequencing Content @@ -169,7 +165,7 @@ You expect to see the whole heatmap to be a relatively equal block of colour (normal Things to watch out for: -- If you see a particular colour becoming more prominent this suggests there is an over-representation of those bases at that base-pair range across all reads (e.g. 20-24bp). This could happen if you have lots of PCR duplicates, or poly-G tails from Illumina NextSeq/NovaSeq 2-colour chemistry data (where no fluorescence can mean both G or 'no-call').
+* If you see a particular colour becoming more prominent this suggests there is an over-representation of those bases at that base-pair range across all reads (e.g. 20-24bp). This could happen if you have lots of PCR duplicates, or poly-G tails from Illumina NextSeq/NovaSeq 2-colour chemistry data (where no fluorescence can mean both G or 'no-call'). > If you see Poly-G tails, we recommend turning on FastP poly-G trimming with EAGER. See the 'running' documentation page for details. @@ -183,7 +179,7 @@ This line graph shows the percentage of reads (y-axis) with an average perce Things to watch out for: -- If you see particularly high percent GC content peak with NextSeq/NovaSeq data, you may have lots of PCR duplicates, or poly-G tails from Illumina NextSeq/NovaSeq 2-colour chemistry data (where no fluorescence can mean both G or 'no-call'). Consider re-running nf-core/eager using the poly-G trimming option from `fastp` See the 'running' documentation page for details. +* If you see a particularly high percent GC content peak with NextSeq/NovaSeq data, you may have lots of PCR duplicates, or poly-G tails from Illumina NextSeq/NovaSeq 2-colour chemistry data (where no fluorescence can mean both G or 'no-call'). Consider re-running nf-core/eager using the poly-G trimming option from `fastp`. See the 'running' documentation page for details. #### Per Base N Content @@ -203,11 +199,11 @@ This plot is somewhat similar to looking at duplication rate or 'cluster factor

-A good library should have very low rates of duplication (vast majority of reads having a duplication rate of 1) - suggesting 'high complexity' or lots of unique reads and useful data. This is represented as a steep drop in the line plot and possible a very small curve at about a duplication rate of 2 or 3 and then remaining at ~0 for higher duplication rates. +A good library should have very low rates of duplication (vast majority of reads having a duplication rate of 1) — suggesting 'high complexity' or lots of unique reads and useful data. This is represented as a steep drop in the line plot and possibly a very small curve at about a duplication rate of 2 or 3 and then remaining at ~0 for higher duplication rates. Note that good libraries may sometimes have small peaks at high duplication levels. This may be due to free-adapters (with no inserts), or mono-nucleotide reads (e.g. GGGGG in NextSeq/NovaSeq data). -Bad libraries which have extremely low input DNA (so during amplification the same molecules been amplified repeatedly), or a good library that has been erroneously over-amplified will show very high duplication levels - so a very slowly decreasing curve. Alternatively, if your library construction failed and many adapters were not ligated to insert molecules, a high duplication rate may be caused by these free-adapters (see 'Overrepresented sequences' for more information). +Bad libraries which have extremely low input DNA (so during amplification the same molecules have been amplified repeatedly), or a good library that has been erroneously over-amplified will show very high duplication levels — so a very slowly decreasing curve. Alternatively, if your library construction failed and many adapters were not ligated to insert molecules, a high duplication rate may be caused by these free-adapters (see 'Overrepresented sequences' for more information).
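The shape of the duplication-level curve described above can be made concrete with a short sketch. This is only a simplified illustration of the idea behind the plot, not FastQC's actual estimator, which subsamples reads and applies length corrections:

```python
from collections import Counter

def duplication_levels(reads):
    """Fraction of distinct sequences observed at each duplication level.
    A complex library puts most of its mass at level 1."""
    copies = Counter(reads)            # sequence -> number of copies seen
    levels = Counter(copies.values())  # duplication level -> n distinct sequences
    total = sum(levels.values())
    return {level: n / total for level, n in sorted(levels.items())}

# A 'good' library: mostly unique reads, with a small bump at level 2.
print(duplication_levels(["ACGT", "TTGA", "CCGA", "GATC", "ACGT"]))
# -> {1: 0.75, 2: 0.25}
```

An over-amplified library would instead show substantial mass at levels well above 1, matching the slowly decreasing curve described above.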
> **NB:** amplicon libraries such as for 16S rRNA analysis may appear here as having high duplication rates and these peaks can be ignored. This can be verified if no contaminants are found in the 'Overrepresented sequences' section.

@@ -259,7 +255,7 @@ After filtering, you should see that the average GC content along the reads is n

Things to look out for:

-- If you see a distinct GC content increase at the end of the reads, but are not removed after filtering, check to see where along the read the increase seems to start. If it is less than 10 base pairs from the end, consider reducing the overlap parameter `--complexity_filter_poly_g_min`, which tells FastP how far in the read the Gs need to go before removing them.
+* If you see a distinct GC content increase at the end of the reads that is not removed after filtering, check to see where along the read the increase seems to start. If it is less than 10 base pairs from the end, consider reducing the overlap parameter `--complexity_filter_poly_g_min`, which tells FastP how far into the read the Gs need to go before removing them.

### AdapterRemoval

@@ -267,10 +263,10 @@ Things to look out for:

AdapterRemoval a tool that does the post-sequencing clean up of your sequencing reads. It performs the following functions

-- 'Merges' (or 'collapses') forward and reverse reads of Paired End data
-- Removes remaining library indexing adapters
-- Trims low quality base tails from ends of reads
-- Removes too-short reads
+* 'Merges' (or 'collapses') forward and reverse reads of Paired End data
+* Removes remaining library indexing adapters
+* Trims low quality base tails from ends of reads
+* Removes too-short reads

In more detail merging is where the same read from the forward and reverse files of a single library (based on the flowcell coordinates), are compared to find a stretch of sequence that are the same.
If this overlap reaches certain quality thresholds, the two reads are 'collapsed' into a single read, with the base quality scores are updated accordingly accounting for the increase quality call precision.

@@ -290,10 +286,10 @@ The most important value is the **Retained Read Pairs** which gives you the fina

Other Categories:

-- If paired-end, the **Singleton [mate] R1(/R2)** categories represent reads which were unable to be collapsed, possibly due to the reads being too long to overlap.
-- If paired-end, **Full-length collapsed pairs** are reads which were collapsed and did not require low-quality bases at end of reads to be removed.
-- If paired-end, **Truncated collapsed pairs** are paired-end that were collapsed but did required the removal of low quality bases at the end of reads.
-- **Discarded [mate] R1/R2** represent reads which were a part of a pair, but one member of the pair did not reach other quality criteria and was discarded. However the other member of the pair is still retained in the output file as it still reached other quality criteria.
+* If paired-end, the **Singleton [mate] R1(/R2)** categories represent reads which could not be collapsed, possibly due to the reads being too long to overlap.
+* If paired-end, **Full-length collapsed pairs** are reads which were collapsed and did not require low-quality bases at the end of reads to be removed.
+* If paired-end, **Truncated collapsed pairs** are paired-end reads that were collapsed but did require the removal of low-quality bases at the end of reads.
+* **Discarded [mate] R1/R2** represent reads which were part of a pair where the other member did not reach the quality criteria and was discarded. The remaining member is still retained in the output file, as it did reach the quality criteria.

@@ -307,11 +303,11 @@ If you see high numbers of discarded or truncated reads, you should check your F

The length distribution plots show the number of reads at each read-length. You can change the plot to display different categories.

-- All represent the overall distribution of reads. In the case of paired-end sequencing You may see a peak at the turn around from forward to reverse cycles.
-- **Mate 1** and **Mate 2** represents the length of the forward and reverse read respectively prior collapsing
-- **Singleton** represent those reads that had a one member of a pair discarded
-- **Collapsed** and **Collapsed Truncated** represent reads that overlapped and able to merge into a single read, with the latter including base-quality trimming off ends of reads. These plots will start with a vertical rise representing where you are above the minimum-read threshold you set.
-- **Discarded** here represents the number of reads that did not each the read length filter. You will likely see a vertical drop at what your threshold was set to.
+* **All** represents the overall distribution of reads. In the case of paired-end sequencing, you may see a peak at the turn-around from forward to reverse cycles.
+* **Mate 1** and **Mate 2** represent the lengths of the forward and reverse reads respectively, prior to collapsing.
+* **Singleton** represents those reads for which the other member of the pair was discarded.
+* **Collapsed** and **Collapsed Truncated** represent reads that overlapped and were able to be merged into a single read, with the latter including base-quality trimming off the ends of reads. These plots will start with a vertical rise representing where you are above the minimum read-length threshold you set.
+* **Discarded** here represents the number of reads that did not reach the read length filter. You will likely see a vertical drop at what your threshold was set to.

@@ -357,7 +353,7 @@ Due to low 'endogenous' content of aDNA, and the high biodiversity of modern or

- This can also be influenced by the type of database you supplied - many databases have an over-abundance of taxa of clinical or economic interest, so when you have a large amount of uncharacterised environmental taxa, this may also result in low mappability.
+ This can also be influenced by the type of database you supplied — many databases have an over-abundance of taxa of clinical or economic interest, so when you have a large amount of uncharacterised environmental taxa, this may also result in low mappability.

#### Taxonomic assignment success

@@ -376,7 +372,7 @@ there is some sequencing artefact (although it could just be badly preserved and

#### Background

-Kraken is another metagenomic classifier, but takes a different approach to alignment as with [MALT](#malt). It uses 'K-mer similarity' between reads and references to very efficiently find similar patterns in sequences. It does not however, do alignment - meaning you cannot screen for authentication criteria such as damage patterns and fragment lengths.
+Kraken is another metagenomic classifier, but takes a different approach from the alignment performed by [MALT](#malt). It uses 'K-mer similarity' between reads and references to very efficiently find similar patterns in sequences. It does not, however, do alignment — meaning you cannot screen for authentication criteria such as damage patterns and fragment lengths.

It is useful when you do not have large computing power or you want very rapid but rough approximation of the metagenomic profile of your sample.

@@ -384,7 +380,7 @@ You will receive output for each *library*. This means that if you use TSV input

#### Top Taxa

-This plot gives you an approximation of the abundance of the five top taxa identified. Typically for ancient DNA, this will be quite a small fraction of taxa, as archaeological and museum samples have a large biodiversity from environmental microbes - therefore a large fraction of 'unclassified' can be quite normal.
+This plot gives you an approximation of the abundance of the five top taxa identified. Typically for ancient DNA, this will be quite a small fraction of taxa, as archaeological and museum samples have a large biodiversity from environmental microbes — therefore a large fraction of 'unclassified' can be quite normal.

@@ -428,15 +424,15 @@ DeDup is a duplicate removal tool which searches for PCR duplicates and removes

This stacked bar plot shows as a whole the total number of reads in the BAM file going into DeDup. The different sections of a given bar represents the following:

-- **Not Removed** - the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
-- **Reverse Removed** - the number of reads that found to be a duplicate of another and removed that were un-collapsed reverse reads (from the earlier read merging step).
-- **Forward Removed** - the number of reads that found to be a duplicate of another and removed that were an un-collapsed forward reads (from the earlier read merging step).
-- **Merged Removed** - the number of reads that were found to be a duplicate and removed that were a collapsed read (from the earlier read merging step).
+* **Not Removed** — the overall number of reads remaining after duplicate removal. These may have had a duplicate (see below).
+* **Reverse Removed** — the number of removed duplicate reads that were un-collapsed reverse reads (from the earlier read merging step).
+* **Forward Removed** — the number of removed duplicate reads that were un-collapsed forward reads (from the earlier read merging step).
+* **Merged Removed** — the number of removed duplicate reads that were collapsed reads (from the earlier read merging step).

Exceptions to the above:

-- If you do not have paired end data, you will not have sections for 'Merged removed' or 'Reverse removed'.
-- If you use the `--dedup_all_merged` flag, you will not have the 'Forward removed' or 'Reverse removed' sections.
+* If you do not have paired-end data, you will not have sections for 'Merged removed' or 'Reverse removed'.
+* If you use the `--dedup_all_merged` flag, you will not have the 'Forward removed' or 'Reverse removed' sections.

@@ -444,8 +440,8 @@ Exceptions to the above:

Things to look out for:

-- The smaller the number of the duplicates removed the better. If you have a small number of duplicates, and wish to sequence deeper, you can use the preseq module (see below) to make an estimate on how much deeper to sequence.
-- If you have a very large number of duplicates that were removed this may suggest you have an over amplified library, or a lot of left-over adapters that were able to map to your genome.
+* The smaller the number of duplicates removed, the better. If you have a small number of duplicates and wish to sequence deeper, you can use the preseq module (see below) to estimate how much deeper to sequence.
+* If a very large number of duplicates were removed, this may suggest you have an over-amplified library, or a lot of left-over adapters that were able to map to your genome.

### Picard

@@ -455,7 +451,7 @@ Picard is a toolkit for general BAM file manipulation with many different functi

#### Mark Duplicates

-The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well- preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.
+The deduplication stats plot shows you how many reads were detected and then removed during deduplication of a mapped BAM file. Well-preserved and constructed libraries will typically have many unique reads and few duplicates. These libraries are often good candidates for deeper sequencing (if required), but low-endogenous DNA libraries that have been over-amplified will have few unique reads and many copies of each read. For better calculations you can see the [Preseq](#preseq) module below.

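To make the duplication arithmetic discussed in these deduplication sections concrete, here is a minimal sketch (our own illustration with invented numbers; the function names are not from any tool) of the 'cluster factor' often quoted for aDNA libraries, i.e. reads before deduplication divided by reads after:

```python
# Minimal sketch of the duplication arithmetic; names and values are
# illustrative, not output of DeDup or Picard MarkDuplicates.

def cluster_factor(reads_before: int, reads_after: int) -> float:
    """Reads before deduplication / reads after; ~1.0 means a complex library."""
    return reads_before / reads_after

def percent_duplicates(reads_before: int, reads_after: int) -> float:
    """Share of mapped reads that were removed as duplicates."""
    return 100.0 * (reads_before - reads_after) / reads_before

cf = cluster_factor(1_000_000, 800_000)        # 1.25 -> still fairly complex
pct = percent_duplicates(1_000_000, 800_000)   # 20.0 percent duplicates
```

A cluster factor close to 1 supports deeper sequencing; a large value points to the over-amplification scenarios described above (for proper extrapolation, use Preseq).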
@@ -465,8 +461,8 @@ The amount of unmapped reads will depend on whether you have filtered out unmapp

Things to look out for:

-- The smaller the number of the duplicates removed the better. If you have a smaller number of duplicates, and wish to sequence deeper, you can use the preseq module (see below) to make an estimate on how much deeper to sequence.
-- If you have a very large number of duplicates that were removed this may suggest you have an over amplified library, a badly preserved sample with a very low yield, or a lot of left-over adapters that were able to map to your genome.
+* The smaller the number of duplicates removed, the better. If you have a small number of duplicates and wish to sequence deeper, you can use the preseq module (see below) to estimate how much deeper to sequence.
+* If a very large number of duplicates were removed, this may suggest you have an over-amplified library, a badly preserved sample with a very low yield, or a lot of left-over adapters that were able to map to your genome.

### Preseq

@@ -492,9 +488,9 @@ The dashed line represents a 'perfect' library containing only unique molecules

Plateauing can be caused by a number of reasons:

-- You have simply sequenced your library to exhaustion
-- You have an over-amplified library with many PCR duplicates. You should consider rebuilding the library to maximise data to cost ratio
-- You have a low quality library made up of mappable sequencing artefacts that were able to pass filtering (e.g. adapters)
+* You have simply sequenced your library to exhaustion
+* You have an over-amplified library with many PCR duplicates. You should consider rebuilding the library to maximise the data-to-cost ratio
+* You have a low-quality library made up of mappable sequencing artefacts that were able to pass filtering (e.g. adapters)

### DamageProfiler

@@ -504,9 +500,9 @@ DamageProfiler is a tool which calculates a variety of standard 'aDNA' metrics f

Therefore, three main characteristics of ancient DNA are:

-- Short DNA fragments
-- Elevated G and As (purines) just before strand breaks
-- Increased C and Ts at ends of fragments
+* Short DNA fragments
+* Elevated Gs and As (purines) just before strand breaks
+* Increased Cs and Ts at the ends of fragments

You will receive output for each deduplicated *library*. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes of the library in one value.

@@ -516,12 +512,12 @@ The MultiQC DamageProfiler module misincorporation plots shows the percent frequ

When looking at the misincorporation plots, keep the following in mind:

-- As few-base single-stranded overhangs are more likely to occur than long overhangs, we expect to see a gradual decrease in the frequency of the modifications from position 1 to the inside of the reads.
-- If your library has been **partially-UDG treated**, only the first one or two bases will display the misincorporation frequency.
-- If your library has been **UDG treated** you will expect to see extremely-low to no misincorporations at read ends.
-- If your library is **single-stranded**, you will expect to see only C to T misincorporations at both 5' and 3' ends of the fragments.
-- We generally expect that the older the sample, or the less-ideal preservational environment (hot/wet) the greater the frequency of C to T/G to A.
-- The curve will be not smooth then you have few reads informing the frequency calculation. Read counts of less than 500 are likely not reliable.
+* As few-base single-stranded overhangs are more likely to occur than long overhangs, we expect to see a gradual decrease in the frequency of the modifications from position 1 to the inside of the reads.
+* If your library has been **partially-UDG treated**, only the first one or two bases will display the misincorporation frequency.
+* If your library has been **UDG treated**, you should expect to see extremely low to no misincorporations at read ends.
+* If your library is **single-stranded**, you should expect to see only C to T misincorporations at both the 5' and 3' ends of the fragments.
+* We generally expect that the older the sample, or the less ideal the preservation environment (hot/wet), the greater the frequency of C to T/G to A misincorporations.
+* The curve will not be smooth when you have few reads informing the frequency calculation. Read counts of less than 500 are likely not reliable.

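The quantity behind these misincorporation plots can be sketched as follows (an illustrative reconstruction with invented counts, not DamageProfiler code — the real values are in DamageProfiler's `.txt` output): per read position, the number of observed C→T substitutions divided by the number of reference Cs at that position.

```python
# Sketch of the per-position misincorporation frequency; counts are invented
# to show the expected decay from the 5' end inwards.

def ct_frequency(ct_counts, ref_c_counts):
    """Per-position C->T frequency; positions with no reference C give 0.0."""
    return [ct / c if c else 0.0 for ct, c in zip(ct_counts, ref_c_counts)]

# A 'damaged' 5' end: high at position 1, decaying towards the read interior
freqs = ct_frequency([150, 90, 40, 20, 10], [500, 500, 500, 500, 500])
```

With only a few hundred reference Cs per position, each count contributes a visible step — which is why curves built from fewer than ~500 reads look jagged.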
@@ -535,9 +531,9 @@ The MultiQC DamageProfiler module length distribution plots show the frequency o

When looking at the length distribution plots, keep in mind the following:

-- Your curves will likely not start at 0, and will start wherever your minimum read-length setting was when removing adapters.
-- You should typically see the bulk of the distribution falling between 40-120bp, which is normal for aDNA
-- You may see large peaks at paired-end turn-arounds, due to very-long reads that could not overlap for merging being present, however this reads are normally from modern contamination.
+* Your curves will likely not start at 0, but wherever your minimum read-length setting was when removing adapters.
+* You should typically see the bulk of the distribution falling between 40-120bp, which is normal for aDNA.
+* You may see large peaks at paired-end turn-arounds, due to the presence of very long reads that could not overlap for merging; however, these reads are normally from modern contamination.

### QualiMap

@@ -565,14 +561,14 @@ The greater the number of bases covered at as high as possible fold coverage, th

Things to watch out for:

-- You will typically see a direct decay from the lowest coverage to higher. A large range of coverages along the X axis is potentially suspicious.
-- If you have stacking of reads i.e. a small region with an abnormally large amount of reads despite the rest of the reference being quite shallowly covered, this will artificially increase your coverage.
+* You will typically see a direct decay from the lowest coverage to higher. A large range of coverages along the X axis is potentially suspicious.
+* If you have stacking of reads, i.e. a small region with an abnormally large number of reads despite the rest of the reference being quite shallowly covered, this will artificially increase your coverage. This would be represented by a small peak much further along the X axis, away from the main distribution of reads.

#### Cumulative Genome Coverage

This plot shows how much of the genome in percentage (X axis) is covered by a given fold depth coverage (Y axis).

-An ideal plot for this is to see an increasing curve, representing larger greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage - something particular true for large genome such has for humans.
+An ideal plot here is an increasing curve, representing greater fractions of the genome being increasingly covered at higher depth. However, for low-coverage ancient DNA data, you will be more likely to see decreasing curves starting at a large percentage of the genome being covered at 0 fold coverage — something particularly true for large genomes such as humans.

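A quick back-of-envelope check against these coverage plots (our own sketch, not QualiMap's internals): mean fold coverage follows from read count, mean read length and genome size, and a Poisson (Lander-Waterman-style) model gives the expected fraction of the genome covered at least once. It assumes uniform mapping, which aDNA rarely achieves, so treat it as an upper bound on breadth.

```python
import math

# Back-of-envelope coverage arithmetic; values are illustrative.

def mean_fold_coverage(n_reads: int, mean_read_len: float, genome_size: int) -> float:
    """Expected mean depth: total mapped bases / genome size."""
    return n_reads * mean_read_len / genome_size

def expected_breadth(mean_cov: float) -> float:
    """Expected fraction of positions covered >= 1x under a Poisson model."""
    return 1.0 - math.exp(-mean_cov)

cov = mean_fold_coverage(10_000_000, 60, 3_000_000_000)  # 0.2x on a human-sized genome
breadth = expected_breadth(cov)                          # ~18% of sites hit at least once
```

If QualiMap reports a breadth far below this estimate for the same mean depth, that is consistent with the read-stacking behaviour described above.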
@@ -588,9 +584,9 @@ This plot shows the distribution of the frequency of reads at different GC conte

Things to watch out for:

-- This plot should normally show a normal distribution around the average GC content of your reference genome.
-- Bimodal peaks may represent lab-based artefacts that should be further investigated.
-- Skews of the peak to a higher GC content that the reference in Illumina dual-colour chemistry data (e.g. NextSeq or NovaSeq), may suggest long poly-G tails that are mapping to poly-G stretches of your genome. The nf-core/eager trimming option `--complexity_filter_poly_g` can be used to remove these tails by utilising the tool FastP for detection and trimming.
+* This plot should normally show a normal distribution around the average GC content of your reference genome.
+* Bimodal peaks may represent lab-based artefacts that should be further investigated.
+* Skews of the peak to a higher GC content than the reference in Illumina dual-colour chemistry data (e.g. NextSeq or NovaSeq) may suggest long poly-G tails that are mapping to poly-G stretches of your genome. The nf-core/eager trimming option `--complexity_filter_poly_g` can be used to remove these tails by utilising the tool FastP for detection and trimming.

### Sex.DetERRmine

@@ -636,7 +632,7 @@ This table shows the contents of the `snpStatistics.tsv` file produced by MultiV

You can get different variants of the call statistics bar plot, depending on how you configured the MultiVCFAnalyzer options.

-If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), therefore a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis) as you may accidentally pick up SNPs from other taxa/strains - particularly when dealing with low coverage data. Therefore if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).
+If you ran with `--min_allele_freq_hom` and `--min_allele_freq_het` set to two different values (left panel A in the figure below), this allows you to assess the number of multi-allelic positions that were called in your genome. Typically MultiVCFAnalyzer is used for analysing smallish haploid genomes (such as mitochondrial or bacterial genomes), so a position with multiple possible 'alleles' suggests some form of cross-mapping from other taxa or the presence of multiple strains. If this is the case, you will need to be careful with downstream analysis of the consensus sequence (e.g. for phylogenetic tree analysis), as you may accidentally pick up SNPs from other taxa/strains — particularly when dealing with low coverage data. Therefore, if you have a high level of 'het' values (see image), you should carefully check your alignments manually to see how clean your genomes are, or whether you can do some form of strain separation (e.g. by majority/minority calling).

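The thresholding implied by the two flags can be sketched as follows. This is a hypothetical illustration of the logic, **not** MultiVCFAnalyzer source code — the function name, the `ref` fallback behaviour, and the example thresholds (0.9/0.1) are our assumptions:

```python
# Hypothetical sketch of frequency-based call classification; not taken from
# MultiVCFAnalyzer. Thresholds 0.9/0.1 mirror the two-flag setup described above.

def classify_position(alt_reads: int, total_reads: int,
                      min_freq_hom: float = 0.9,
                      min_freq_het: float = 0.1) -> str:
    freq = alt_reads / total_reads
    if freq >= min_freq_hom:
        return "hom"  # confidently call the alternative allele
    if freq >= min_freq_het:
        return "het"  # multi-allelic signal: cross-mapping or mixed strains?
    return "ref"      # too little support; keep the reference allele

calls = [classify_position(a, t) for a, t in [(95, 100), (40, 100), (3, 100)]]
```

Under this scheme, a large band of positions falling between the two thresholds is exactly the 'high het' situation the text warns about.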
@@ -650,32 +646,31 @@ This section gives a brief summary of where to look for what files for downstrea Each module has it's own output directory which sit alongside the `MultiQC/` directory from which you opened the report. -- `reference_genome/` - this directory contains the indexing files of your input reference genome (i.e. the various `bwa` indices, a `samtools`' `.fai` file, and a picard `.dict`), if you used the `--saveReference` flag. -- `fastqc/` - this contains the original per-FASTQ FastQC reports that are summarised with MultiQC. These occur in both `html` (the report) and `.zip` format (raw data). The `after_clipping` folder contains the same but for after AdapterRemoval. -- `adapterremoval/` - this contains the log files (ending with `.settings`) with raw trimming (and merging) statistics after AdapterRemoval. In the `output` sub-directory, are the output trimmed (and merged) FASTQ files. These you can use for downstream applications such as taxonomic binning for metagenomic studies. -- `mapping/` - this contains a sub-directory corresponding to the mapping tool you used, inside of which will be the initial BAM files containing the reads that mapped to your reference genome with no modification (see below). You will also find a corresponding BAM index file (ending in `.csi` or `.bam`), and if running the `bowtie2` mapper - a log ending in `_bt2.log`. You can use these for downstream applications e.g. if you wish to use a different de-duplication tool not included in nf-core/eager (although please feel free to add a new module request on the Github repository's [issue page](https://github.com/nf-core/eager/issues)!). -- `samtools/` - this contains two sub-directories. `stats/` contain the raw mapping statistics files (ending in `.stats`) from directly after mapping. `filter/` contains BAM files that have had a mapping quality filter applied (set by the `--bam_mapping_quality_threshold` flag) and a corresponding index file. 
Furthermore, if you selected `--bam_discard_unmapped`, you will find your separate file with only unmapped reads in the format you selected. Note unmapped read BAM files will _not_ have an index file. -- `deduplication/` - this contains a sub-directory called `dedup/`, inside here are sample specific directories. Each directory contains a BAM file containing mapped reads but with PCR duplicates removed, a corresponding index file and two stats file. `.hist.` contains raw data for a deduplication histogram used for tools like preseq (see below), and the `.log` contains overall summary deduplication statistics. -- `endorSpy/` - this contains all JSON files exported from the endorSpy endogenous DNA calculation tool. The JSON files are generated specifically for display in the MultiQC general statistics table and is otherwise very likely not useful for you. -- `preseq/` - this contains a `.ccurve` file for every BAM file that had enough deduplication statistics to generate a complexity curve for estimating the amount unique reads that will be yield if the library is re-sequenced. You can use this file for plotting e.g. in `R` to find your sequencing target depth. -- `qualimap/` - this contains a sub-directory for every sample, which includes a qualimap report and associated raw statistic files. You can open the `.html` file in your internet browser to see the in-depth report (this will be more detailed than in MultiQC). This includes stuff like percent coverage, depth coverage, GC content and so on of your mapped reads. -- `damageprofiler/` - this contains sample specific directories containing raw statistics and damage plots from DamageProfiler. The `.pdf` files can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files. 
-- `pmdtools/` - this contains raw output statistics of pmdtools (estimates of frequencies of substitutions), and BAM files which have been filtered to remove reads that do not have a Post-mortem damage (PMD) score of `--pmdtools_threshold`. -- `trimmed_bam/` - this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_half_udg_left`, `--bamutils_clip_half_udg_right`, `--bamutils_clip_none_udg_left`, and `--bamutils_clip_none_udg_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read). -- `damage_rescaling/` - this contains rescaled BAM files from mapDamage2. These BAM files have damage probabilistically removed via a bayesian model, and can be used for downstream genotyping. -- `genotyping/` - this contains all the (gzipped) genotyping files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have files corresponding to each of your deduplicated BAM files (except pileupcaller), or any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). If `--gatk_ug_keep_realign_bam` supplied, this may also contain BAM files from InDel realignment when using GATK 3 and UnifiedGenotyping for variant calling. When pileupcaller is used to create eigenstrat genotypes, this directory also contains eigenstrat SNP coverage statistics. -- `multivcfanalyzer/` - this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files. -- `sex_determination/` - this contains the output for the sex determination run. 
This is a single `.tsv` file that includes a table with the sample name, the number of autosomal SNPs, number of SNPs on the X/Y chromosome, the number of reads mapping to the autosomes, the number of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per file. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer. -- `nuclear_contamination/` - this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt` which is a summary table of the results for all individual. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimate (with their respective standard errors) for both Method1 and Method2. -- `bedtools/` - this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`), contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position). -- `metagenomic_complexity_filter` - this contains the output from filtering of input reads to metagenomic classification of low-sequence complexity reads as performed by `bbduk`. This will include the filtered FASTQ files (`*_lowcomplexityremoved.fq.gz`) and also the run-time log (`_bbduk.stats`) for each sample. 
**Note:** there are no sections in the MultiQC report for this module, therefore you must check the `._bbduk.stats` files to get summary statistics of the filtering. -- `metagenomic_classification/` - this contains the output for a given metagenomic classifier. - - Running MALT will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additional a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc. This will also include gzip SAM files if requested. - - Running kraken will contain the Kraken output and report files, as well as a merged Taxon count table. You will also get a Kraken kmer duplication table, in a [KrakenUniq](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0) fashion. This is very useful to check for breadth of coverage and detect read stacking. A small number of aligned reads (low coverage) and a kmer duplication >1 is usually a sign of read stacking, usually indicative of a false positive hit (e.g. from over-amplified libraries). *Kmer duplication is defined as: number of kmers / number of unique kmers*. You will find two kraken reports formats available: - - the `*.kreport` which is the old report format, without distinct minimizer count information, used by some tools such as [Pavian](https://github.com/fbreitwieser/pavian) - - the `*.kraken2_report` which is the new kraken report format, with the distinct minimizer count information. - - Finally, the `*.kraken.out` file are the direct output of Kraken2 -- `maltextract/` - this contains a `results` directory in which contains the output from MaltExtract - typically one folder for each filter type, an error and a log file. The characteristics of each node (e.g. 
damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA) -- `consensus_sequence/` - this contains three FASTA files from VCF2Genome of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAS (`_refmod.fasta.gz`) and (`_uncertainity.fasta.gz`) are IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively. -- `librarymerged_bams/` - these contain the final BAM files that would go into genotyping (if genotyping is turned on). This means the files will contain all libraries of a given sample (including trimmed non-UDG or half-UDG treated libraries, if BAM trimming turned on) +* `reference_genome/`: this directory contains the indexing files of your input reference genome (i.e. the various `bwa` indices, a `samtools` `.fai` file, and a Picard `.dict`), if you used the `--saveReference` flag. +* `fastqc/`: this contains the original per-FASTQ FastQC reports that are summarised with MultiQC. These occur in both `.html` (the report) and `.zip` (raw data) formats. The `after_clipping` folder contains the same but for after AdapterRemoval. +* `adapterremoval/`: this contains the log files (ending with `.settings`) with raw trimming (and merging) statistics after AdapterRemoval. The `output` sub-directory contains the trimmed (and merged) FASTQ files, which you can use for downstream applications such as taxonomic binning for metagenomic studies.
+* `mapping/`: this contains a sub-directory corresponding to the mapping tool you used, inside of which will be the initial BAM files containing the reads that mapped to your reference genome with no modification (see below). You will also find a corresponding BAM index file (ending in `.csi` or `.bai`), and if running the `bowtie2` mapper: a log ending in `_bt2.log`. You can use these for downstream applications e.g. if you wish to use a different de-duplication tool not included in nf-core/eager (although please feel free to add a new module request on the GitHub repository's [issue page](https://github.com/nf-core/eager/issues)!). +* `samtools/`: this contains two sub-directories. `stats/` contains the raw mapping statistics files (ending in `.stats`) from directly after mapping. `filter/` contains BAM files that have had a mapping quality filter applied (set by the `--bam_mapping_quality_threshold` flag) and a corresponding index file. Furthermore, if you selected `--bam_discard_unmapped`, you will find your separate file with only unmapped reads in the format you selected. Note that unmapped-read BAM files will _not_ have an index file. +* `deduplication/`: this contains a sub-directory called `dedup/`, inside which are sample-specific directories. Each directory contains a BAM file containing mapped reads but with PCR duplicates removed, a corresponding index file and two stats files: the `.hist` file contains raw data for a deduplication histogram used by tools like preseq (see below), and the `.log` file contains overall summary deduplication statistics. +* `endorSpy/`: this contains all JSON files exported from the endorSpy endogenous DNA calculation tool. The JSON files are generated specifically for display in the MultiQC general statistics table and are otherwise very likely not useful for you.
+* `preseq/`: this contains a `.ccurve` file for every BAM file that had enough deduplication statistics to generate a complexity curve for estimating the number of unique reads that would be yielded if the library were re-sequenced. You can use this file for plotting e.g. in `R` to find your sequencing target depth. +* `qualimap/`: this contains a sub-directory for every sample, which includes a Qualimap report and associated raw statistic files. You can open the `.html` file in your internet browser to see the in-depth report (this will be more detailed than in MultiQC). This includes metrics such as percent coverage, depth coverage and GC content of your mapped reads. +* `damageprofiler/`: this contains sample-specific directories containing raw statistics and damage plots from DamageProfiler. The `.pdf` files can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the `.txt` files. +* `pmdtools/`: this contains raw output statistics of pmdtools (estimates of frequencies of substitutions), and BAM files which have been filtered to remove reads that do not have a post-mortem damage (PMD) score of at least `--pmdtools_threshold`. +* `trimmed_bam/`: this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_half_udg_left`, `--bamutils_clip_half_udg_right`, `--bamutils_clip_none_udg_left`, and `--bamutils_clip_none_udg_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read). +* `damage_rescaling/`: this contains rescaled BAM files from mapDamage2. These BAM files have damage probabilistically removed via a Bayesian model, and can be used for downstream genotyping.
+* `genotyping/`: this contains all the (gzipped) genotyping files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have files corresponding to each of your deduplicated BAM files (except pileupcaller), or any turned-on downstream processes that create BAMs (e.g. trimmed BAMs or pmdtools). If `--gatk_ug_keep_realign_bam` is supplied, this may also contain BAM files from InDel realignment when using GATK 3 and UnifiedGenotyper for variant calling. When pileupcaller is used to create eigenstrat genotypes, this directory also contains eigenstrat SNP coverage statistics. +* `multivcfanalyzer/`: this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files. +* `sex_determination/`: this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the sample name, the number of autosomal SNPs, number of SNPs on the X/Y chromosome, the number of reads mapping to the autosomes, the number of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each BAM file, one row per file. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer. +* `nuclear_contamination/`: this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt`, which is a summary table of the results for all individuals. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprised of the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimates (with their respective standard errors) for both Method1 and Method2. +* `bedtools/`: this contains two files as the output from bedtools coverage.
One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`) contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position). +* `metagenomic_complexity_filter/`: this contains the output from `bbduk` filtering of low-sequence-complexity reads from the input to metagenomic classification. This will include the filtered FASTQ files (`*_lowcomplexityremoved.fq.gz`) and also the run-time log (`_bbduk.stats`) for each sample. **Note:** there are no sections in the MultiQC report for this module, therefore you must check the `_bbduk.stats` files to get summary statistics of the filtering. +* `metagenomic_classification/`: this contains the output for a given metagenomic classifier. + * Running MALT will produce RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additionally, a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc. This will also include gzipped SAM files if requested. + * Running Kraken will produce the Kraken output and report files, as well as a merged Taxon count table. You will also get a Kraken kmer duplication table, in a [KrakenUniq](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0) fashion. This is very useful to check for breadth of coverage and detect read stacking. A small number of aligned reads (low coverage) and a kmer duplication >1 is usually a sign of read stacking, usually indicative of a false positive hit (e.g. from over-amplified libraries).
*Kmer duplication is defined as: number of kmers / number of unique kmers*. You will find two Kraken report formats available: + * the `*.kreport`, which is the old report format, without distinct minimizer count information, used by some tools such as [Pavian](https://github.com/fbreitwieser/pavian) + * the `*.kraken2_report`, which is the new Kraken report format, with the distinct minimizer count information. + * finally, the `*.kraken.out` files are the direct output of Kraken2 +* `maltextract/`: this contains a `results` directory which contains the output from MaltExtract - typically one folder for each filter type, an error and a log file. The characteristics of each node (e.g. damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA). +* `consensus_sequence/`: this contains three FASTA files from VCF2Genome of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAs (`_refmod.fasta.gz`) and (`_uncertainity.fasta.gz`) are IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively. +* `librarymerged_bams/`: these contain the final BAM files that would go into genotyping (if genotyping is turned on). This means the files will contain all libraries of a given sample (including trimmed non-UDG or half-UDG treated libraries, if BAM trimming is turned on) diff --git a/docs/usage.md b/docs/usage.md index 7d13ff440..82ce25cc0 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -81,7 +81,7 @@ twice the amount of CPU and memory. This will occur two times before failing. Use this parameter to choose a configuration profile.
Profiles can give configuration presets for different compute environments. -Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Conda) - see below. +Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below. > We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported. @@ -92,22 +92,28 @@ They are loaded in sequence, so later profiles can overwrite earlier profiles. If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended. -- `docker` - - A generic configuration profile to be used with [Docker](https://docker.com/) - - Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) -- `singularity` - - A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/) - - Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) -- `podman` - - A generic configuration profile to be used with [Podman](https://podman.io/) - - Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) -- `conda` - - Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity or Podman. 
- - A generic configuration profile to be used with [Conda](https://conda.io/docs/) - - Pulls most software from [Bioconda](https://bioconda.github.io/) -- `test_tsv` - - A profile with a complete configuration for automated testing - - Includes links to test data so needs no other parameters +* `docker` + * A generic configuration profile to be used with [Docker](https://docker.com/) + * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) +* `singularity` + * A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/) + * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) +* `podman` + * A generic configuration profile to be used with [Podman](https://podman.io/) + * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) +* `shifter` + * A generic configuration profile to be used with [Shifter](https://nersc.gitlab.io/development/shifter/how-to-use/) + * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) +* `charliecloud` + * A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/) + * Pulls software from Docker Hub: [`nfcore/eager`](https://hub.docker.com/r/nfcore/eager/) +* `conda` + * Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud. + * A generic configuration profile to be used with [Conda](https://conda.io/docs/) + * Pulls most software from [Bioconda](https://bioconda.github.io/) +* `test_tsv` + * A profile with a complete configuration for automated testing + * Includes links to test data so needs no other parameters > *Important*: If running nf-core/eager on a cluster - ask your system > administrator what profile to use.
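For instance, profiles can be combined on the command line, with later profiles overwriting earlier ones. The following is a usage sketch only (which container profile you pick depends on your system):

```bash
# Run the automated-test configuration inside Docker containers.
# Replace 'docker' with singularity/podman/shifter/charliecloud/conda as needed.
nextflow run nf-core/eager -profile test_tsv,docker
```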
@@ -118,17 +124,17 @@ clusters**, and are centrally maintained at regular users of nf-core/eager, if you don't see your own institution here check the [nf-core/configs](https://github.com/nf-core/configs) repository. -- `uzh` - - A profile for the University of Zurich Research Cloud - - Loads Singularity and defines appropriate resources for running the +* `uzh` + * A profile for the University of Zurich Research Cloud + * Loads Singularity and defines appropriate resources for running the pipeline. -- `binac` - - A profile for the BinAC cluster at the University of Tuebingen 0 Loads +* `binac` + * A profile for the BinAC cluster at the University of Tuebingen. Loads Singularity and defines appropriate resources for running the pipeline -- `shh` - - A profile for the S/CDAG cluster at the Department of Archaeogenetics of +* `shh` + * A profile for the S/CDAG cluster at the Department of Archaeogenetics of the Max Planck Institute for the Science of Human History - - Loads Singularity and defines appropriate resources for running the pipeline + * Loads Singularity and defines appropriate resources for running the pipeline **Pipeline Specific Institution Profiles** There are also pipeline-specific institution profiles. I.e., we can also offer a profile which sets special
This can be seen at We currently offer a nf-core/eager specific profile for -- `shh` - - A profiler for the S/CDAG cluster at the Department of Archaeogenetics of +* `shh` + * A profile for the S/CDAG cluster at the Department of Archaeogenetics of the Max Planck Institute for the Science of Human History - - In addition to the nf-core wide profile, this also sets the MALT resources + * In addition to the nf-core wide profile, this also sets the MALT resources to match our commonly used databases Further institutions can be added at @@ -181,6 +187,8 @@ process { } ``` +To find the exact name of a process whose compute resources you wish to modify, check the live status of a Nextflow run displayed in your terminal, or check the Nextflow error for a line such as: `Error executing process > 'bwa'`. In this case the name to specify in the custom config file is `bwa`. + See the main [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for more information. If you are likely to be running `nf-core` pipelines regularly it may be a good @@ -277,7 +285,7 @@ If you have multiple files in different directories, you can use additional wild 4. When using the pipeline with **paired end data**, the path must use `{1,2}` notation to specify read pairs. 5. File names must be unique; having files with the same name, but in different directories is _not_ sufficient - - This can happen when a library has been sequenced across two sequencers on the same lane. Either rename the file, try a symlink with a unique name, or merge the two FASTQ files prior input. + * This can happen when a library has been sequenced across two sequencers on the same lane. Either rename the file, try a symlink with a unique name, or merge the two FASTQ files prior to input. 6. Due to limitations of downstream tools (e.g. FastQC), sample IDs may be truncated after the first `.` in the name. Ensure file names are unique prior to this! 7.
For input BAM files you should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses. @@ -309,17 +317,17 @@ When using TSV_input, nf-core/eager will merge FASTQ files of libraries with the Column descriptions are as follows: -- **Sample_Name:** A text string containing the name of a given sample of which there can be multiple libraries. All libraries with the same sample name and same SeqType will be merged after deduplication. -- **Library_ID:** A text string containing a given library, which there can be multiple sequencing lanes (with the same SeqType). -- **Lane:** A number indicating which lane the library was sequenced on. Files from the libraries sequenced on different lanes (and different SeqType) will be concatenated after read clipping and merging. -- **Colour Chemistry** A number indicating whether the Illumina sequencer the library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour chemistry machine. This informs whether poly-G trimming (if turned on) should be performed. -- **SeqType:** A text string of either 'PE' or 'SE', specifying paired end (with both an R1 [or forward] and R2 [or reverse]) and single end data (only R1 [forward], or BAM). This will affect lane merging if different per library. -- **Organism:** A text string of the organism name of the sample or 'NA'. This currently has no functionality and can be set to 'NA', but will affect lane/library merging if different per library -- **Strandedness:** A text string indicating whether the library type is'single' or 'double'. This will affect lane/library merging if different per library. -- **UDG_Treatment:** A text string indicating whether the library was generated with UDG treatment - either 'full', 'half' or 'none'. 
Will affect lane/library merging if different per library. -- **R1:** A text string of a file path pointing to a forward or R1 FASTQ file. This can be used with the R2 column. File names **must be unique**, even if they are in different directories. -- **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file, or 'NA' when single end data. This can be used with the R1 column. File names **must be unique**, even if they are in different directories. -- **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot be specified at the same time as R1 or R2, both of which should be set to 'NA' +* **Sample_Name:** A text string containing the name of a given sample of which there can be multiple libraries. All libraries with the same sample name and same SeqType will be merged after deduplication. +* **Library_ID:** A text string containing a given library, of which there can be multiple sequencing lanes (with the same SeqType). +* **Lane:** A number indicating which lane the library was sequenced on. Files from the libraries sequenced on different lanes (and different SeqType) will be concatenated after read clipping and merging. +* **Colour Chemistry:** A number indicating whether the Illumina sequencer the library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour chemistry machine. This informs whether poly-G trimming (if turned on) should be performed. +* **SeqType:** A text string of either 'PE' or 'SE', specifying paired end (with both an R1 [or forward] and R2 [or reverse]) and single end data (only R1 [forward], or BAM). This will affect lane merging if different per library. +* **Organism:** A text string of the organism name of the sample or 'NA'. This currently has no functionality and can be set to 'NA', but will affect lane/library merging if different per library. +* **Strandedness:** A text string indicating whether the library type is 'single' or 'double'.
This will affect lane/library merging if different per library. +* **UDG_Treatment:** A text string indicating whether the library was generated with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library merging if different per library. +* **R1:** A text string of a file path pointing to a forward or R1 FASTQ file. This can be used with the R2 column. File names **must be unique**, even if they are in different directories. +* **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file, or 'NA' when single end data. This can be used with the R1 column. File names **must be unique**, even if they are in different directories. +* **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot be specified at the same time as R1 or R2, both of which should be set to 'NA' For example, the following TSV table: @@ -332,32 +340,32 @@ For example, the following TSV table: will have the following effects: -- After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8 _with the same `SeqType`_ (and all other _metadata_ columns) will be concatenated together for each **Library**. -- After mapping, and prior BAM filtering, BAM files with different `SeqType` (but with all other metadata columns the same) will be merged together for each **Library**. -- After duplicate removal, BAM files with different `Library_ID`s but with the same `Sample_Name` and the same `UDG_Treatment` will be merged together. -- If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and half-UDG ) will be merged with UDG-treated (untreated) BAMs, if they have the same `Sample_Name`. +* After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8 _with the same `SeqType`_ (and all other _metadata_ columns) will be concatenated together for each **Library**. 
+* After mapping, and prior to BAM filtering, BAM files with different `SeqType` (but with all other metadata columns the same) will be merged together for each **Library**. +* After duplicate removal, BAM files with different `Library_ID`s but with the same `Sample_Name` and the same `UDG_Treatment` will be merged together. +* If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and half-UDG) will be merged with UDG-treated (untreated) BAMs, if they have the same `Sample_Name`. Note the following important points and limitations for setting up: -- The TSV must use actual tabs (not spaces) between cells. -- *File* names must be unique regardless of file path, due to risk of over-writing (see: [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)). - - If it is 'too late' and you already have duplicate file names, a workaround is to concatenate the FASTQ files together and supply this to a nf-core/eager run. The only downside is that you will not get independent FASTQC results for each file. -- Lane IDs must be unique for each sequencing of each library. - - If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will still be processed correctly. - - This also applies to the SeqType column, i.e. with the example above, if one run is PE and one run is SE, you need to give fake lane IDs to one of the runs as well. -- All _BAM_ files must be specified as `SE` under `SeqType`. - - You should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses.
-- nf-core/eager will only merge multiple _lanes_ of sequencing runs with the same single-end or paired-end configuration -- Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files (unless you use `--run_convertbam`), as only FASTQ files are lane-merged together. -- Same libraries that are sequenced on different sequencing configurations (i.e single- and paired-end data), will be merged after mapping and will _always_ be considered 'paired-end' during downstream processes - - **Important** running DeDup in this context is _not_ recommended, as PE and SE data at the same position will _not_ be evaluated as duplicates. Therefore not all duplicates will be removed. - - When you wish to run PE/SE data together `-dedupper markduplicates` is therefore preferred. - - An error will be thrown if you try to merge both PE and SE and also supply `--skip_merging`. - - If you truly want to mix SE data and PE data but using mate-pair info for PE mapping, please run FASTQ preprocessing mapping manually and supply BAM files for downstream processing by nf-core/eager - - If you _regularly_ want to run the situation above, please leave a feature request on github. -- DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on each unique library separately after deduplication (but prior same-treated library merging). -- nf-core/eager functionality such as `--run_trim_bam` will be applied to only non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries. - Qualimap is run on each sample, after merging of libraries (i.e. your values will reflect the values of all libraries combined - after being damage trimmed etc.). -- Genotyping will be typically performed on each `sample` independently, as normally all libraries will have been merged together. However, if you have a mixture of single-stranded and double-stranded libraries, you will normally need to genotype separately. 
In this case you **must** give each the SS and DS libraries _distinct_ `Sample_IDs`; otherwise you will receive a `file collision` error in steps such as `sexdeterrmine`, and then you will need to merge these yourself. We will consider changing this behaviour in the future if there is enough interest. +* The TSV must use actual tabs (not spaces) between cells. +* *File* names must be unique regardless of file path, due to risk of over-writing (see: [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)). + * If it is 'too late' and you already have duplicate file names, a workaround is to concatenate the FASTQ files together and supply this to a nf-core/eager run. The only downside is that you will not get independent FASTQC results for each file. +* Lane IDs must be unique for each sequencing of each library. + * If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will still be processed correctly. + * This also applies to the SeqType column, i.e. with the example above, if one run is PE and one run is SE, you need to give fake lane IDs to one of the runs as well. +* All _BAM_ files must be specified as `SE` under `SeqType`. + * You should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses. +* nf-core/eager will only merge multiple _lanes_ of sequencing runs with the same single-end or paired-end configuration +* Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files (unless you use `--run_convertbam`), as only FASTQ files are lane-merged together. 
+* Same libraries that are sequenced on different sequencing configurations (i.e. single- and paired-end data) will be merged after mapping and will _always_ be considered 'paired-end' during downstream processes. + * **Important:** running DeDup in this context is _not_ recommended, as PE and SE data at the same position will _not_ be evaluated as duplicates. Therefore not all duplicates will be removed. + * When you wish to run PE/SE data together, `--dedupper markduplicates` is therefore preferred. + * An error will be thrown if you try to merge both PE and SE and also supply `--skip_merging`. + * If you truly want to mix SE data and PE data but use the mate-pair info for PE mapping, please run FASTQ preprocessing and mapping manually and supply BAM files for downstream processing by nf-core/eager. + * If you _regularly_ want to run the situation above, please leave a feature request on GitHub. +* DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on each unique library separately after deduplication (but prior to same-treated library merging). +* nf-core/eager functionality such as `--run_trim_bam` will be applied to only non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries. +* Qualimap is run on each sample, after merging of libraries (i.e. your values will reflect the values of all libraries combined - after being damage trimmed etc.). +* Genotyping will typically be performed on each `sample` independently, as normally all libraries will have been merged together. However, if you have a mixture of single-stranded and double-stranded libraries, you will normally need to genotype separately. In this case you **must** give the SS and DS libraries _distinct_ `Sample_IDs`; otherwise you will receive a `file collision` error in steps such as `sexdeterrmine`, and then you will need to merge these yourself. We will consider changing this behaviour in the future if there is enough interest.
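The tab requirement above is easy to violate when editors silently insert spaces, so it can be worth sanity-checking a TSV before launching a run. The following is an illustrative sketch only (not part of the pipeline - the file name and sample values are invented); it simply verifies that every row splits into the eleven tab-separated columns described above:

```shell
# Write a tiny example TSV with real tabs (printf '\t' guarantees tabs),
# mirroring the column scheme described above. All values are placeholders.
printf 'Sample_Name\tLibrary_ID\tLane\tColour_Chemistry\tSeqType\tOrganism\tStrandedness\tUDG_Treatment\tR1\tR2\tBAM\n' > input.tsv
printf 'Sample1\tSample1_Lib1\t7\t4\tPE\tNA\tdouble\tfull\ts1_R1.fq.gz\ts1_R2.fq.gz\tNA\n' >> input.tsv

# Exit non-zero and name the offending row if any line does not have
# exactly 11 tab-separated columns (e.g. because spaces were used).
awk -F'\t' 'NF != 11 { print "row " NR " has " NF " columns"; bad = 1 } END { exit bad }' input.tsv \
  && echo 'TSV columns look consistent'
```

A real input file would of course list your own libraries; the point is only that `awk -F'\t'` sees eleven fields per row when the separators are genuine tabs.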
## Clean up @@ -419,7 +427,7 @@ In some cases it maybe no output log is produced by a particular tool for MultiQ Known cases include: -- Qualimap: there will be no MultiQC output if the BAM file is empty. An empty BAM file is produced when no reads map to the reference and causes Qualimap to crash - this is crash is ignored by nf-core/eager (to allow the rest of the pipeline to continue) and will therefore have no log file for that particular sample/library +* Qualimap: there will be no MultiQC output if the BAM file is empty. An empty BAM file is produced when no reads map to the reference and causes Qualimap to crash - this crash is ignored by nf-core/eager (to allow the rest of the pipeline to continue) and will therefore have no log file for that particular sample/library ## Tutorials @@ -536,10 +544,10 @@ If you change into this with `cd` and run `ls -la` you should see a collection of normal files, symbolic links (symlinks) and hidden files (indicated with `.` at the beginning of the file name). -- Symbolic links: are typically input files from previous processes. -- Normal files: are typically successfully completed output files from some of +* Symbolic links: are typically input files from previous processes. +* Normal files: are typically successfully completed output files from some of the commands in the process -- Hidden files are Nextflow generated files and include the submission commands +* Hidden files are Nextflow generated files and include the submission commands as well as log files When you have an error run, you can firstly check the contents of the output @@ -596,9 +604,9 @@ DNA to map and cause false positive SNP calls.
Within nf-core, there are two main levels of configs -- Institutional-level profiles: these normally define things like paths to +* Institutional-level profiles: these normally define things like paths to common storage, resource maximums, scheduling system -- Pipeline-level profiles: these normally define parameters specifically for a +* Pipeline-level profiles: these normally define parameters specifically for a pipeline (such as mapping parameters, turning specific modules on or off) As well as allowing more efficiency and control at cluster or Institutional @@ -656,11 +664,11 @@ This would be translated as follows. If your parameters looked like the following -Parameter | Resolved Parameters | institution | cluster | my_paper -----------------|------------------------|-------------|----------|---------- ---executor | singularity | singularity | \ | \ ---max_memory | 256GB | 756GB | 256GB | \ ---bwa_aln | 0.1 | \ | 0.01 | 0.1 +| Parameter | Resolved Parameters | institution | cluster | my_paper | +| ----------------|------------------------|-------------|----------|----------| +| --executor | singularity | singularity | \ | \ | +| --max_memory | 256GB | 756GB | 256GB | \ | +| --bwa_aln | 0.1 | \ | 0.01 | 0.1 | (where '\' is a parameter not defined in a given profile.) 
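The resolution shown in the table above follows Nextflow's rule that later configuration sources override earlier ones for any shared parameter. A minimal sketch of how such stacked profiles could be declared (the file name and values are invented to mirror the table, not taken from a real institutional config):

```shell
# Three stacked profiles; when run with
#   nextflow run nf-core/eager -c profiles.config -profile institution,cluster,my_paper
# later profiles override earlier ones for shared keys, resolving to
# max_memory = 256.GB (from cluster) and bwa_aln = 0.1 (from my_paper),
# as in the table above.
cat > profiles.config <<'EOF'
profiles {
  institution { params.max_memory = '756.GB' }
  cluster     { params.max_memory = '256.GB'; params.bwa_aln = 0.01 }
  my_paper    { params.bwa_aln = 0.1 }
}
EOF

cat profiles.config
```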
@@ -735,14 +743,14 @@ the `hpc_blue` profile, but the `mapper` parameter has been changed from
 
 The order of loading of different configuration files can be seen here:
 
-Loading Order | Configuration File
--------------:|:-------------------
-1 | `nextflow.config` in your current directory,
-2 | (if using a script for `nextflow run`) a `nextflow.config` in the directory the script is located
-3 | `config` stored in your human directory under `~/.nextflow/`
-4 | `.config` if you specify in the `nextflow run` command with `-c`
-5 | general nf-core institutional configurations stored at [nf-core/configs](https://github.com/nf-core/configs)
-6 | pipeline-specific nf-core institutional configurations at [nf-core/configs](https://github.com/nf-core/configs)
+| Loading Order | Configuration File                                                                                                |
+| -------------:|:----------------------------------------------------------------------------------------------------------------|
+| 1             | `nextflow.config` in your current directory                                                                       |
+| 2             | (if using a script for `nextflow run`) a `nextflow.config` in the directory the script is located                 |
+| 3             | `config` stored in your home directory under `~/.nextflow/`                                                       |
+| 4             | a `.config` file specified in the `nextflow run` command with `-c`                                                |
+| 5             | general nf-core institutional configurations stored at [nf-core/configs](https://github.com/nf-core/configs)      |
+| 6             | pipeline-specific nf-core institutional configurations at [nf-core/configs](https://github.com/nf-core/configs)   |
 
 This loading order of these `.config` files will not normally affect the
 settings you use for the pipeline run itself; `-profiles` are normally more
@@ -1115,7 +1123,7 @@ nextflow run nf-core/eager \
 --fasta_index '../Reference/genome/hs37d5.fa.fai' \
 --seq_dict '../Reference/genome/hs37d5.dict' \
 --outdir './results/' \
-- w './work/' \
+-w './work/' \
 <...>
 ```
 
@@ -1144,7 +1152,7 @@ nextflow run nf-core/eager \
 --fasta_index '../Reference/genome/hs37d5.fa.fai' \
 --seq_dict '../Reference/genome/hs37d5.dict' \
--outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ <...> ``` @@ -1169,7 +1177,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ @@ -1194,7 +1202,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ @@ -1221,7 +1229,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ @@ -1251,7 +1259,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ @@ -1287,7 +1295,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ @@ -1321,7 +1329,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ @@ -1362,7 +1370,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ @@ -1404,7 +1412,7 @@ nextflow run 
nf-core/eager \ --fasta_index '../Reference/genome/hs37d5.fa.fai' \ --seq_dict '../Reference/genome/hs37d5.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --preserve5p \ --mergedonly \ @@ -1456,59 +1464,59 @@ For example, I normally look for things like: General Stats Table: -- Do I see the expected number of raw sequencing reads (summed across each set +* Do I see the expected number of raw sequencing reads (summed across each set of FASTQ files per library) that was requested for sequencing? -- Does the percentage of trimmed reads look normal for aDNA, and do lengths +* Does the percentage of trimmed reads look normal for aDNA, and do lengths after trimming look short as expected of aDNA? -- Does ClusterFactor or 'Dups' look high (e.g. >2 or >10% respectively) +* Does ClusterFactor or 'Dups' look high (e.g. >2 or >10% respectively) suggesting over-amplified or badly preserved samples? -- Do the mapped reads show increased frequency of C>Ts on the 5' end of +* Do the mapped reads show increased frequency of C>Ts on the 5' end of molecules? -- Is the number of SNPs used for nuclear contamination really low for any +* Is the number of SNPs used for nuclear contamination really low for any individuals (e.g. < 100)? If so, then the estimates might not be very accurate. FastQC (pre-AdapterRemoval): -- Do I see any very early drop off of sequence quality scores suggesting a +* Do I see any very early drop off of sequence quality scores suggesting a problematic sequencing run? -- Do I see outlier GC content distributions? -- Do I see high sequence duplication levels? +* Do I see outlier GC content distributions? +* Do I see high sequence duplication levels? AdapterRemoval: -- Do I see high numbers of singletons or discarded read pairs? +* Do I see high numbers of singletons or discarded read pairs? FastQC (post-AdapterRemoval): -- Do I see improved sequence quality scores along the length of reads? 
-- Do I see reduced adapter content levels? +* Do I see improved sequence quality scores along the length of reads? +* Do I see reduced adapter content levels? Samtools Flagstat (pre/post Filter): -- Do I see outliers, e.g. with unusually high levels of human DNA, (indicative +* Do I see outliers, e.g. with unusually high levels of human DNA, (indicative of contamination) that require downstream closer assessment? Are your samples exceptionally preserved? If not, a value higher than e.g. 50% might require your attention. DeDup/Picard MarkDuplicates: -- Do I see large numbers of duplicates being removed, possibly indicating +* Do I see large numbers of duplicates being removed, possibly indicating over-amplified or badly preserved samples? DamageProfiler: -- Do I see evidence of damage on human DNA? - - High numbers of mapped reads but no damage may indicate significant +* Do I see evidence of damage on human DNA? + * High numbers of mapped reads but no damage may indicate significant modern contamination. - - Was the read trimming I specified enough to overcome damage effects? + * Was the read trimming I specified enough to overcome damage effects? SexDetERRmine: -- Do the relative coverages on the X and Y chromosome fall within the expected +* Do the relative coverages on the X and Y chromosome fall within the expected areas of the plot? -- Do all individuals have enough data for accurate sex determination? -- Do the proportions of autosomal/X/Y reads make sense? If there is an +* Do all individuals have enough data for accurate sex determination? +* Do the proportions of autosomal/X/Y reads make sense? If there is an overrepresentation of reads within one bin, is the data enriched for that bin? 
> Detailed documentation and descriptions for all MultiQC modules can be seen in @@ -1735,7 +1743,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/GRCh38.fa.fai' \ --seq_dict '../Reference/genome/GRCh38.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ <...> ``` @@ -1764,7 +1772,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/GRCh38.fa.fai' \ --seq_dict '../Reference/genome/GRCh38.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ <...> ``` @@ -1785,7 +1793,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/GRCh38.fa.fai' \ --seq_dict '../Reference/genome/GRCh38.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --run_bam_filtering \ --bam_unmapped_type 'fastq' \ @@ -1815,7 +1823,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/GRCh38.fa.fai' \ --seq_dict '../Reference/genome/GRCh38.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --run_bam_filtering \ --bam_unmapped_type 'fastq' \ @@ -1842,7 +1850,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/GRCh38.fa.fai' \ --seq_dict '../Reference/genome/GRCh38.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --run_bam_filtering \ --bam_unmapped_type 'fastq' \ @@ -1899,58 +1907,58 @@ For example, I normally look for things like: General Stats Table: -- Do I see the expected number of raw sequencing reads (summed across each set +* Do I see the expected number of raw sequencing reads (summed across each set of FASTQ files per library) that was requested for sequencing? -- Does the percentage of trimmed reads look normal for aDNA, and do lengths +* Does the percentage of trimmed reads look normal for aDNA, and do lengths after trimming look short as expected of aDNA? 
-- Does ClusterFactor or 'Dups' look high suggesting over-amplified or
+* Does ClusterFactor or 'Dups' look high suggesting over-amplified or
 badly preserved samples (e.g. >2 or >10% respectively - however given this is
 on the human reads this is just a rule of thumb and may not reflect the
 quality of the metagenomic profile) ?
-- Does the human DNA show increased frequency of C>Ts on the 5' end of
+* Does the human DNA show increased frequency of C>Ts on the 5' end of
 molecules?
 
 FastQC (pre-AdapterRemoval):
 
-- Do I see any very early drop off of sequence quality scores suggesting
+* Do I see any very early drop off of sequence quality scores suggesting a
 problematic sequencing run?
-- Do I see outlier GC content distributions?
-- Do I see high sequence duplication levels?
+* Do I see outlier GC content distributions?
+* Do I see high sequence duplication levels?
 
 AdapterRemoval:
 
-- Do I see high numbers of singletons or discarded read pairs?
+* Do I see high numbers of singletons or discarded read pairs?
 
 FastQC (post-AdapterRemoval):
 
-- Do I see improved sequence quality scores along the length of reads?
-- Do I see reduced adapter content levels?
+* Do I see improved sequence quality scores along the length of reads?
+* Do I see reduced adapter content levels?
 
 MALT:
 
-- Do I have a reasonable level of mappability?
-  - Somewhere between 10-30% can be pretty normal for aDNA, whereas e.g. <1%
+* Do I have a reasonable level of mappability?
+  * Somewhere between 10-30% can be pretty normal for aDNA, whereas e.g. <1%
 requires careful manual assessment
-- Do I have a reasonable taxonomic assignment success?
-  - You hope to have a large number of the mapped reads (from the mappability
+* Do I have a reasonable taxonomic assignment success?
+  * You hope to have a large number of the mapped reads (from the mappability
 plot) that also have taxonomic assignment.
 
 Samtools Flagstat (pre/post Filter):
 
-- Do I see outliers, e.g.
with unusually high levels of human DNA, (indicative
+* Do I see outliers, e.g. with unusually high levels of human DNA, (indicative
 of contamination) that require downstream closer assessment?
 
 DeDup/Picard MarkDuplicates:
 
-- Do I see large numbers of duplicates being removed, possibly indicating
+* Do I see large numbers of duplicates being removed, possibly indicating
 over-amplified or badly preserved samples?
 
 DamageProfiler:
 
-- Do I see evidence of damage on human DNA? Note this is just a
+* Do I see evidence of damage on human DNA? Note this is just a
 rule-of-thumb/corroboration of any signals you might find in the metagenomic
 screening and not essential.
-  - If you have high numbers of human DNA reads but no damage may indicate
+  * If you have high numbers of human DNA reads but no damage, this may indicate
 significant modern contamination.
 
 > Detailed documentation and descriptions for all MultiQC modules can be seen in
 
@@ -2199,7 +2207,7 @@ nextflow run nf-core/eager \
 --fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
 --seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
 --outdir './results/' \
-- w './work/' \
+-w './work/' \
 <...>
 ```
 
@@ -2228,7 +2236,7 @@ nextflow run nf-core/eager \
 --fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
 --seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
 --outdir './results/' \
-- w './work/' \
+-w './work/' \
 --complexity_filter_poly_g \
 <...>
 ```
 
@@ -2252,7 +2260,7 @@ nextflow run nf-core/eager \
 --fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \
 --seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \
 --outdir './results/' \
-- w './work/' \
+-w './work/' \
 --complexity_filter_poly_g \
 --bwaalnn 0.01 \
 --bwaalnl 16 \
@@ -2276,7 +2284,7 @@ nextflow run nf-core/eager \
 --fasta_index
'../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \ --seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --bwaalnn 0.01 \ --bwaalnl 16 \ @@ -2306,7 +2314,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \ --seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --bwaalnn 0.01 \ --bwaalnl 16 \ @@ -2337,7 +2345,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \ --seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --bwaalnn 0.01 \ --bwaalnl 16 \ @@ -2375,7 +2383,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \ --seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --bwaalnn 0.01 \ --bwaalnl 16 \ @@ -2416,7 +2424,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \ --seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ --bwaalnn 0.01 \ --bwaalnl 16 \ @@ -2459,7 +2467,7 @@ nextflow run nf-core/eager \ --fasta_index '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.fai' \ --seq_dict '../Reference/genome/Yersinia_pestis_C092_GCF_000009065.1_ASM906v1.fa.dict' \ --outdir './results/' \ -- w './work/' \ +-w './work/' \ --complexity_filter_poly_g \ 
--bwaalnn 0.01 \
--bwaalnl 16 \
@@ -2521,80 +2529,80 @@ results.
 
 For example, I normally look for things like:
 
 General Stats Table:
 
-- Do I see the expected number of raw sequencing reads (summed across each set
+* Do I see the expected number of raw sequencing reads (summed across each set
 of FASTQ files per library) that was requested for sequencing?
-- Does the percentage of trimmed reads look normal for aDNA, and do lengths
+* Does the percentage of trimmed reads look normal for aDNA, and do lengths
 after trimming look short as expected of aDNA?
-- Does the Endogenous DNA (%) columns look reasonable (high enough to indicate
+* Do the Endogenous DNA (%) columns look reasonable (high enough to indicate
 you have received enough coverage for downstream, and/or do you lose an
 unusually high number of reads after filtering?)
-- Does ClusterFactor or '% Dups' look high (e.g. >2 or >10% respectively - high
+* Does ClusterFactor or '% Dups' look high (e.g. >2 or >10% respectively - high
 values suggesting over-amplified or badly preserved samples i.e. low
 complexity; note that genome-enrichment libraries may by their nature look
 higher).
-- Do you see an increased frequency of C>Ts on the 5' end of molecules in the
+* Do you see an increased frequency of C>Ts on the 5' end of molecules in the
 mapped reads?
-- Do median read lengths look relatively low (normally <= 100 bp) indicating
+* Do median read lengths look relatively low (normally <= 100 bp) indicating
 typically fragmented aDNA?
-- Does the % coverage decrease relatively gradually at each depth coverage, and
+* Does the % coverage decrease relatively gradually at each depth coverage, and
 does not drop extremely drastically?
-- Does the Median coverage and percent >3x (or whatever you set) show sufficient
+* Does the Median coverage and percent >3x (or whatever you set) show sufficient
 coverage for reliable SNP calls and that a good proportion of the genome is
 covered indicating you have the right reference genome?
-- Do you see a high proportion of % Hets, indicating many multi-allelic sites
+* Do you see a high proportion of % Hets, indicating many multi-allelic sites
 (and possibly presence of cross-mapping from other species, that may lead to
 false positive or less confident SNP calls)?
 
 FastQC (pre-AdapterRemoval):
 
-- Do I see any very early drop off of sequence quality scores suggesting
+* Do I see any very early drop off of sequence quality scores suggesting a
 problematic sequencing run?
-- Do I see outlier GC content distributions?
-- Do I see high sequence duplication levels?
+* Do I see outlier GC content distributions?
+* Do I see high sequence duplication levels?
 
 AdapterRemoval:
 
-- Do I see high numbers of singletons or discarded read pairs?
+* Do I see high numbers of singletons or discarded read pairs?
 
 FastQC (post-AdapterRemoval):
 
-- Do I see improved sequence quality scores along the length of reads?
-- Do I see reduced adapter content levels?
+* Do I see improved sequence quality scores along the length of reads?
+* Do I see reduced adapter content levels?
 
 Samtools Flagstat (pre/post Filter):
 
-- Do I see outliers, e.g. with unusually low levels of mapped reads, (indicative
+* Do I see outliers, e.g. with unusually low levels of mapped reads, (indicative
 of badly preserved samples) that require downstream closer assessment?
 
 DeDup/Picard MarkDuplicates:
 
-- Do I see large numbers of duplicates being removed, possibly indicating
+* Do I see large numbers of duplicates being removed, possibly indicating
 over-amplified or badly preserved samples?
 
 PreSeq:
 
-- Do I see a large drop off of a sample's curve away from the theoretical
+* Do I see a large drop off of a sample's curve away from the theoretical
 complexity? If so, this may indicate it's not worth performing deeper
 sequencing as you will get few unique reads (vs.
duplicates that are not any more informative than the reads you've already
 sequenced)
 
 DamageProfiler:
 
-- Do I see evidence of damage on the microbial DNA (i.e. a % C>T of more than ~5% in
+* Do I see evidence of damage on the microbial DNA (i.e. a % C>T of more than ~5% in
 the first few nucleotide positions)? If not, possibly your mapped reads are
 deriving from modern contamination.
 
 QualiMap:
 
-- Do you see a peak of coverage (X) at a good level, e.g. >= 3x, indicating
+* Do you see a peak of coverage (X) at a good level, e.g. >= 3x, indicating
 sufficient coverage for reliable SNP calls?
 
 MultiVCFAnalyzer:
 
-- Do I have a good number of called SNPs that suggest the samples have genomes
+* Do I have a good number of called SNPs that suggest the samples have genomes
 with sufficient nucleotide diversity to inform phylogenetic analysis?
-- Do you have a large number of discarded SNP calls?
-- Are the % Hets very high indicating possible cross-mapping from off-target
+* Do you have a large number of discarded SNP calls?
+* Are the % Hets very high indicating possible cross-mapping from off-target
 organisms that may confound variant calling?
> Detailed documentation and descriptions for all MultiQC modules can be seen in diff --git a/environment.yml b/environment.yml index 3475472e0..a55929d30 100644 --- a/environment.yml +++ b/environment.yml @@ -1,6 +1,6 @@ # You can use this file to create a conda environment for this pipeline: # conda env create -f environment.yml -name: nf-core-eager-2.3.2 +name: nf-core-eager-2.3.3 channels: - conda-forge - bioconda @@ -26,7 +26,7 @@ dependencies: - bioconda::qualimap=2.2.2d - bioconda::vcf2genome=0.91 - bioconda::damageprofiler=0.4.9 # Don't upgrade - later versions don't allow java 8 - - bioconda::multiqc=1.10 + - bioconda::multiqc=1.10.1 - bioconda::pmdtools=0.60 - bioconda::bedtools=2.29.2 - conda-forge::libiconv=1.15 diff --git a/lib/NfcoreSchema.groovy b/lib/NfcoreSchema.groovy index 42e32dc1c..54935ec81 100644 --- a/lib/NfcoreSchema.groovy +++ b/lib/NfcoreSchema.groovy @@ -18,13 +18,12 @@ class NfcoreSchema { * whether the given paremeters adhere to the specificiations */ /* groovylint-disable-next-line UnusedPrivateMethodParameter */ - private static ArrayList validateParameters(params, jsonSchema, log) { + private static void validateParameters(params, jsonSchema, log) { def has_error = false //=====================================================================// // Check for nextflow core params and unexpected params def json = new File(jsonSchema).text def Map schemaParams = (Map) new JsonSlurper().parseText(json).get('definitions') - def specifiedParamKeys = params.keySet() def nf_params = [ // Options for base `nextflow` command 'bg', @@ -105,7 +104,7 @@ class NfcoreSchema { } } - for (specifiedParam in specifiedParamKeys) { + for (specifiedParam in params.keySet()) { // nextflow params if (nf_params.contains(specifiedParam)) { log.error "ERROR: You used a core Nextflow option with two hyphens: '--${specifiedParam}'. 
Please resubmit with '-${specifiedParam}'"
@@ -122,6 +121,10 @@
         // Validate parameters against the schema
         InputStream inputStream = new File(jsonSchema).newInputStream()
         JSONObject rawSchema = new JSONObject(new JSONTokener(inputStream))
+
+        // Remove anything that's in params.schema_ignore_params
+        rawSchema = removeIgnoredParams(rawSchema, params)
+
         Schema schema = SchemaLoader.load(rawSchema)
 
         // Clean the parameters
@@ -144,26 +147,21 @@
         }
 
         // Check for unexpected parameters
-        // Getting this message a lot for parameters that you *do* expect?
-        // You can make a csv list of expected params not in the schema with 'params.schema_ignore_params'
-        // for example, in your institutional config
         if (unexpectedParams.size() > 0) {
             Map colors = log_colours(params.monochrome_logs)
             println ''
             def warn_msg = 'Found unexpected parameters:'
             for (unexpectedParam in unexpectedParams) {
-                warn_msg = warn_msg + "\n* --${unexpectedParam}: ${paramsJSON[unexpectedParam].toString()}"
+                warn_msg = warn_msg + "\n* --${unexpectedParam}: ${params[unexpectedParam].toString()}"
             }
             log.warn warn_msg
-            log.info "- ${colors.dim}(Hide this message with 'params.schema_ignore_params')${colors.reset} -"
+            log.info "- ${colors.dim}Ignore this warning: params.schema_ignore_params = \"${unexpectedParams.join(',')}\" ${colors.reset}"
             println ''
         }
 
         if (has_error) {
             System.exit(1)
         }
-
-        return unexpectedParams
     }
 
     // Loop over nested exceptions and print the causingException
@@ -191,6 +189,47 @@
         }
     }
 
+    // Remove an element from a JSONArray
+    private static JSONArray removeElement(jsonArray, element){
+        def list = []
+        int len = jsonArray.length()
+        for (int i=0;i<len;i++){
+            list.add(jsonArray.get(i).toString())
+        }
+        list.remove(element)
+        JSONArray jsArray = new JSONArray(list)
+        return jsArray
+    }
+
+    private static JSONObject removeIgnoredParams(rawSchema, params){
+        // Remove anything that's in params.schema_ignore_params
+        params.schema_ignore_params.split(',').each{ ignore_param ->
+            if(rawSchema.keySet().contains('definitions')){
+                rawSchema.definitions.each { definition ->
+                    for (key in definition.keySet()){
+                        if (definition[key].get("properties").keySet().contains(ignore_param)){
+                            // Remove the param to ignore
definition[key].get("properties").remove(ignore_param)
+                            // If the param was required, change this
+                            if (definition[key].has("required")) {
+                                def cleaned_required = removeElement(definition[key].required, ignore_param)
+                                definition[key].put("required", cleaned_required)
+                            }
+                        }
+                    }
+                }
+            }
+            if(rawSchema.keySet().contains('properties') && rawSchema.get('properties').keySet().contains(ignore_param)) {
+                rawSchema.get("properties").remove(ignore_param)
+            }
+            if(rawSchema.keySet().contains('required') && rawSchema.required.contains(ignore_param)) {
+                def cleaned_required = removeElement(rawSchema.required, ignore_param)
+                rawSchema.put("required", cleaned_required)
+            }
+        }
+        return rawSchema
+    }
+
     private static Map cleanParameters(params) {
         def new_params = params.getClass().newInstance(params)
         for (p in params) {
@@ -310,7 +349,8 @@
     private static LinkedHashMap params_read(String json_schema) throws Exception {
         def json = new File(json_schema).text
         def Map schema_definitions = (Map) new JsonSlurper().parseText(json).get('definitions')
-        def Map schema_properties = (Map) new JsonSlurper().parseText(json).get('properties') /* Tree looks like this in nf-core schema
+        def Map schema_properties = (Map) new JsonSlurper().parseText(json).get('properties')
+        /* Tree looks like this in nf-core schema
         * definitions <- this is what the first get('definitions') gets us
                 group 1
                     title
@@ -329,7 +369,13 @@
                     parameter 1
                         type
                         description
+        * properties <- parameters can also be ungrouped, outside of definitions
+                parameter 1
+                    type
+                    description
         */
+
+        // Grouped params
         def params_map = new LinkedHashMap()
         schema_definitions.each { key, val ->
             def Map group = schema_definitions."$key".properties // Gets the property object of the group
@@ -522,28 +568,4 @@
         return output
     }
 
-    static String params_summary_multiqc(workflow, summary) {
-        String summary_section = ''
-        for (group in summary.keySet()) {
-            def group_params =
summary.get(group) // This gets the parameters of that particular group
-            if (group_params) {
-                summary_section += "    <p style=\"font-size:110%\"><b>$group</b></p>\n"
-                summary_section += "    <dl class=\"dl-horizontal\">\n"
-                for (param in group_params.keySet()) {
-                    summary_section += "        <dt>$param</dt><dd><samp>${group_params.get(param) ?: '<span style=\"color:#999999;\">N/A</span>'}</samp></dd>\n"
-                }
-                summary_section += "    </dl>\n"
-            }
-        }
-
-        String yaml_file_text = "id: '${workflow.manifest.name.replace('/','-')}-summary'\n"
-        yaml_file_text += "description: ' - this information is collected when the pipeline is started.'\n"
-        yaml_file_text += "section_name: '${workflow.manifest.name} Workflow Summary'\n"
-        yaml_file_text += "section_href: 'https://github.com/${workflow.manifest.name}'\n"
-        yaml_file_text += "plot_type: 'html'\n"
-        yaml_file_text += "data: |\n"
-        yaml_file_text += "${summary_section}"
-        return yaml_file_text
-    }
-
 }
diff --git a/main.nf b/main.nf
index f3ad4b2a0..6033028c6 100644
--- a/main.nf
+++ b/main.nf
@@ -11,128 +11,23 @@
 ------------------------------------------------------------------------------------------------------------
 */
 
+log.info Headers.nf_core(workflow, params.monochrome_logs)
 
-// Show help message
-params.help = false
+////////////////////////////////////////////////////
+/* --                PRINT HELP                 -- */
+////////////////////////////////////////////////////
+
 def json_schema = "$projectDir/nextflow_schema.json"
 if (params.help) {
-    def command = "nextflow run nf-core/eager -profile <docker/singularity/conda> --reads'*_R{1,2}.fastq.gz' --fasta '<your_reference>.fasta'"
+    def command = "nextflow run nf-core/eager --input '*_R{1,2}.fastq.gz' -profile docker"
     log.info NfcoreSchema.params_help(workflow, params, json_schema, command)
     exit 0
 }
 
 ////////////////////////////////////////////////////
 /* --          VALIDATE PARAMETERS              -- */
-////////////////////////////////////////////////////
-
-def unexpectedParams = []
+////////////////////////////////////////////////////
+
 if (params.validate_params) {
-    unexpectedParams = NfcoreSchema.validateParameters(params, json_schema, log)
-}
-
-// Info required for completion email and summary
-def multiqc_report = []
-
-// Small console separator to make it easier to read errors after launch
-println ""
-
-
-
-////////////////////////////////////////////////////
-/* --            VALIDATE INPUTS                -- */
-////////////////////////////////////////////////////
-
-/**FASTA input handling
-**/
-
-if (params.fasta) { - file(params.fasta, checkIfExists: true) - lastPath = params.fasta.lastIndexOf(File.separator) - lastExt = params.fasta.lastIndexOf(".") - fasta_base = params.fasta.substring(lastPath+1) - index_base = params.fasta.substring(lastPath+1,lastExt) - if (params.fasta.endsWith('.gz')) { - fasta_base = params.fasta.substring(lastPath+1,lastExt) - index_base = fasta_base.substring(0,fasta_base.lastIndexOf(".")) - - } -} else { - exit 1, "[nf-core/eager] error: please specify --fasta with the path to your reference" -} - -// Validate reference inputs -if("${params.fasta}".endsWith(".gz")){ - process unzip_reference{ - tag "${zipped_fasta}" - - input: - path zipped_fasta from file(params.fasta) // path doesn't like it if a string of an object is not prefaced with a root dir (/), so use file() to resolve string before parsing to `path` - - output: - path "$unzip" into ch_fasta into ch_fasta_for_bwaindex,ch_fasta_for_bt2index,ch_fasta_for_faidx,ch_fasta_for_seqdict,ch_fasta_for_circulargenerator,ch_fasta_for_circularmapper,ch_fasta_for_damageprofiler,ch_fasta_for_qualimap,ch_fasta_for_pmdtools,ch_fasta_for_genotyping_ug,ch_fasta_for_genotyping_hc,ch_fasta_for_genotyping_freebayes,ch_fasta_for_genotyping_pileupcaller,ch_fasta_for_vcf2genome,ch_fasta_for_multivcfanalyzer,ch_fasta_for_genotyping_angsd,ch_fasta_for_damagerescaling - - script: - unzip = zipped_fasta.toString() - '.gz' - """ - pigz -f -d -p ${task.cpus} $zipped_fasta - """ - } - } else { - fasta_for_indexing = Channel - .fromPath("${params.fasta}", checkIfExists: true) - .into{ ch_fasta_for_bwaindex; ch_fasta_for_bt2index; ch_fasta_for_faidx; ch_fasta_for_seqdict; ch_fasta_for_circulargenerator; ch_fasta_for_circularmapper; ch_fasta_for_damageprofiler; ch_fasta_for_qualimap; ch_fasta_for_pmdtools; ch_fasta_for_genotyping_ug; ch_fasta__for_genotyping_hc; ch_fasta_for_genotyping_hc; ch_fasta_for_genotyping_freebayes; ch_fasta_for_genotyping_pileupcaller; ch_fasta_for_vcf2genome; 
ch_fasta_for_multivcfanalyzer;ch_fasta_for_genotyping_angsd;ch_fasta_for_damagerescaling } -} - -// Check that fasta index file path ends in '.fai' -if (params.fasta_index && !params.fasta_index.endsWith(".fai")) { - exit 1, "The specified fasta index file (${params.fasta_index}) is not valid. Fasta index files should end in '.fai'." -} - -// Check if genome exists in the config file. params.genomes is from igenomes.conf, params.genome specified by user -if ( params.genome && !params.genomes.containsKey(params.genome)) { - exit 1, "[nf-core/eager] error: the provided genome '${params.genome}' is not available in the iGenomes file. Currently the available genomes are ${params.genomes.keySet().join(", ")}." -} - -// Mapper validation -if (params.mapper != 'bwaaln' && !params.mapper == 'circularmapper' && !params.mapper == 'bwamem' && !params.mapper == "bowtie2"){ - exit 1, "[nf-core/eager] error: invalid mapper option. Options are: 'bwaaln', 'bwamem', 'circularmapper', 'bowtie2'. Default: 'bwaaln'. Found parameter: --mapper '${params.mapper}'." -} - -if (params.mapper == 'bowtie2' && params.bt2_alignmode != 'local' && params.bt2_alignmode != 'end-to-end' ) { - exit 1, "[nf-core/eager] error: invalid bowtie2 alignment mode. Options: 'local', 'end-to-end'. Found parameter: --bt2_alignmode '${params.bt2_alignmode}'" -} - -if (params.mapper == 'bowtie2' && params.bt2_sensitivity != 'no-preset' && params.bt2_sensitivity != 'very-fast' && params.bt2_sensitivity != 'fast' && params.bt2_sensitivity != 'sensitive' && params.bt2_sensitivity != 'very-sensitive' ) { - exit 1, "[nf-core/eager] error: invalid bowtie2 sensitivity mode. Options: 'no-preset', 'very-fast', 'fast', 'sensitive', 'very-sensitive'. Options are for both alignmodes Found parameter: --bt2_sensitivity '${params.bt2_sensitivity}'." -} - -if (params.bt2n != 0 && params.bt2n != 1) { - exit 1, "[nf-core/eager] error: invalid bowtie2 --bt2n (-N) parameter. Options: 0, 1. Found parameter: --bt2n ${params.bt2n}." 
- -} - -// Index files provided? Then check whether they are correct and complete -if( params.bwa_index != '' && (params.mapper == 'bwaaln' | params.mapper == 'bwamem' | params.mapper == 'circularmapper')){ - Channel - .fromPath(params.bwa_index, checkIfExists: true) - .ifEmpty { exit 1, "[nf-core/eager] error: bwa indices not found in: ${index_base}." } - .into {bwa_index; bwa_index_bwamem} - - bt2_index = Channel.empty() -} - -if( params.bt2_index != '' && params.mapper == 'bowtie2' ){ - lastPath = params.bt2_index.lastIndexOf(File.separator) - bt2_dir = params.bt2_index.substring(0,lastPath+1) - bt2_base = params.bt2_index.substring(lastPath+1) - - Channel - .fromPath(params.bt2_index, checkIfExists: true) - .ifEmpty { exit 1, "[nf-core/eager] error: bowtie2 indices not found in: ${bt2_dir}." } - .into {bt2_index; bt2_index_bwamem} - - bwa_index = Channel.empty() - bwa_index_bwamem = Channel.empty() + NfcoreSchema.validateParameters(params, json_schema, log) } // Validate BAM input isn't set to paired_end @@ -150,122 +45,30 @@ if ( params.skip_collapse && params.skip_trim ) { exit 1, "[nf-core/eager error]: you have specified to skip both merging and trimming of paired end samples. Use --skip_adapterremoval instead." } -// Host removal mode validation -if (params.hostremoval_input_fastq){ - if (!(['remove','replace'].contains(params.hostremoval_mode))) { - exit 1, "[nf-core/eager] error: --hostremoval_mode can only be set to 'remove' or 'replace'." - } -} - -if (params.bam_unmapped_type == '') { - exit 1, "[nf-core/eager] error: please specify valid unmapped read output format. Options: 'discard', 'keep', 'bam', 'fastq', 'both'. Found parameter: --bam_unmapped_type '${params.bam_unmapped_type}'." -} - // Bedtools validation if(params.run_bedtools_coverage && params.anno_file == ''){ exit 1, "[nf-core/eager] error: you have turned on bedtools coverage, but not specified a BED or GFF file with --anno_file. Please validate your parameters." 
} -// Set up channels for annotation file -if (!params.run_bedtools_coverage){ - ch_anno_for_bedtools = Channel.empty() -} else { - ch_anno_for_bedtools = Channel.fromPath(params.anno_file, checkIfExists: true) - .ifEmpty { exit 1, "[nf-core/eager] error: bedtools annotation file not found. Supplied parameter: --anno_file ${params.anno_file}."} -} - // BAM filtering validation if (!params.run_bam_filtering && params.bam_mapping_quality_threshold != 0) { exit 1, "[nf-core/eager] error: please turn on BAM filtering if you want to perform mapping quality filtering! Provide: --run_bam_filtering." } -if (params.run_bam_filtering && params.bam_unmapped_type != 'discard' && params.bam_unmapped_type != 'keep' && params.bam_unmapped_type != 'bam' && params.bam_unmapped_type != 'fastq' && params.bam_unmapped_type != 'both' ) { - exit 1, "[nf-core/eager] error: please specify how to deal with unmapped reads. Options: 'discard', 'keep', 'bam', 'fastq', 'both'." -} - -// Deduplication validation -if (params.dedupper != 'dedup' && params.dedupper != 'markduplicates') { - exit 1, "[nf-core/eager] error: Selected deduplication tool is not recognised. Options: 'dedup' or 'markduplicates'. Found parameter: --dedupper '${params.dedupper}'." -} - if (params.dedupper == 'dedup' && !params.mergedonly) { log.warn "[nf-core/eager] Warning: you are using DeDup but without specifying --mergedonly for AdapterRemoval, dedup will likely fail! See documentation for more information." 
} -// SexDetermination channel set up and bedfile validation -if (params.sexdeterrmine_bedfile == '') { - ch_bed_for_sexdeterrmine = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt") -} else { - ch_bed_for_sexdeterrmine = Channel.fromPath(params.sexdeterrmine_bedfile, checkIfExists: true) -} - // Genotyping validation if (params.run_genotyping){ - if (params.genotyping_source != 'raw' && params.genotyping_source != 'pmd' && params.genotyping_source != 'trimmed' && params.genotyping_source != 'rescaled' ) { - exit 1, "[nf-core/eager] error: please specify a valid genotyping source. Options: 'raw', 'pmd', 'trimmed', 'rescaled'. Found parameter: --genotyping_source '${params.genotyping_source}'." - } - - if (params.genotyping_tool != 'ug' && params.genotyping_tool != 'hc' && params.genotyping_tool != 'freebayes' && params.genotyping_tool != 'pileupcaller' && params.genotyping_tool != 'angsd' ) { - exit 1, "[nf-core/eager] error: please specify a valid genotyper. Options: 'ug', 'hc', 'freebayes', 'pileupcaller'. Found parameter: --genotyping_tool '${params.genotyping_tool}'." - } - - if (params.gatk_ug_out_mode != 'EMIT_VARIANTS_ONLY' && params.gatk_ug_out_mode != 'EMIT_ALL_CONFIDENT_SITES' && params.gatk_ug_out_mode != 'EMIT_ALL_SITES') { - exit 1, "[nf-core/eager] error: please check your GATK output mode. Options are: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_SITES'. Found parameter: --gatk_ug_out_mode '${params.gatk_out_mode}'." - } - - if (params.gatk_hc_out_mode != 'EMIT_VARIANTS_ONLY' && params.gatk_hc_out_mode != 'EMIT_ALL_CONFIDENT_SITES' && params.gatk_hc_out_mode != 'EMIT_ALL_ACTIVE_SITES') { - exit 1, "[nf-core/eager] error: please check your GATK output mode. Options are: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_SITES'. Found parameter: --gatk_out_mode '${params.gatk_out_mode}'." 
- } - - if (params.genotyping_tool == 'ug' && (params.gatk_ug_genotype_model != 'SNP' && params.gatk_ug_genotype_model != 'INDEL' && params.gatk_ug_genotype_model != 'BOTH' && params.gatk_ug_genotype_model != 'GENERALPLOIDYSNP' && params.gatk_ug_genotype_model != 'GENERALPLOIDYINDEL')) { - exit 1, "[nf-core/eager] error: please check your UnifiedGenotyper genotype model. Options: 'SNP', 'INDEL', 'BOTH', 'GENERALPLOIDYSNP', 'GENERALPLOIDYINDEL'. Found parameter: --gatk_ug_genotype_model '${params.gatk_ug_genotype_model}'." - } - - if (params.genotyping_tool == 'hc' && (params.gatk_hc_emitrefconf != 'NONE' && params.gatk_hc_emitrefconf != 'GVCF' && params.gatk_hc_emitrefconf != 'BP_RESOLUTION')) { - exit 1, "[nf-core/eager] error: please check your HaplotyperCaller reference confidence parameter. Options: 'NONE', 'GVCF', 'BP_RESOLUTION'. Found parameter: --gatk_hc_emitrefconf '${params.gatk_hc_emitrefconf}'." - } - - if (params.genotyping_tool == 'pileupcaller' && ! ( params.pileupcaller_method == 'randomHaploid' || params.pileupcaller_method == 'randomDiploid' || params.pileupcaller_method == 'majorityCall' ) ) { - exit 1, "[nf-core/eager] error: please check your pileupCaller method parameter. Options: 'randomHaploid', 'randomDiploid', 'majorityCall'. Found parameter: --pileupcaller_method '${params.pileupcaller_method}'." - } - if (params.genotyping_tool == 'pileupcaller' && ( params.pileupcaller_bedfile == '' || params.pileupcaller_snpfile == '' ) ) { exit 1, "[nf-core/eager] error: please check your pileupCaller bed file and snp file parameters. You must supply a bed file and a snp file." } - if (params.genotyping_tool == 'angsd' && ! ( params.angsd_glmodel == 'samtools' || params.angsd_glmodel == 'gatk' || params.angsd_glmodel == 'soapsnp' || params.angsd_glmodel == 'syk' ) ) { - exit 1, "[nf-core/eager] error: please check your ANGSD genotyping model! Options: 'samtools', 'gatk', 'soapsnp', 'syk'. Found parameter: --angsd_glmodel' ${params.angsd_glmodel}'." 
- } - if (params.genotyping_tool == 'angsd' && ! ( params.angsd_glformat == 'text' || params.angsd_glformat == 'binary' || params.angsd_glformat == 'binary_three' || params.angsd_glformat == 'beagle' ) ) { exit 1, "[nf-core/eager] error: please check your ANGSD output format! Options: 'text', 'binary', 'binary_three', 'beagle'. Found parameter: --angsd_glformat '${params.angsd_glformat}'." } - - if ( !params.angsd_createfasta && params.angsd_fastamethod != 'random' ) { - exit 1, "[nf-core/eager] error: to output a ANGSD FASTA file, please turn on FASTA creation with --angsd_createfasta." - } - - if ( params.angsd_createfasta && !( params.angsd_fastamethod == 'random' || params.angsd_fastamethod == 'common' ) ) { - exit 1, "[nf-core/eager] error: please check your ANGSD FASTA file creation method. Options: 'random', 'common'. Found parameter: --angsd_fastamethod '${params.angsd_fastamethod}'." - } - - if (params.genotyping_tool == 'pileupcaller' && ! ( params.pileupcaller_transitions_mode == 'AllSites' || params.pileupcaller_transitions_mode == 'TransitionsMissing' || params.pileupcaller_transitions_mode == 'SkipTransitions') ) { - exit 1, "[nf-core/eager] error: please check your pileupCaller transitions mode parameter. Options: 'AllSites', 'TransitionsMissing', 'SkipTransitions'. 
Found parameter: --pileupcaller_transitions_mode '${params.pileupcaller_transitions_mode}'" - } -} - - // pileupCaller channel generation and input checks for 'random sampling' genotyping -if (params.pileupcaller_bedfile.isEmpty()) { - ch_bed_for_pileupcaller = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt") -} else { - ch_bed_for_pileupcaller = Channel.fromPath(params.pileupcaller_bedfile, checkIfExists: true) -} - -if (params.pileupcaller_snpfile.isEmpty ()) { - ch_snp_for_pileupcaller = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy2.txt") -} else { - ch_snp_for_pileupcaller = Channel.fromPath(params.pileupcaller_snpfile, checkIfExists: true) } // Consensus sequence generation validation @@ -298,8 +101,6 @@ if (params.run_multivcfanalyzer) { } } -// Metagenomic validation - if (params.run_metagenomic_screening) { if ( params.bam_unmapped_type == "discard" ) { exit 1, "[nf-core/eager] error: metagenomic classification can only run on unmapped reads. Please supply --bam_unmapped_type 'fastq'. Supplied: --bam_unmapped_type '${params.bam_unmapped_type}'." @@ -309,22 +110,10 @@ if (params.run_metagenomic_screening) { exit 1, "[nf-core/eager] error: metagenomic classification can only run on unmapped reads in FASTQ format. Please supply --bam_unmapped_type 'fastq'. Found parameter: --bam_unmapped_type '${params.bam_unmapped_type}'." } - if (params.metagenomic_tool != 'malt' && params.metagenomic_tool != 'kraken') { - exit 1, "[nf-core/eager] error: metagenomic classification can currently only be run with 'malt' or 'kraken' (kraken2). Please check your classifier. Found parameter: --metagenomic_tool '${params.metagenomic_tool}'." - } - if (params.database == '' ) { exit 1, "[nf-core/eager] error: metagenomic classification requires a path to a database directory. Please specify one with --database '/path/to/database/'." 
} - if (params.metagenomic_tool == 'malt' && params.malt_mode != 'BlastN' && params.malt_mode != 'BlastP' && params.malt_mode != 'BlastX') { - exit 1, "[nf-core/eager] error: unknown MALT mode specified. Options: 'BlastN', 'BlastP', 'BlastX'. Found parameter: --malt_mode '${params.malt_mode}'." - } - - if (params.metagenomic_tool == 'malt' && params.malt_alignment_mode != 'Local' && params.malt_alignment_mode != 'SemiGlobal') { - exit 1, "[nf-core/eager] error: unknown MALT alignment mode specified. Options: 'Local', 'SemiGlobal'. Found parameter: --malt_alignment_mode '${params.malt_alignment_mode}'." - } - if (params.metagenomic_tool == 'malt' && params.malt_min_support_mode == 'percent' && params.metagenomic_min_support_reads != 1) { exit 1, "[nf-core/eager] error: incompatible MALT min support configuration. Percent can only be used with --malt_min_support_percent. You modified: --metagenomic_min_support_reads." } @@ -333,22 +122,11 @@ if (params.run_metagenomic_screening) { exit 1, "[nf-core/eager] error: incompatible MALT min support configuration. Reads can only be used with --malt_min_supportreads. You modified: --malt_min_support_percent." } - if (params.metagenomic_tool == 'malt' && params.malt_memory_mode != 'load' && params.malt_memory_mode != 'page' && params.malt_memory_mode != 'map') { - exit 1, "[nf-core/eager] error: unknown MALT memory mode specified. Options: 'load', 'page', 'map'. Found parameter: --malt_memory_mode '${params.malt_memory_mode}'." - } - if (!params.metagenomic_min_support_reads.toString().isInteger()){ exit 1, "[nf-core/eager] error: incompatible min_support_reads configuration. min_support_reads can only be used with integers. --metagenomic_min_support_reads Found parameter: ${params.metagenomic_min_support_reads}." 
} } -// Create input channel for MALT database directory, checking directory exists -if ( params.database == '') { - ch_db_for_malt = Channel.empty() -} else { - ch_db_for_malt = Channel.fromPath(params.database, checkIfExists: true) -} - // MaltExtract validation if (params.run_maltextract) { @@ -363,11 +141,117 @@ if (params.run_maltextract) { if (params.maltextract_taxon_list == '') { exit 1, "[nf-core/eager] error: MaltExtract requires a taxon list specifying the target taxa of interest. Specify the file with --params.maltextract_taxon_list." } +} - if (params.maltextract_filter != 'def_anc' && params.maltextract_filter != 'default' && params.maltextract_filter != 'ancient' && params.maltextract_filter != 'scan' && params.maltextract_filter != 'crawl' && params.maltextract_filter != 'srna') { - exit 1, "[nf-core/eager] error: unknown MaltExtract filter specified. Options are: 'def_anc', 'default', 'ancient', 'scan', 'crawl', 'srna'. Found parameter: --maltextract_filter '${params.maltextract_filter}'." - } +///////////////////////////////////////////////////////// +/* -- VALIDATE INPUT FILES -- */ +///////////////////////////////////////////////////////// + +// Set up channels for annotation file +if (!params.run_bedtools_coverage){ + ch_anno_for_bedtools = Channel.empty() +} else { + ch_anno_for_bedtools = Channel.fromPath(params.anno_file, checkIfExists: true) + .ifEmpty { exit 1, "[nf-core/eager] error: bedtools annotation file not found. 
Supplied parameter: --anno_file ${params.anno_file}."} +} + +if (params.fasta) { + file(params.fasta, checkIfExists: true) + lastPath = params.fasta.lastIndexOf(File.separator) + lastExt = params.fasta.lastIndexOf(".") + fasta_base = params.fasta.substring(lastPath+1) + index_base = params.fasta.substring(lastPath+1,lastExt) + if (params.fasta.endsWith('.gz')) { + fasta_base = params.fasta.substring(lastPath+1,lastExt) + index_base = fasta_base.substring(0,fasta_base.lastIndexOf(".")) + + } +} else { + exit 1, "[nf-core/eager] error: please specify --fasta with the path to your reference" +} + +// Validate reference inputs +if("${params.fasta}".endsWith(".gz")){ + process unzip_reference{ + tag "${zipped_fasta}" + + input: + path zipped_fasta from file(params.fasta) // path doesn't like it if a string of an object is not prefaced with a root dir (/), so use file() to resolve string before parsing to `path` + + output: + path "$unzip" into ch_fasta_for_bwaindex,ch_fasta_for_bt2index,ch_fasta_for_faidx,ch_fasta_for_seqdict,ch_fasta_for_circulargenerator,ch_fasta_for_circularmapper,ch_fasta_for_damageprofiler,ch_fasta_for_qualimap,ch_fasta_for_pmdtools,ch_fasta_for_genotyping_ug,ch_fasta_for_genotyping_hc,ch_fasta_for_genotyping_freebayes,ch_fasta_for_genotyping_pileupcaller,ch_fasta_for_vcf2genome,ch_fasta_for_multivcfanalyzer,ch_fasta_for_genotyping_angsd,ch_fasta_for_damagerescaling + + script: + unzip = zipped_fasta.toString() - '.gz' + """ + pigz -f -d -p ${task.cpus} $zipped_fasta + """ + } + } else { + fasta_for_indexing = Channel + .fromPath("${params.fasta}", checkIfExists: true) + .into{ ch_fasta_for_bwaindex; ch_fasta_for_bt2index; ch_fasta_for_faidx; ch_fasta_for_seqdict; ch_fasta_for_circulargenerator; ch_fasta_for_circularmapper; ch_fasta_for_damageprofiler; ch_fasta_for_qualimap; ch_fasta_for_pmdtools; ch_fasta_for_genotyping_ug; ch_fasta_for_genotyping_hc; ch_fasta_for_genotyping_freebayes;
ch_fasta_for_genotyping_pileupcaller; ch_fasta_for_vcf2genome; ch_fasta_for_multivcfanalyzer; ch_fasta_for_genotyping_angsd; ch_fasta_for_damagerescaling } +} + +// Check that fasta index file path ends in '.fai' +if (params.fasta_index && !params.fasta_index.endsWith(".fai")) { + exit 1, "The specified fasta index file (${params.fasta_index}) is not valid. Fasta index files should end in '.fai'." +} + +// Check if genome exists in the config file. params.genomes is from igenomes.conf, params.genome specified by user +if (params.genomes && params.genome && !params.genomes.containsKey(params.genome)) { + exit 1, "The provided genome '${params.genome}' is not available in the iGenomes file. Currently the available genomes are ${params.genomes.keySet().join(', ')}" +} + +// Index files provided? Then check whether they are correct and complete +if( params.bwa_index != '' && (params.mapper == 'bwaaln' || params.mapper == 'bwamem' || params.mapper == 'circularmapper')){ + Channel + .fromPath(params.bwa_index, checkIfExists: true) + .ifEmpty { exit 1, "[nf-core/eager] error: bwa indices not found in: ${index_base}." } + .into {bwa_index; bwa_index_bwamem} + + bt2_index = Channel.empty() +} + +if( params.bt2_index != '' && params.mapper == 'bowtie2' ){ + lastPath = params.bt2_index.lastIndexOf(File.separator) + bt2_dir = params.bt2_index.substring(0,lastPath+1) + bt2_base = params.bt2_index.substring(lastPath+1) + + Channel + .fromPath(params.bt2_index, checkIfExists: true) + .ifEmpty { exit 1, "[nf-core/eager] error: bowtie2 indices not found in: ${bt2_dir}."
} + .into {bt2_index; bt2_index_bwamem} + + bwa_index = Channel.empty() + bwa_index_bwamem = Channel.empty() +} + +// SexDetermination channel set up and bedfile validation +if (params.sexdeterrmine_bedfile == '') { + ch_bed_for_sexdeterrmine = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt") +} else { + ch_bed_for_sexdeterrmine = Channel.fromPath(params.sexdeterrmine_bedfile, checkIfExists: true) +} + +// pileupCaller channel generation and input checks for 'random sampling' genotyping +if (params.pileupcaller_bedfile.isEmpty()) { + ch_bed_for_pileupcaller = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy.txt") +} else { + ch_bed_for_pileupcaller = Channel.fromPath(params.pileupcaller_bedfile, checkIfExists: true) +} + +if (params.pileupcaller_snpfile.isEmpty()) { + ch_snp_for_pileupcaller = Channel.fromPath("$projectDir/assets/nf-core_eager_dummy2.txt") +} else { + ch_snp_for_pileupcaller = Channel.fromPath(params.pileupcaller_snpfile, checkIfExists: true) +} + +// Create input channel for MALT database directory, checking directory exists +if ( params.database == '') { + ch_db_for_malt = Channel.empty() +} else { + ch_db_for_malt = Channel.fromPath(params.database, checkIfExists: true) } // Create input channel for MaltExtract taxon list, to allow downloading of taxon list, checking file exists. @@ -384,16 +268,25 @@ if ( params.maltextract_ncbifiles == '' ) { ch_ncbifiles_for_maltextract = Channel.fromPath(params.maltextract_ncbifiles, checkIfExists: true) } +//////////////////////////////////////////////////// +/* -- Collect configuration parameters -- */ +//////////////////////////////////////////////////// -// Has the run name been specified by the user?
-// this has the bonus effect of catching both -name and --name -if (!(workflow.runName ==~ /[a-z]+_[a-z]+/)) { - custom_runName = workflow.runName +// Check if genome exists in the config file +if (params.genomes && params.genome && !params.genomes.containsKey(params.genome)) { + exit 1, "The provided genome '${params.genome}' is not available in the iGenomes file. Currently the available genomes are ${params.genomes.keySet().join(', ')}" } -//////////////////////////////////////////////////// -/* -- CONFIG FILES -- */ -//////////////////////////////////////////////////// +// Check AWS batch settings +if (workflow.profile.contains('awsbatch')) { + // AWSBatch sanity checking + if (!params.awsqueue || !params.awsregion) exit 1, 'Specify correct --awsqueue and --awsregion parameters on AWSBatch!' + // Check outdir paths to be S3 buckets if running on AWSBatch + // related: https://github.com/nextflow-io/nextflow/issues/813 + if (!params.outdir.startsWith('s3:')) exit 1, 'Outdir not on S3 - specify S3 Bucket to run on AWSBatch!' + // Prevent trace files from being stored on S3, since S3 does not support rolling files. + if (params.tracedir.startsWith('s3:')) exit 1, 'Specify a local tracedir or run without trace! S3 cannot be used for tracefiles.' +} ch_multiqc_config = file("$projectDir/assets/multiqc_config.yaml", checkIfExists: true) ch_multiqc_custom_config = params.multiqc_config ?
Channel.fromPath(params.multiqc_config, checkIfExists: true) : Channel.empty() @@ -416,6 +309,7 @@ tsv_path = null if (params.input && (has_extension(params.input, "tsv"))) tsv_path = params.input ch_input_sample = Channel.empty() + if (tsv_path) { tsv_file = file(tsv_path) @@ -490,28 +384,62 @@ ch_bam_channel // Also need to send raw files for lane merging, if we want to host removed fastq ch_fastq_channel .into { ch_input_for_skipconvertbam; ch_input_for_lanemerge_hostremovalfastq } + +//////////////////////////////////////////////////// +/* -- PRINT PARAMETER SUMMARY -- */ +//////////////////////////////////////////////////// -/////////////////////////////////////////////////// -/* -- HEADER LOG INFO -- */ -/////////////////////////////////////////////////// - -//Add header -log.info Headers.nf_core(workflow, params.monochrome_logs) - -//Add Summary Parameters -def summary_params = NfcoreSchema.params_summary_map(workflow, params, json_schema) log.info NfcoreSchema.params_summary_log(workflow, params, json_schema) -// Check that conda channels are set-up correctly -if (params.enable_conda) { - Checks.check_conda_channels(log) -} +// Header log info +def summary = [:] +if (workflow.revision) summary['Pipeline Release'] = workflow.revision +summary['Run Name'] = workflow.runName +summary['Input'] = params.input +summary['Fasta Ref'] = params.fasta +summary['Data Type'] = params.single_end ? 
'Single-End' : 'Paired-End' +summary['Max Resources'] = "$params.max_memory memory, $params.max_cpus cpus, $params.max_time time per job" +if (workflow.containerEngine) summary['Container'] = "$workflow.containerEngine - $workflow.container" +summary['Output dir'] = params.outdir +summary['Launch dir'] = workflow.launchDir +summary['Working dir'] = workflow.workDir +summary['Script dir'] = workflow.projectDir +summary['User'] = workflow.userName +if (workflow.profile.contains('awsbatch')) { + summary['AWS Region'] = params.awsregion + summary['AWS Queue'] = params.awsqueue + summary['AWS CLI'] = params.awscli +} +summary['Config Profile'] = workflow.profile +if (params.config_profile_description) summary['Config Profile Description'] = params.config_profile_description +if (params.config_profile_contact) summary['Config Profile Contact'] = params.config_profile_contact +if (params.config_profile_url) summary['Config Profile URL'] = params.config_profile_url +summary['Config Files'] = workflow.configFiles.join(', ') +if (params.email || params.email_on_fail) { + summary['E-mail Address'] = params.email + summary['E-mail on failure'] = params.email_on_fail + summary['MultiQC maxsize'] = params.max_multiqc_email_size +} + +Channel.from(summary.collect{ [it.key, it.value] }) + .map { k,v -> "
<dt>$k</dt><dd><samp>${v ?: 'N/A'}</samp></dd>
" } + .reduce { a, b -> return [a, b].join("\n ") } + .map { x -> """ + id: 'nf-core-eager-summary' + description: " - this information is collected when the pipeline is started." + section_name: 'nf-core/eager Workflow Summary' + section_href: 'https://github.com/nf-core/eager' + plot_type: 'html' + data: | +
+ $x +
+ """.stripIndent() } + .set { ch_workflow_summary } -// Check AWS batch settings -Checks.aws_batch(workflow, params) // Check the hostnames against configured profiles -Checks.hostname(workflow, params, log) +checkHostname() log.info "Schaffa, Schaffa, Genome Baua!" @@ -1298,7 +1226,7 @@ process bwamem { // CircularMapper reference preparation and mapping for circular genomes e.g. mtDNA process circulargenerator{ - label 'sc_tiny' + label 'sc_medium' tag "$prefix" publishDir "${params.outdir}/reference_genome/circularmapper_index", mode: params.publish_dir_mode, saveAs: { filename -> if (params.save_reference) filename @@ -1320,7 +1248,7 @@ process circulargenerator{ script: prefix = "${fasta.baseName}_${params.circularextension}.fasta" """ - circulargenerator -e ${params.circularextension} -i $fasta -s ${params.circulartarget} + circulargenerator -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i $fasta -s ${params.circulartarget} bwa index $prefix """ @@ -1353,7 +1281,7 @@ process circularmapper{ bwa aln -t ${task.cpus} $elongated_root $r1 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.r1.sai bwa aln -t ${task.cpus} $elongated_root $r2 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.r2.sai bwa sampe -r "@RG\\tID:ILLUMINA-${libraryid}\\tSM:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $elongated_root ${libraryid}.r1.sai ${libraryid}.r2.sai $r1 $r2 > tmp.out - realignsamfile -e ${params.circularextension} -i tmp.out -r $fasta $filter + realignsamfile -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i tmp.out -r $fasta $filter samtools sort -@ ${task.cpus} -O bam tmp_realigned.bam > ${libraryid}_"${seqtype}".mapped.bam samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size} """ @@ -1361,7 +1289,7 @@ process circularmapper{ """ bwa aln -t ${task.cpus} $elongated_root $r1 -n ${params.bwaalnn} -l ${params.bwaalnl} -k ${params.bwaalnk} -f ${libraryid}.sai 
bwa samse -r "@RG\\tID:ILLUMINA-${libraryid}\\tSM:${libraryid}\\tPL:illumina\\tPU:ILLUMINA-${libraryid}-${seqtype}" $elongated_root ${libraryid}.sai $r1 > tmp.out - realignsamfile -e ${params.circularextension} -i tmp.out -r $fasta $filter + realignsamfile -Xmx${task.memory.toGiga()}g -e ${params.circularextension} -i tmp.out -r $fasta $filter samtools sort -@ ${task.cpus} -O bam tmp_realigned.bam > "${libraryid}"_"${seqtype}".mapped.bam samtools index "${libraryid}"_"${seqtype}".mapped.bam ${size} """ @@ -1567,7 +1495,7 @@ ch_branched_for_seqtypemerge = ch_mapping_for_seqtype_merging """ samtools merge ${libraryid}_seqtypemerged.bam ${bam} ## Have to set validation as lenient because of BWA issue: "I see a read stands out the end of a chromosome and is flagged as unmapped (flag 0x4). [...]" http://bio-bwa.sourceforge.net/ - picard AddOrReplaceReadGroups I=${libraryid}_seqtypemerged.bam O=${libraryid}_seqtypemerged_rg.bam RGID=1 RGLB="${libraryid}_seqtypemerged" RGPL=illumina RGPU=4410 RGSM="${libraryid}_seqtypemerged" VALIDATION_STRINGENCY=LENIENT + picard -Xmx${task.memory.toGiga()}g AddOrReplaceReadGroups I=${libraryid}_seqtypemerged.bam O=${libraryid}_seqtypemerged_rg.bam RGID=1 RGLB="${libraryid}_seqtypemerged" RGPL=illumina RGPU=4410 RGSM="${libraryid}_seqtypemerged" VALIDATION_STRINGENCY=LENIENT samtools index ${libraryid}_seqtypemerged_rg.bam ${size} """ @@ -1938,7 +1866,7 @@ process library_merge { """ samtools merge ${samplename}_libmerged_rmdup.bam ${bam} ## Have to set validation as lenient because of BWA issue: "I see a read stands out the end of a chromosome and is flagged as unmapped (flag 0x4). 
[...]" http://bio-bwa.sourceforge.net/ - picard AddOrReplaceReadGroups I=${samplename}_libmerged_rmdup.bam O=${samplename}_libmerged_rg_rmdup.bam RGID=1 RGLB="${samplename}_merged" RGPL=illumina RGPU=4410 RGSM="${samplename}_merged" VALIDATION_STRINGENCY=LENIENT + picard -Xmx${task.memory.toGiga()}g AddOrReplaceReadGroups I=${samplename}_libmerged_rmdup.bam O=${samplename}_libmerged_rg_rmdup.bam RGID=1 RGLB="${samplename}_merged" RGPL=illumina RGPU=4410 RGSM="${samplename}_merged" VALIDATION_STRINGENCY=LENIENT samtools index ${samplename}_libmerged_rg_rmdup.bam ${size} """ } @@ -2081,8 +2009,8 @@ process mapdamage_rescaling { def singlestranded = strandedness == "single" ? '--single-stranded' : '' def size = params.large_ref ? '-c' : '' """ - mapDamage -i ${bam} -r ${fasta} --rescale --rescale-out ${bam}_rescaled.bam --rescale-length-5p ${params.rescale_length_5p} --rescale-length-3p=${params.rescale_length_3p} ${singlestranded} - samtools index ${bam}_rescaled.bam ${size} + mapDamage -i ${bam} -r ${fasta} --rescale --rescale-out ${base}_rescaled.bam --rescale-length-5p ${params.rescale_length_5p} --rescale-length-3p=${params.rescale_length_3p} ${singlestranded} + samtools index ${base}_rescaled.bam ${size} """ } @@ -2114,14 +2042,15 @@ process pmdtools { snpcap = '' } def size = params.large_ref ? '-c' : '' + def platypus = params.pmdtools_platypus ? '--platypus' : '' """ #Run Filtering step - samtools calmd -b $bam $fasta | samtools view -h - | pmdtools --threshold ${params.pmdtools_threshold} $treatment $snpcap --header | samtools view -@ ${task.cpus} -Sb - > "${libraryid}".pmd.bam + samtools calmd -b ${bam} ${fasta} | samtools view -h - | pmdtools --threshold ${params.pmdtools_threshold} ${treatment} ${snpcap} --header | samtools view -@ ${task.cpus} -Sb - > "${libraryid}".pmd.bam #Run Calc Range step ## To allow early shut off of pipe: https://github.com/nextflow-io/nextflow/issues/1564 trap 'if [[ \$? 
== 141 ]]; then echo "Shutting samtools early due to -n parameter" && samtools index ${libraryid}.pmd.bam ${size}; exit 0; fi' EXIT - samtools calmd -b $bam $fasta | samtools view -h - | pmdtools --deamination --range ${params.pmdtools_range} $treatment $snpcap -n ${params.pmdtools_max_reads} > "${libraryid}".cpg.range."${params.pmdtools_range}".txt + samtools calmd -b ${bam} ${fasta} | samtools view -h - | pmdtools --deamination ${platypus} --range ${params.pmdtools_range} ${treatment} ${snpcap} -n ${params.pmdtools_max_reads} > "${libraryid}".cpg.range."${params.pmdtools_range}".txt echo "Running indexing" samtools index ${libraryid}.pmd.bam ${size} @@ -2219,7 +2148,7 @@ process additional_library_merge { def size = params.large_ref ? '-c' : '' """ samtools merge ${samplename}_libmerged_add.bam ${bam} - picard AddOrReplaceReadGroups I=${samplename}_libmerged_add.bam O=${samplename}_libmerged_rg_add.bam RGID=1 RGLB="${samplename}_additionalmerged" RGPL=illumina RGPU=4410 RGSM="${samplename}_additionalmerged" VALIDATION_STRINGENCY=LENIENT + picard -Xmx${task.memory.toGiga()}g AddOrReplaceReadGroups I=${samplename}_libmerged_add.bam O=${samplename}_libmerged_rg_add.bam RGID=1 RGLB="${samplename}_additionalmerged" RGPL=illumina RGPU=4410 RGSM="${samplename}_additionalmerged" VALIDATION_STRINGENCY=LENIENT samtools index ${samplename}_libmerged_rg_add.bam ${size} """ } @@ -2557,7 +2486,7 @@ process vcf2genome { def fasta_head = "${params.vcf2genome_header}" == '' ? 
"${samplename}" : "${params.vcf2genome_header}" """ pigz -f -d -p ${task.cpus} *.vcf.gz - vcf2genome -draft ${out}.fasta -draftname "${fasta_head}" -in ${vcf.baseName} -minc ${params.vcf2genome_minc} -minfreq ${params.vcf2genome_minfreq} -minq ${params.vcf2genome_minq} -ref ${fasta} -refMod ${out}_refmod.fasta -uncertain ${out}_uncertainy.fasta + vcf2genome -Xmx${task.memory.toGiga()}g -draft ${out}.fasta -draftname "${fasta_head}" -in ${vcf.baseName} -minc ${params.vcf2genome_minc} -minfreq ${params.vcf2genome_minfreq} -minq ${params.vcf2genome_minq} -ref ${fasta} -refMod ${out}_refmod.fasta -uncertain ${out}_uncertainy.fasta pigz -p ${task.cpus} *.fasta pigz -p ${task.cpus} *.vcf """ @@ -2566,10 +2495,10 @@ process vcf2genome { // More complex consensus caller with additional filtering functionality (e.g. for heterozygous calls) to generate SNP tables and other things sometimes used in aDNA bacteria studies // Create input channel for MultiVCFAnalyzer, possibly mixing with pre-made VCFs. 
-if (params.additional_vcf_files == '') { - ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it[7] }.collect() +if (!params.additional_vcf_files) { + ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it[-1] }.collect() } else { - ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it [7] }.collect().mix(ch_extravcfs_for_multivcfanalyzer) + ch_vcfs_for_multivcfanalyzer = ch_ug_for_multivcfanalyzer.map{ it[-1] }.collect().mix(ch_extravcfs_for_multivcfanalyzer) } process multivcfanalyzer { @@ -2577,11 +2506,11 @@ process multivcfanalyzer { publishDir "${params.outdir}/multivcfanalyzer", mode: params.publish_dir_mode when: - params.genotyping_tool == 'ug' && params.run_multivcfanalyzer && params.gatk_ploidy == '2' + params.genotyping_tool == 'ug' && params.run_multivcfanalyzer && params.gatk_ploidy.toString() == '2' input: - file vcf from ch_vcfs_for_multivcfanalyzer.collect() - file fasta from ch_fasta_for_multivcfanalyzer.collect() + file vcf from ch_vcfs_for_multivcfanalyzer + file fasta from ch_fasta_for_multivcfanalyzer output: file('fullAlignment.fasta.gz') @@ -2600,7 +2529,7 @@ process multivcfanalyzer { def write_freqs = params.write_allele_frequencies ? "T" : "F" """ gunzip -f *.vcf.gz - multivcfanalyzer ${params.snp_eff_results} ${fasta} ${params.reference_gff_annotations} . ${write_freqs} ${params.min_genotype_quality} ${params.min_base_coverage} ${params.min_allele_freq_hom} ${params.min_allele_freq_het} ${params.reference_gff_exclude} *.vcf + multivcfanalyzer -Xmx${task.memory.toGiga()}g ${params.snp_eff_results} ${fasta} ${params.reference_gff_annotations} .
${write_freqs} ${params.min_genotype_quality} ${params.min_base_coverage} ${params.min_allele_freq_hom} ${params.min_allele_freq_het} ${params.reference_gff_exclude} *.vcf pigz -p ${task.cpus} *.tsv *.txt snpAlignment.fasta snpAlignmentIncludingRefGenome.fasta fullAlignment.fasta """ } @@ -2627,7 +2556,7 @@ process multivcfanalyzer { script: """ - mtnucratio ${bam} "${params.mtnucratio_header}" + mtnucratio -Xmx${task.memory.toGiga()}g ${bam} "${params.mtnucratio_header}" """ } @@ -2986,7 +2915,9 @@ process output_documentation { """ } -// Collect all software versions for inclusion in MultiQC report +/* + * Parse software version numbers + */ process get_software_versions { label 'sc_tiny' @@ -3043,8 +2974,9 @@ process get_software_versions { } // MultiQC file generation for pipeline report -def workflow_summary = NfcoreSchema.params_summary_multiqc(workflow, summary_params) -ch_workflow_summary = Channel.value(workflow_summary) +//def workflow_summary = NfcoreSchema.params_summary_multiqc(workflow, summary_params) + +//ch_workflow_summary = Channel.value(workflow_summary) process multiqc { label 'sc_medium' @@ -3101,17 +3033,126 @@ process multiqc { // Send completion emails if requested, so user knows data is ready workflow.onComplete { - Completion.email(workflow, params, summary_params, projectDir, log, multiqc_report) - Completion.summary(workflow, params, log, fail_percent_mapped, pass_percent_mapped) + + // Set up the e-mail variables + def subject = "[nf-core/eager] Successful: $workflow.runName" + if (!workflow.success) { + subject = "[nf-core/eager] FAILED: $workflow.runName" + } + def email_fields = [:] + email_fields['version'] = workflow.manifest.version + email_fields['runName'] = workflow.runName + email_fields['success'] = workflow.success + email_fields['dateComplete'] = workflow.complete + email_fields['duration'] = workflow.duration + email_fields['exitStatus'] = workflow.exitStatus + email_fields['errorMessage'] = (workflow.errorMessage ?: 
'None') + email_fields['errorReport'] = (workflow.errorReport ?: 'None') + email_fields['commandLine'] = workflow.commandLine + email_fields['projectDir'] = workflow.projectDir + email_fields['summary'] = summary + email_fields['summary']['Date Started'] = workflow.start + email_fields['summary']['Date Completed'] = workflow.complete + email_fields['summary']['Pipeline script file path'] = workflow.scriptFile + email_fields['summary']['Pipeline script hash ID'] = workflow.scriptId + if (workflow.repository) email_fields['summary']['Pipeline repository Git URL'] = workflow.repository + if (workflow.commitId) email_fields['summary']['Pipeline repository Git Commit'] = workflow.commitId + if (workflow.revision) email_fields['summary']['Pipeline Git branch/tag'] = workflow.revision + email_fields['summary']['Nextflow Version'] = workflow.nextflow.version + email_fields['summary']['Nextflow Build'] = workflow.nextflow.build + email_fields['summary']['Nextflow Compile Timestamp'] = workflow.nextflow.timestamp + + // On success, try to attach the MultiQC report + def mqc_report = null + try { + if (workflow.success) { + mqc_report = ch_multiqc_report.getVal() + if (mqc_report.getClass() == ArrayList) { + log.warn "[nf-core/eager] Found multiple reports from process 'multiqc', will use only one" + mqc_report = mqc_report[0] + } + } + } catch (all) { + log.warn "[nf-core/eager] Could not attach MultiQC report to summary email" + } + + // Check if we are only sending emails on failure + email_address = params.email + if (!params.email && params.email_on_fail && !workflow.success) { + email_address = params.email_on_fail + } + + // Render the TXT template + def engine = new groovy.text.GStringTemplateEngine() + def tf = new File("$projectDir/assets/email_template.txt") + def txt_template = engine.createTemplate(tf).make(email_fields) + def email_txt = txt_template.toString() + + // Render the HTML template + def hf = new File("$projectDir/assets/email_template.html") + def
html_template = engine.createTemplate(hf).make(email_fields) + def email_html = html_template.toString() + + // Render the sendmail template + def smail_fields = [ email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, projectDir: "$projectDir", mqcFile: mqc_report, mqcMaxSize: params.max_multiqc_email_size.toBytes() ] + def sf = new File("$projectDir/assets/sendmail_template.txt") + def sendmail_template = engine.createTemplate(sf).make(smail_fields) + def sendmail_html = sendmail_template.toString() + + // Send the HTML e-mail + if (email_address) { + try { + if (params.plaintext_email) { throw new GroovyException('Send plaintext e-mail, not HTML') } + // Try to send HTML e-mail using sendmail + [ 'sendmail', '-t' ].execute() << sendmail_html + log.info "[nf-core/eager] Sent summary e-mail to $email_address (sendmail)" + } catch (all) { + // Catch failures and try with plaintext + def mail_cmd = [ 'mail', '-s', subject, '--content-type=text/html', email_address ] + if ( mqc_report.size() <= params.max_multiqc_email_size.toBytes() ) { + mail_cmd += [ '-A', mqc_report ] + } + mail_cmd.execute() << email_html + log.info "[nf-core/eager] Sent summary e-mail to $email_address (mail)" + } + } + + // Write summary e-mail HTML to a file + def output_d = new File("${params.outdir}/pipeline_info/") + if (!output_d.exists()) { + output_d.mkdirs() + } + def output_hf = new File(output_d, "pipeline_report.html") + output_hf.withWriter { w -> w << email_html } + def output_tf = new File(output_d, "pipeline_report.txt") + output_tf.withWriter { w -> w << email_txt } + + c_green = params.monochrome_logs ? '' : "\033[0;32m"; + c_purple = params.monochrome_logs ? '' : "\033[0;35m"; + c_red = params.monochrome_logs ? '' : "\033[0;31m"; + c_reset = params.monochrome_logs ?
'' : "\033[0m"; + + if (workflow.stats.ignoredCount > 0 && workflow.success) { + log.info "-${c_purple}Warning, pipeline completed, but with errored process(es) ${c_reset}-" + log.info "-${c_red}Number of ignored errored process(es) : ${workflow.stats.ignoredCount} ${c_reset}-" + log.info "-${c_green}Number of successfully run process(es) : ${workflow.stats.succeedCount} ${c_reset}-" + } + + if (workflow.success) { + log.info "-${c_purple}[nf-core/eager]${c_green} Pipeline completed successfully${c_reset}-" + } else { + checkHostname() + log.info "-${c_purple}[nf-core/eager]${c_red} Pipeline completed with errors${c_reset}-" + } + } workflow.onError { - // Print unexpected parameters - easiest is to just rerun validation + NfcoreSchema.validateParameters(params, json_schema, log) } + ///////////////////////////////////// /* -- AUXILARY FUNCTIONS -- */ ///////////////////////////////////// @@ -3279,3 +3320,24 @@ ch_reads_for_faketsv def validate_size(collection, size){ if ( collection.size() != size ) { return false } else { return true } } + +def checkHostname() { + def c_reset = params.monochrome_logs ? '' : "\033[0m" + def c_white = params.monochrome_logs ? '' : "\033[0;37m" + def c_red = params.monochrome_logs ? '' : "\033[1;91m" + def c_yellow_bold = params.monochrome_logs ?
'' : "\033[1;93m" + if (params.hostnames) { + def hostname = 'hostname'.execute().text.trim() + params.hostnames.each { prof, hnames -> + hnames.each { hname -> + if (hostname.contains(hname) && !workflow.profile.contains(prof)) { + log.error '============================================================\n' + + " ${c_red}WARNING!${c_reset} You are running with `-profile $workflow.profile`\n" + + " but your machine hostname is ${c_white}'$hostname'${c_reset}\n" + + " ${c_yellow_bold}It's highly recommended that you use `-profile $prof${c_reset}`\n" + + '============================================================' + } + } + } + } +} diff --git a/nextflow.config b/nextflow.config index 5a87732e0..2ac079dad 100644 --- a/nextflow.config +++ b/nextflow.config @@ -9,6 +9,8 @@ params { // Workflow flags genome = false + input = null + input_paths = null single_end = false outdir = './results' publish_dir_mode = 'copy' @@ -22,11 +24,10 @@ params { //Pipeline options enable_conda = false validate_params = true - schema_ignore_params = 'genomes' + schema_ignore_params = 'genome' show_hidden_params = false //Input reads - input = null udg_type = 'none' single_stranded = false single_end = false @@ -45,6 +46,10 @@ params { seq_dict = '' large_ref = false save_reference = false + + // Set to false by default just to suppress the iGenomes WARN; should be overwritten by the optional config load below.
+ genomes = false + //Skipping parts of the pipeline for impatient users skip_fastqc = false @@ -113,6 +118,7 @@ params { pmdtools_threshold = 3 pmdtools_reference_mask = '' pmdtools_max_reads = 10000 + pmdtools_platypus = false // mapDamage run_mapdamage_rescaling = false @@ -244,6 +250,9 @@ params { config_profile_description = false config_profile_contact = false config_profile_url = false + validate_params = true + show_hidden_params = false + schema_ignore_params = 'genomes,input_paths' // Defaults only, expecting to be overwritten max_memory = 128.GB @@ -254,7 +263,7 @@ params { // Container slug. Stable releases should specify release tag! // Developmental code should specify :dev -process.container = 'nfcore/eager:2.3.2' +process.container = 'nfcore/eager:2.3.3' // Load base.config by default for all pipelines includeConfig 'conf/base.config' @@ -274,13 +283,21 @@ try { } profiles { - conda { + conda { + docker.enabled = false + singularity.enabled = false + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = false process.conda = "$projectDir/environment.yml" - params.enable_conda = true } debug { process.beforeScript = 'echo $HOSTNAME' } docker { docker.enabled = true + singularity.enabled = false + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = false // Avoid this error: // WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
// Testing this in nf-core after discussion here https://github.com/nf-core/tools/pull/351 @@ -288,11 +305,33 @@ profiles { docker.runOptions = '-u \$(id -u):\$(id -g)' } singularity { + docker.enabled = false singularity.enabled = true + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = false singularity.autoMounts = true } podman { + singularity.enabled = false + docker.enabled = false podman.enabled = true + shifter.enabled = false + charliecloud.enabled = false + } + shifter { + singularity.enabled = false + docker.enabled = false + podman.enabled = false + shifter.enabled = true + charliecloud.enabled = false + } + charliecloud { + singularity.enabled = false + docker.enabled = false + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = true } test { includeConfig 'conf/test.config' } test_full { includeConfig 'conf/test_full.config' } @@ -312,6 +351,8 @@ profiles { benchmarking_human { includeConfig 'conf/benchmarking_human.config' } benchmarking_vikingfish { includeConfig 'conf/benchmarking_vikingfish.config' } } + + // Load igenomes.config if required if (!params.igenomes_ignore) { includeConfig 'conf/igenomes.config' @@ -351,7 +392,7 @@ manifest { description = 'A fully reproducible and state-of-the-art ancient DNA analysis pipeline' mainScript = 'main.nf' nextflowVersion = '!>=20.07.1' - version = '2.3.2' + version = '2.3.3' } // Function to ensure that resource requirements don't go beyond diff --git a/nextflow_schema.json b/nextflow_schema.json index 292a5fdd7..0e7a9e623 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -195,6 +195,13 @@ "hidden": true, "fa_icon": "fas fa-question-circle" }, + "validate_params": { + "type": "boolean", + "description": "Boolean whether to validate parameters against the schema at runtime", + "default": true, + "fa_icon": "fas fa-check-square", + "hidden": true + }, "email": { "type": "string", "description": "Email address for completion summary.", @@ -257,25 +264,12 @@
"hidden": true, "description": "Parameter used for checking conda channels to be set correctly." }, - "validate_params": { - "type": "boolean", - "default": "true", - "description": "Boolean whether to validate parameters against the schema at runtime", - "fa_icon": "fab fa-angellist", - "hidden": true - }, "schema_ignore_params": { "type": "string", "fa_icon": "fas fa-not-equal", "description": "String to specify ignored parameters for parameter validation", "hidden": true, "default": "genomes" - }, - "config_profile_name": { - "type": "string", - "description": "String to describe the config profile that is run.", - "fa_icon": "fas fa-id-badge", - "hidden": true } }, "fa_icon": "fas fa-file-import", @@ -302,6 +296,7 @@ "description": "Maximum amount of memory that can be requested for any single job.", "default": "128.GB", "fa_icon": "fas fa-memory", + "pattern": "^[\\d\\.]+\\s*.(K|M|G|T)?B$", "hidden": true, "help_text": "Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. `--max_memory '8.GB'`" }, @@ -310,6 +305,7 @@ "description": "Maximum amount of time that can be requested for any single job.", "default": "240.h", "fa_icon": "far fa-clock", + "pattern": "^(\\d+(\\.\\d+)?(?:\\s*|\\.?)(s|m|h|d)\\s*)+$", "hidden": true, "help_text": "Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. `--max_time '2.h'`" } @@ -344,6 +340,12 @@ "hidden": true, "fa_icon": "fas fa-users-cog" }, + "config_profile_name": { + "type": "string", + "description": "Institutional config name.", + "hidden": true, + "fa_icon": "fas fa-users-cog" + }, "config_profile_description": { "type": "string", "description": "Institutional config description.", @@ -607,7 +609,6 @@ }, "bt2n": { "type": "integer", - "default": 0, "description": "Specify the -N parameter for bowtie2 (mismatches in seed). 
This will override defaults from alignmode/sensitivity.", "fa_icon": "fas fa-sort-numeric-down", "help_text": "The number of mismatches allowed in the seed during seed-and-extend procedure of Bowtie2. This will override any values set with `--bt2_sensitivity`. Can either be 0 or 1. Default: 0 (i.e. use`--bt2_sensitivity` defaults).\n\n> Modifies Bowtie2 parameters: `-N`", @@ -618,21 +619,18 @@ }, "bt2l": { "type": "integer", - "default": 0, "description": "Specify the -L parameter for bowtie2 (length of seed substrings). This will override defaults from alignmode/sensitivity.", "fa_icon": "fas fa-ruler-horizontal", "help_text": "The length of the seed sub-string to use during seeding. This will override any values set with `--bt2_sensitivity`. Default: 0 (i.e. use`--bt2_sensitivity` defaults: [20 for local and 22 for end-to-end](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line).\n\n> Modifies Bowtie2 parameters: `-L`" }, "bt2_trim5": { "type": "integer", - "default": 0, "description": "Specify number of bases to trim off from 5' (left) end of read before alignment.", "fa_icon": "fas fa-cut", "help_text": "Number of bases to trim at the 5' (left) end of read prior alignment. Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0\n\n> Modifies Bowtie2 parameters: `-bt2_trim5`" }, "bt2_trim3": { "type": "integer", - "default": 0, "description": "Specify number of bases to trim off from 3' (right) end of read before alignment.", "fa_icon": "fas fa-cut", "help_text": "Number of bases to trim at the 3' (right) end of read prior alignment. 
Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0.\n\n> Modifies Bowtie2 parameters: `-bt2_trim3`" @@ -683,14 +681,12 @@ }, "bam_mapping_quality_threshold": { "type": "integer", - "default": 0, "description": "Minimum mapping quality for reads filter.", "fa_icon": "fas fa-greater-than-equal", "help_text": "Specify a mapping quality threshold for mapped reads to be kept for downstream analysis. By default keeps all reads and is therefore set to `0` (basically doesn't filter anything).\n\n> Modifies samtools view parameter: `-q`" }, "bam_filter_minreadlength": { "type": "integer", - "default": 0, "fa_icon": "fas fa-ruler-horizontal", "description": "Specify minimum read length to be kept after mapping.", "help_text": "Specify minimum length of mapped reads. This filtering will apply at the same time as mapping quality filtering.\n\nIf used _instead_ of minimum length read filtering at AdapterRemoval, this can be useful to get more realistic endogenous DNA percentages, when most of your reads are very short (e.g. in single-stranded libraries) and would otherwise be discarded by AdapterRemoval (thus making an artificially small denominator for a typical endogenous DNA calculation). Note in this context you should not perform mapping quality filtering nor discarding of unmapped reads to ensure a correct denominator of all reads, for the endogenous DNA calculation.\n\n> Modifies filter_bam_fragment_length.py parameter: `-l`" @@ -817,6 +813,12 @@ "fa_icon": "fas fa-greater-than-equal", "help_text": "The maximum number of reads used for damage assessment in PMDtools. Can be used to significantly reduce the amount of time required for damage assessment in PMDTools. 
Note that a too low value can also obtain incorrect results.\n\n> Modifies PMDTools parameter: `-n`" }, + "pmdtools_platypus": { + "type": "boolean", + "description": "Append a large list of base frequencies for platypus to the output.", + "fa_icon": "fas fa-power-off", + "help_text": "Enables printing of a wider list of base frequencies used by platypus, in addition to the output base misincorporation frequency table. Turned off by default.\n" + }, "run_mapdamage_rescaling": { "type": "boolean", "fa_icon": "fas fa-map", @@ -1049,7 +1051,6 @@ }, "freebayes_g": { "type": "integer", - "default": 0, "description": "Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than specified in --freebayes_C.", "fa_icon": "fab fa-think-peaks", "help_text": "Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than specified C. Not set by default.\n\n> Modifies freebayes parameter: `-g`"