diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index 13b406be4..97f223d86 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -18,8 +18,8 @@ If you'd like to write some code for nf-core/eager, the standard workflow is as 1. Check that there isn't already an issue about your idea in the [nf-core/eager issues](https://github.com/nf-core/eager/issues) to avoid duplicating work * If there isn't one already, please create one so that others know you're working on this 2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [nf-core/eager repository](https://github.com/nf-core/eager) to your GitHub account -3. Make the necessary changes / additions within your forked repository (following [code contribution guidelines](https://github.com/nf-core/eager/blob/dev/.github/CONTRIBUTING.md)) -4. Use `nf-core schema build .` and add any new parameters to the pipeline JSON schema (requires nf-core tools >= 1.10). +3. Make the necessary changes / additions within your forked repository following [Pipeline conventions](#pipeline-contribution-conventions) +4. Use `nf-core schema build .` and add any new parameters to the pipeline JSON schema (requires [nf-core tools](https://github.com/nf-core/tools) >= 1.10). 5. Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged If you're not used to this workflow with git, you can start with some [docs from GitHub](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests) or even their [excellent `git` resources](https://try.github.io/). @@ -31,14 +31,14 @@ Typically, pull-requests are only fully reviewed when these tests are passing, t There are typically two types of tests that run: -### Lint Tests +### Lint tests `nf-core` has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to. 
To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint ` command. If any failures or warnings are encountered, please follow the listed URL for more documentation. -### Pipeline Tests +### Pipeline tests Each `nf-core` pipeline should be set up with a minimal set of test-data. `GitHub Actions` then runs the pipeline on this data to ensure that it exits successfully. @@ -57,19 +57,19 @@ These tests are run both with the latest available version of `Nextflow` and als For further information/help, please consult the [nf-core/eager documentation](https://nf-co.re/eager/usage) and don't hesitate to get in touch on the nf-core Slack [#eager](https://nfcore.slack.com/channels/eager) channel ([join our Slack here](https://nf-co.re/join/slack)). -# Code Contribution Guidelines +## Pipeline contribution conventions -To make the EAGER2 code and processing logic more understandable for new contributors, and to ensure quality. We are making an attempt to somewhat-standardise the way the code is written. +To make the nf-core/eager code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written. -If you wish to contribute a new module, please use the following coding standards. +### Adding a new step -The typical workflow for adding a new module is as follows: +If you wish to contribute a new step, please use the following coding standards: -1. Define the corresponding input channel into your new process from the expected previous process channel (or re-routing block, see below). +1. Define the corresponding input channel into your new process from the expected previous process channel 2. Write the process block (see below). 3. 
Define the output channel if needed (see below). 4. Add any new flags/options to `nextflow.config` with a default (see below). -5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build .`) +5. Add any new flags/options to `nextflow_schema.json` **with help text** (with `nf-core schema build .`) 6. Add any new flags/options to the help message (for integer/text parameters, print to help the corresponding `nextflow.config` parameter). 7. Add sanity checks for all relevant parameters. 8. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`. @@ -77,16 +77,60 @@ The typical workflow for adding a new module is as follows: 10. Add a new test command in `.github/workflow/ci.yaml`. 11. If applicable add a [MultiQC](https://multiqc.info/) module. 12. Update MultiQC config `assets/multiqc_config.yaml` so relevant suffixes, name clean up, General Statistics Table column order, and module figures are in the right order. -13. Add new flags/options to 'usage' documentation under `docs/usage.md`. -14. Add any descriptions of MultiQC report sections and output files to `docs/output.md`. +13. Optional: Add any descriptions of MultiQC report sections and output files to `docs/output.md`. -## Default Values +### Default values -Default values should go in `nextflow.config` under the `params` scope, and `nextflow_schema.json` (latter with `nf-core schema build .`) +Parameters should be initialised / defined with default values in `nextflow.config` under the `params` scope. -## Default resource processes +Once there, use `nf-core schema build .` to add them to `nextflow_schema.json`. -Defining recommended 'minimum' resource requirements (CPUs/Memory) for a process should be defined in `conf/base.config`. This can be utilised within the process using `${task.cpu}` or `${task.memory}` variables in the `script:` block.
+### Default process resource requirements + +Sensible defaults for process resource requirements (CPUs / memory / time) should be defined in `conf/base.config`. These should generally be specified generically with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. An nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/%7B%7Bcookiecutter.name_noslash%7D%7D/conf/base.config), which has the default process as a single-core process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels. + +:warning: Note that in nf-core/eager we currently have our own custom process labels, so please check `base.config`! + +The process resources can be passed on to the tool dynamically within the process with the `${task.cpus}` and `${task.memory}` variables in the `script:` block. + +### Naming schemes + +Please use the following naming schemes to make it easy to understand what is going where. + +* initial process channel: `ch_output_from_` +* intermediate and terminal channels: `ch__for_` +* skipped process output: `ch__for_` (this goes out of the bypass statement described below) + +### Nextflow version bumping + +If you are using a new feature from core Nextflow, you may bump the minimum required version of Nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]` + +### Software version reporting + +If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_version` process. + +Add to the script block of the process, something like the following: + +```bash + --version &> v_.txt 2>&1 || true +``` + +or + +```bash + --help | head -n 1 &> v_.txt 2>&1 || true +``` + +You then need to edit the script `bin/scrape_software_versions.py` to: + +1.
Add a Python regex for your tool's `--version` output (as stored in the `v_.txt` file), to ensure the version is reported as a `v` followed by the version number, e.g. `v2.1.1` +2. Add an HTML entry to the `OrderedDict` for formatting in MultiQC. + +### Images and figures + +For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines). + +For all internal nf-core/eager documentation images we use the 'Kalam' font by the Indian Type Foundry, licensed under the Open Font License. It can be downloaded [here](https://fonts.google.com/specimen/Kalam). ## Process Concept @@ -164,44 +208,3 @@ if (params.run_fastp) { } ``` - -## Naming Schemes - -Please use the following naming schemes, to make it easy to understand what is going where. - -* process output: `ch_output_from_`(this should always go into the bypass statement described above). -* skipped process output: `ch__for_`(this goes out of the bypass statement described above) -* process inputs: `ch__for_` (this goes into a process) - -## Nextflow Version Bumping - -If you have agreement from reviewers, you may bump the 'default' minimum version of nextflow (e.g. for testing), with `nf-core bump-version`. - -## Software Version Reporting - -If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_version` process. - -Add to the script block of the process, something like the following: - -```bash - --version &> v_.txt 2>&1 || true -``` - -or - -```bash - --help | head -n 1 &> v_.txt 2>&1 || true -``` - -You then need to edit the script `bin/scrape_software_versions.py` to - -1. add a (python) regex for your tools --version output (as in stored in the `v_.txt` file), to ensure the version is reported as a `v` and the version number e.g. `v2.1.1` -2. add a HTML block entry to the `OrderedDict` for formatting in MultiQC.
- -> If a tool does not unfortunately offer any printing of version data, you may add this 'manually' e.g. with `echo "v1.1" > v_.txt` - -## Images and Figures - -For all internal nf-core/eager documentation images we are using the 'Kalam' font by the Indian Type Foundry and licensed under the Open Font License. It can be found for download here [here](https://fonts.google.com/specimen/Kalam). - -For the overview image we follow the nf-core [style guidelines](https://nf-co.re/developers/design_guidelines). diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md index afbcc7c7f..f00ef2e57 100644 --- a/.github/ISSUE_TEMPLATE/bug_report.md +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -15,12 +15,11 @@ Please delete this text and anything that's not relevant from the template below ## Check Documentation -Have you checked in the following places for your error?: +I have checked the following places for my error: -- [ ] [Frequently Asked Questions](https://github.com/nf-core/eager/blob/master/docs/usage.md#troubleshooting-and-faqs) - (for nf-core/eager specific information) -- [ ] [Troubleshooting](https://nf-co.re/usage/troubleshooting) - (for nf-core specific information) +- [ ] [nf-core website: troubleshooting](https://nf-co.re/usage/troubleshooting) +- [ ] [nf-core/eager pipeline documentation](https://nf-co.re/nf-core/eager/usage) + - nf-core/eager FAQ/troubleshooting can be found [here](https://nf-co.re/eager/usage#troubleshooting-and-faqs) ## Description of the bug @@ -39,9 +38,11 @@ Steps to reproduce the behaviour: ## Log files -1. Command line: -2. The `.nextflow.log` file (which is a hidden file in whichever place you _ran_ the pipeline from - not necessarily in the output directory!) -3.
See error: +Have you provided the following extra information/files: + +- [ ] The command used to run the pipeline +- [ ] The `.nextflow.log` file +- [ ] The exact error: ## System diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 0fa8432a3..57a13ac3e 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -13,13 +13,14 @@ Learn more about contributing: [CONTRIBUTING.md](https://github.com/nf-core/eage ## PR checklist -- [ ] This comment contains a description of changes (with reason) +- [ ] This comment contains a description of changes (with reason). - [ ] If you've fixed a bug or added code that should be tested, add tests! -- [ ] If necessary, also make a PR on the [nf-core/eager branch on the nf-core/test-datasets repo](https://github.com/nf-core/test-datasets/pull/new/nf-core/eager) -- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --paired_end`). -- [ ] Make sure your code lints ([`nf-core lint .`](https://nf-co.re/tools)). -- [ ] Documentation in `docs` is updated -- [ ] `CHANGELOG.md` is updated -- [ ] `README.md` is updated - -**Learn more about contributing:** [CONTRIBUTING.md](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md) + - [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py` + - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/eager/tree/master/.github/CONTRIBUTING.md) + - [ ] If necessary, also make a PR on the nf-core/eager _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository. +- [ ] Make sure your code lints (`nf-core lint .`). +- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`). +- [ ] Usage Documentation in `docs/usage.md` is updated. +- [ ] Output Documentation in `docs/output.md` is updated. +- [ ] `CHANGELOG.md` is updated. 
+- [ ] `README.md` is updated (including new tool citations and authors/contributors). diff --git a/.github/markdownlint.yml b/.github/markdownlint.yml index 0967bbbb8..8d7eb53b0 100644 --- a/.github/markdownlint.yml +++ b/.github/markdownlint.yml @@ -1,9 +1,9 @@ # Markdownlint configuration file -default: true, +default: true line-length: false no-duplicate-header: siblings_only: true -no-inline-html: +no-inline-html: allowed_elements: - img - p diff --git a/CHANGELOG.md b/CHANGELOG.md index 08e7bdac8..3fba7801a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -13,9 +13,9 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0. - Fixed AWS full test profile. - [#587](https://github.com/nf-core/eager/issues/587) - Re-implemented AdapterRemovalFixPrefix for DeDup compatibility of including singletons -- [#602](https://github.com/nf-core/eager/issues/602) - Added the newly avaliable GATK 3.5 conda package. +- [#602](https://github.com/nf-core/eager/issues/602) - Added the newly available GATK 3.5 conda package. - [#610](https://github.com/nf-core/eager/issues/610) - Create bwa_index channel when specifying circularmapper as mapper -- Updated template to nf-core/tools 1.12 +- Updated template to nf-core/tools 1.12.1 - General documentation improvements ### `Deprecated` diff --git a/Dockerfile b/Dockerfile index e51515584..b9d2d771d 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,4 +1,4 @@ -FROM nfcore/base:1.12 +FROM nfcore/base:1.12.1 LABEL authors="The nf-core/eager community" \ description="Docker image containing all software requirements for the nf-core/eager pipeline" diff --git a/README.md b/README.md index 0565bdd62..05a04862a 100644 --- a/README.md +++ b/README.md @@ -16,9 +16,10 @@ ## Introduction -**nf-core/eager** is a bioinformatics best-practice analysis pipeline for NGS sequencing based ancient DNA (aDNA) data analysis. 
+ +**nf-core/eager** is a bioinformatics best-practice analysis pipeline for NGS sequencing based ancient DNA (aDNA) data analysis. -The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. The pipeline pre-processes raw data from FASTQ inputs, or preprocessed BAM inputs. It can align reads and performs extensive general NGS and aDNA specific quality-control on the results. It comes with docker, singularity or conda containers making installation trivial and results highly reproducible. +The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. The pipeline pre-processes raw data from FASTQ inputs, or preprocessed BAM inputs. It can align reads and perform extensive general NGS and aDNA specific quality-control on the results. It comes with docker, singularity or conda containers making installation trivial and results highly reproducible.

nf-core/eager schematic workflow + +* Sequencing quality control (`FastQC`) +* Overall pipeline run summaries (`MultiQC`) + ## Documentation The nf-core/eager pipeline comes with documentation about the pipeline: [usage](https://nf-co.re/eager/usage) and [output](https://nf-co.re/eager/output). @@ -130,9 +140,8 @@ The nf-core/eager pipeline comes with documentation about the pipeline: [usage]( This pipeline was mostly written by Alexander Peltzer ([apeltzer](https://github.com/apeltzer)) and [James A. Fellows Yates](https://github.com/jfy133), with contributions from [Stephen Clayton](https://github.com/sc13-bioinf), [Thiseas C. Lamnidis](https://github.com/TCLamnidis), [Maxime Borry](https://github.com/maxibor), [Zandra Fagernäs](https://github.com/ZandraFagernas), [Aida Andrades Valtueña](https://github.com/aidaanva) and [Maxime Garcia](https://github.com/MaxUlysse) and the nf-core community. -If you would like to contribute to this pipeline, please open an issue (or even better, a pull request - please see the [contributing guidelines](.github/CONTRIBUTING.md), and ask to be added to the project - everyone is welcome to contribute here!. - -For further information or help, don't hesitate to get in touch on the [Slack `#eager` channel](https://nfcore.slack.com/channels/eager) (you can join with [this invite](https://nf-co.re/join/slack)). +We thank the following people for their extensive assistance in the development +of this pipeline: ## Authors (alphabetical) @@ -166,7 +175,29 @@ Those who have provided conceptual guidance, suggestions, bug reports etc. If you've contributed and you're missing in here, please let us know and we will add you in of course! -## Tool References +## Contributions and Support + +If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md). 
+ +For further information or help, don't hesitate to get in touch on the [Slack `#eager` channel](https://nfcore.slack.com/channels/eager) (you can join with [this invite](https://nf-co.re/join/slack)). + +## Citations + +If you use `nf-core/eager` for your analysis, please cite the `eager` preprint as follows: +> James A. Fellows Yates, Thiseas Christos Lamnidis, Maxime Borry, Aida Andrades Valtueña, Zandra Fagernäs, Stephen Clayton, Maxime U. Garcia, Judith Neukamm, Alexander Peltzer **Reproducible, portable, and efficient ancient genome reconstruction with nf-core/eager** bioRxiv 2020.06.11.145615; [doi: https://doi.org/10.1101/2020.06.11.145615](https://doi.org/10.1101/2020.06.11.145615) + +You can cite the eager Zenodo record for a specific version using the following [doi: 10.5281/zenodo.3698082](https://zenodo.org/badge/latestdoi/135918251) + +You can cite the `nf-core` publication as follows: + +> **The nf-core framework for community-curated bioinformatics pipelines.** +> +> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. +> +> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x). +> ReadCube: [Full Access Link](https://rdcu.be/b1GjZ) + +In addition, references of tools and data used in this pipeline are as follows: * **EAGER v1**, CircularMapper, DeDup* Peltzer, A., Jäger, G., Herbig, A., Seitz, A., Kniep, C., Krause, J., & Nieselt, K. (2016). EAGER: efficient ancient genome reconstruction. Genome Biology, 17(1), 1–14. [https://doi.org/10.1186/s13059-016-0918-z](https://doi.org/10.1186/s13059-016-0918-z).
Download: [https://github.com/apeltzer/EAGER-GUI](https://github.com/apeltzer/EAGER-GUI) and [https://github.com/apeltzer/EAGER-CLI](https://github.com/apeltzer/EAGER-CLI) * **FastQC** Download: [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) @@ -205,19 +236,3 @@ This repository uses test data from the following studies: * Fellows Yates, J. A. et al. (2017) ‘Central European Woolly Mammoth Population Dynamics: Insights from Late Pleistocene Mitochondrial Genomes’, Scientific reports, 7(1), p. 17714. [doi: 10.1038/s41598-017-17723-1](https://doi.org/10.1038/s41598-017-17723-1). * Gamba, C. et al. (2014) ‘Genome flux and stasis in a five millennium transect of European prehistory’, Nature communications, 5, p. 5257. [doi: 10.1038/ncomms6257](https://doi.org/10.1038/ncomms6257). * Star, B. et al. (2017) ‘Ancient DNA reveals the Arctic origin of Viking Age cod from Haithabu, Germany’, Proceedings of the National Academy of Sciences of the United States of America, 114(34), pp. 9152–9157. [doi: 10.1073/pnas.1710186114](https://doi.org/10.1073/pnas.1710186114). - -## Citation - -If you use `nf-core/eager` for your analysis, please cite the `eager` preprint as follows: -> James A. Fellows Yates, Thiseas Christos Lamnidis, Maxime Borry, Aida Andrades Valtueña, Zandra Fagneräs, Stephen Clayton, Maxime U. 
Garcia, Judith Neukamm, Alexander Peltzer **Reproducible, portable, and efficient ancient genome reconstruction with nf-core/eager** bioRxiv 2020.06.11.145615; [doi: https://doi.org/10.1101/2020.06.11.145615](https://doi.org/10.1101/2020.06.11.145615) - -You can cite the eager zenodo record for a specific version using the following [doi: 10.5281/zenodo.3698082](https://zenodo.org/badge/latestdoi/135918251) - -You can cite the `nf-core` publication as follows: - -> **The nf-core framework for community-curated bioinformatics pipelines.** -> -> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. -> -> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x). -> ReadCube: [Full Access Link](https://rdcu.be/b1GjZ) diff --git a/assets/nf-core-eager_logo.png b/assets/nf-core-eager_logo.png index f464b6b3a..4d301d806 100644 Binary files a/assets/nf-core-eager_logo.png and b/assets/nf-core-eager_logo.png differ diff --git a/conf/igenomes.config b/conf/igenomes.config index caeafceb2..31b7ee613 100644 --- a/conf/igenomes.config +++ b/conf/igenomes.config @@ -21,7 +21,7 @@ params { readme = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt" mito_name = "MT" macs_gsize = "2.7e9" - blacklist = "${baseDir}/assets/blacklists/GRCh37-blacklist.bed" + blacklist = "${projectDir}/assets/blacklists/GRCh37-blacklist.bed" } 'GRCh38' { fasta = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa" @@ -33,7 +33,7 @@ params { bed12 = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed" mito_name = "chrM" macs_gsize = "2.7e9" - blacklist = "${baseDir}/assets/blacklists/hg38-blacklist.bed" + blacklist = "${projectDir}/assets/blacklists/hg38-blacklist.bed" } 'GRCm38' { fasta = 
"${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/WholeGenomeFasta/genome.fa" @@ -46,7 +46,7 @@ params { readme = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/README.txt" mito_name = "MT" macs_gsize = "1.87e9" - blacklist = "${baseDir}/assets/blacklists/GRCm38-blacklist.bed" + blacklist = "${projectDir}/assets/blacklists/GRCm38-blacklist.bed" } 'TAIR10' { fasta = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/genome.fa" @@ -270,7 +270,7 @@ params { bed12 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.bed" mito_name = "chrM" macs_gsize = "2.7e9" - blacklist = "${baseDir}/assets/blacklists/hg38-blacklist.bed" + blacklist = "${projectDir}/assets/blacklists/hg38-blacklist.bed" } 'hg19' { fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa" @@ -283,7 +283,7 @@ params { readme = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/README.txt" mito_name = "chrM" macs_gsize = "2.7e9" - blacklist = "${baseDir}/assets/blacklists/hg19-blacklist.bed" + blacklist = "${projectDir}/assets/blacklists/hg19-blacklist.bed" } 'mm10' { fasta = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa" @@ -296,7 +296,7 @@ params { readme = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/README.txt" mito_name = "chrM" macs_gsize = "1.87e9" - blacklist = "${baseDir}/assets/blacklists/mm10-blacklist.bed" + blacklist = "${projectDir}/assets/blacklists/mm10-blacklist.bed" } 'bosTau8' { fasta = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/WholeGenomeFasta/genome.fa" diff --git a/docs/images/nf-core-eager_logo.png b/docs/images/nf-core-eager_logo.png index 744089106..4d301d806 100644 Binary files a/docs/images/nf-core-eager_logo.png and b/docs/images/nf-core-eager_logo.png differ diff --git a/docs/usage.md b/docs/usage.md index 78c55cddf..7d440ce8a 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -368,7 
+368,7 @@ wildcards (`*`) e.g.: 7. For input BAM files you should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating - the index files of the reference genome, even if you do not actual need a + the index files of the reference genome, even if you do not actually need a reference genome for any downstream analyses. ##### TSV Input Method @@ -463,7 +463,7 @@ will have the following effects: - After mapping, and prior BAM filtering, BAM files with different `SeqType` (but with all other metadata columns the same) will be merged together for each **Library**. -- After duplicate removal, BAM files with `Library_ID`s with the same +- After duplicate removal, BAM files with different `Library_ID`s but with the same `Sample_Name` and the same `UDG_Treatment` will be merged together. - If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and half-UDG ) will be merged with UDG-treated (untreated) BAMs, if they have the @@ -490,7 +490,7 @@ Note the following important points and limitations for setting up: - You should provide a small decoy reference genome with pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in order to avoid long computational time for generating the index files of the - reference genome, even if you do not actual need a reference genome for any + reference genome, even if you do not actually need a reference genome for any downstream analyses. - nf-core/eager will only merge multiple _lanes_ of sequencing runs with the same single-end or paired-end configuration @@ -547,7 +547,7 @@ Default: `'none'`. > **Tip**: You should provide a small decoy reference genome with pre-made indices, e.g. 
> the human mtDNA genome, for the mandatory parameter `--fasta` in order to > avoid long computational time for generating the index files of the reference -> genome, even if you do not actual need a reference genome for any downstream +> genome, even if you do not actually need a reference genome for any downstream > analyses. #### `--single_stranded` diff --git a/environment.yml b/environment.yml index a29487ea0..e0e05c545 100644 --- a/environment.yml +++ b/environment.yml @@ -1,3 +1,5 @@ +# You can use this file to create a conda environment for this pipeline: +# conda env create -f environment.yml name: nf-core-eager-2.2.2 channels: - conda-forge diff --git a/nextflow_schema.json b/nextflow_schema.json index 47d42f133..194d4b74b 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -19,14 +19,14 @@ "default": "null", "description": "Either paths or URLs to FASTQ/BAM data (must be surrounded with quotes). For paired end data, the path must use '{1,2}' notation to specify read pairs. Alternatively, a path to a TSV file (ending .tsv) containing file paths and sequencing/sample metadata. Allows for merging of multiple lanes/libraries/samples. Please see documentation for template.", "fa_icon": "fas fa-dna", - "help_text": "There are two possible ways of supplying input sequencing data to nf-core/eager.\nThe most efficient but more simplistic is supplying direct paths (with\nwildcards) to your FASTQ or BAM files, with each file or pair being considered a\nsingle library and each one run independently. TSV input requires creation of an\nextra file by the user and extra metadata, but allows more powerful lane and\nlibrary merging.\n\n##### Direct Input Method\n\nThis method is where you specify with `--input`, the path locations of FASTQ\n(optionally gzipped) or BAM file(s). 
This option is mutually exclusive to the\n[TSV input method](#tsv-input-method), which is used for more complex input\nconfigurations such as lane and library merging.\n\nWhen using the direct method of `--input` you can specify one or multiple\nsamples in one or more directories files. File names **must be unique**, even if\nin different directories. \n\nBy default, the pipeline _assumes_ you have paired-end data. If you want to run\nsingle-end data you must specify [`--single_end`]('#single_end')\n\nFor example, for a single set of FASTQs, or multiple paired-end FASTQ\nfiles in one directory, you can specify:\n\n```bash\n--input 'path/to/data/sample_*_{1,2}.fastq.gz'\n```\n\nIf you have multiple files in different directories, you can use additional\nwildcards (`*`) e.g.:\n\n```bash\n--input 'path/to/data/*/sample_*_{1,2}.fastq.gz'\n```\n\n> :warning: It is not possible to run a mixture of single-end and paired-end\n> files in one run with the paths `--input` method! Please see the [TSV input\n> method](#tsv-input-method) for possibilities.\n\n**Please note** the following requirements:\n\n1. Valid file extensions: `.fastq.gz`, `.fastq`, `.fq.gz`, `.fq`, `.bam`.\n2. The path **must** be enclosed in quotes\n3. The path must have at least one `*` wildcard character\n4. When using the pipeline with **paired end data**, the path must use `{1,2}`\n notation to specify read pairs.\n5. Files names must be unique, having files with the same name, but in different\n directories is _not_ sufficient\n - This can happen when a library has been sequenced across two sequencers on\n the same lane. Either rename the file, try a symlink with a unique name, or\n merge the two FASTQ files prior input.\n6. Due to limitations of downstream tools (e.g. FastQC), sample IDs may be\n truncated after the first `.` in the name, Ensure file names are unique prior\n to this!\n7. For input BAM files you should provide a small decoy reference genome with\n pre-made indices, e.g. 
the human mtDNA or phiX genome, for the mandatory\n parameter `--fasta` in order to avoid long computational time for generating\n the index files of the reference genome, even if you do not actual need a\n reference genome for any downstream analyses.\n\n##### TSV Input Method\n\nAlternatively to the [direct input method](#direct-input-method), you can supply\nto `--input` a path to a TSV file that contains paths to FASTQ/BAM files and\nadditional metadata. This allows for more complex procedures such as merging of\nsequencing data across lanes, sequencing runs, sequencing configuration types,\nand samples.\n\n

\n \"Schematic\n

\n\nThe use of the TSV `--input` method is recommended when performing\nmore complex procedures such as lane or library merging. You do not need to\nspecify `--single_end`, `--bam`, `--colour_chemistry`, `-udg_type` etc. when\nusing TSV input - this is defined within the TSV file itself. You can only\nsupply a single TSV per run (i.e. `--input '*.tsv'` will not work).\n\nThis TSV should look like the following:\n\n| Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM |\n|-------------|------------|------|------------------|--------|----------|--------------|---------------|----|----|-----|\n| JK2782 | JK2782 | 1 | 4 | PE | Mammoth | double | full | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz) | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz) | NA |\n| JK2802 | JK2802 | 2 | 2 | SE | Mammoth | double | full | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz) | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz) | NA |\n\nA template can be taken 
from\n[here](https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/TSV_template.tsv).\n\n> :warning: Cells **must not** contain spaces before or after strings, as this\n> will make the TSV unreadable by nextflow. Strings containing spaces should be\n> wrapped in quotes.\n\nWhen using TSV_input, nf-core/eager will merge FASTQ files of libraries with the\nsame `Library_ID` but different `Lanes` values after adapter clipping (and\nmerging), assuming all other metadata columns are the same. If you have the same\n`Library_ID` but with different `SeqType`, this will be merged directly after\nmapping prior BAM filtering. Finally, it will also merge BAM files with the same\n`Sample_ID` but different `Library_ID` after duplicate removal, but prior to\ngenotyping. Please see caveats to this below.\n\nColumn descriptions are as follows:\n\n- **Sample_Name:** A text string containing the name of a given sample of which\n there can be multiple libraries. All libraries with the same sample name and\n same SeqType will be merged after deduplication.\n- **Library_ID:** A text string containing a given library, which there can be\n multiple sequencing lanes (with the same SeqType).\n- **Lane:** A number indicating which lane the library was sequenced on. Files\n from the libraries sequenced on different lanes (and different SeqType) will\n be concatenated after read clipping and merging.\n- **Colour Chemistry** A number indicating whether the Illumina sequencer the\n library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour\n chemistry machine. This informs whether poly-G trimming (if turned on) should\n be performed.\n- **SeqType:** A text string of either 'PE' or 'SE', specifying paired end (with\n both an R1 [or forward] and R2 [or reverse]) and single end data (only R1\n [forward], or BAM). This will affect lane merging if different per library.\n- **Organism:** A text string of the organism name of the sample or 'NA'. 
This\n currently has no functionality and can be set to 'NA', but will affect\n lane/library merging if different per library\n- **Strandedness:** A text string indicating whether the library type is\n 'single' or 'double'. This will affect lane/library merging if different per\n library.\n- **UDG_Treatment:** A text string indicating whether the library was generated\n with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library\n merging if different per library.\n- **R1:** A text string of a file path pointing to a forward or R1 FASTQ file.\n This can be used with the R2 column. File names **must be unique**, even if\n they are in different directories.\n- **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file,\n or 'NA' when single end data. This can be used with the R1 column. File names\n **must be unique**, even if they are in different directories.\n- **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot\n be specified at the same time as R1 or R2, both of which should be set to 'NA'\n\nFor example, the following TSV table:\n\n| Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM |\n|-------------|------------|------|------------------|---------|----------|--------------|---------------|----------------------------------------------------------------|----------------------------------------------------------------|-----|\n| JK2782 | JK2782 | 7 | 4 | PE | Mammoth | double | full | data/JK2782_TGGCCGATCAACGA_L007_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA |\n| JK2782 | JK2782 | 8 | 4 | PE | Mammoth | double | full | data/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz | NA |\n| JK2802 | JK2802 | 7 | 4 | PE | Mammoth | double | full | data/JK2802_AGAATAACCTACCA_L007_R1_001.fastq.gz.tengrand.fq.gz | 
data/JK2802_AGAATAACCTACCA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA |\n| JK2802 | JK2802 | 8 | 4 | SE | Mammoth | double | full | data/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz | NA | NA |\n\nwill have the following effects:\n\n- After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8\n _with the same `SeqType`_ (and all other _metadata_ columns) will be\n concatenated together for each **Library**.\n- After mapping, and prior BAM filtering, BAM files with different\n `SeqType` (but with all other metadata columns the same) will be merged\n together for each **Library**.\n- After duplicate removal, BAM files with `Library_ID`s with the same\n `Sample_Name` and the same `UDG_Treatment` will be merged together.\n- If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and\n half-UDG ) will be merged with UDG-treated (untreated) BAMs, if they have the\n same `Sample_Name`.\n\nNote the following important points and limitations for setting up:\n\n- The TSV must use actual tabs (not spaces) between cells.\n- *File* names must be unique regardless of file path, due to risk of\n over-writing (see:\n [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)).\n - If it is 'too late' and you already have duplicate file names, a workaround is\n to concatenate the FASTQ files together and supply this to a nf-core/eager\n run. The only downside is that you will not get independent FASTQC results\n for each file.\n- Lane IDs must be unique for each sequencing of each library.\n - If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can\n give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will\n still be processed correctly.\n - This also applies to the SeqType column, i.e. 
with the example above, if one\n run is PE and one run is SE, you need to give fake lane IDs to one of the\n runs as well.\n- All _BAM_ files must be specified as `SE` under `SeqType`.\n - You should provide a small decoy reference genome with pre-made indices, e.g.\n the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in\n order to avoid long computational time for generating the index files of the\n reference genome, even if you do not actual need a reference genome for any\n downstream analyses.\n- nf-core/eager will only merge multiple _lanes_ of sequencing runs with the\n same single-end or paired-end configuration\n- Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files\n (unless you use `--run_convertbam`), as only FASTQ files are lane-merged\n together.\n- Same libraries that are sequenced on different sequencing configurations (i.e\n single- and paired-end data), will be merged after mapping and will _always_\n be considered 'paired-end' during downstream processes\n - **Important** running DeDup in this context is _not_ recommended, as PE and\n SE data at the same position will _not_ be evaluated as duplicates.\n Therefore not all duplicates will be removed.\n - When you wish to run PE/SE data together `-dedupper markduplicates` is\n therefore preferred.\n - An error will be thrown if you try to merge both PE and SE and also supply\n `--skip_merging`.\n - If you truly want to mix SE data and PE data but using mate-pair info for PE\n mapping, please run FASTQ preprocessing mapping manually and supply BAM\n files for downstream processing by nf-core/eager\n - If you _regularly_ want to run the situation above, please leave a feature\n request on github.\n- DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on\n each unique library separately after deduplication (but prior same-treated\n library merging).\n- nf-core/eager functionality such as `--run_trim_bam` will be applied to only\n non-UDG 
(UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries.\n- Qualimap is run on each sample, after merging of libraries (i.e. your values\n will reflect the values of all libraries combined - after being damage trimmed\n etc.).\n- Genotyping will be typically performed on each `sample` independently, as\n normally all libraries will have been merged together. However, if you have a\n mixture of single-stranded and double-stranded libraries, you will normally\n need to genotype separately. In this case you **must** give each the SS and DS\n libraries _distinct_ `Sample_IDs`; otherwise you will receive a `file\n collision` error in steps such as `sexdeterrmine`, and then you will need to\n merge these yourself. We will consider changing this behaviour in the future\n if there is enough interest." + "help_text": "There are two possible ways of supplying input sequencing data to nf-core/eager.\nThe most efficient but more simplistic is supplying direct paths (with\nwildcards) to your FASTQ or BAM files, with each file or pair being considered a\nsingle library and each one run independently. TSV input requires creation of an\nextra file by the user and extra metadata, but allows more powerful lane and\nlibrary merging.\n\n##### Direct Input Method\n\nThis method is where you specify with `--input`, the path locations of FASTQ\n(optionally gzipped) or BAM file(s). This option is mutually exclusive to the\n[TSV input method](#tsv-input-method), which is used for more complex input\nconfigurations such as lane and library merging.\n\nWhen using the direct method of `--input` you can specify one or multiple\nsamples in one or more directories files. File names **must be unique**, even if\nin different directories. \n\nBy default, the pipeline _assumes_ you have paired-end data. 
If you want to run\nsingle-end data you must specify [`--single_end`]('#single_end')\n\nFor example, for a single set of FASTQs, or multiple paired-end FASTQ\nfiles in one directory, you can specify:\n\n```bash\n--input 'path/to/data/sample_*_{1,2}.fastq.gz'\n```\n\nIf you have multiple files in different directories, you can use additional\nwildcards (`*`) e.g.:\n\n```bash\n--input 'path/to/data/*/sample_*_{1,2}.fastq.gz'\n```\n\n> :warning: It is not possible to run a mixture of single-end and paired-end\n> files in one run with the paths `--input` method! Please see the [TSV input\n> method](#tsv-input-method) for possibilities.\n\n**Please note** the following requirements:\n\n1. Valid file extensions: `.fastq.gz`, `.fastq`, `.fq.gz`, `.fq`, `.bam`.\n2. The path **must** be enclosed in quotes\n3. The path must have at least one `*` wildcard character\n4. When using the pipeline with **paired end data**, the path must use `{1,2}`\n notation to specify read pairs.\n5. Files names must be unique, having files with the same name, but in different\n directories is _not_ sufficient\n - This can happen when a library has been sequenced across two sequencers on\n the same lane. Either rename the file, try a symlink with a unique name, or\n merge the two FASTQ files prior input.\n6. Due to limitations of downstream tools (e.g. FastQC), sample IDs may be\n truncated after the first `.` in the name, Ensure file names are unique prior\n to this!\n7. For input BAM files you should provide a small decoy reference genome with\n pre-made indices, e.g. 
the human mtDNA or phiX genome, for the mandatory\n parameter `--fasta` in order to avoid long computational time for generating\n the index files of the reference genome, even if you do not actually need a\n reference genome for any downstream analyses.\n\n##### TSV Input Method\n\nAlternatively to the [direct input method](#direct-input-method), you can supply\nto `--input` a path to a TSV file that contains paths to FASTQ/BAM files and\nadditional metadata. This allows for more complex procedures such as merging of\nsequencing data across lanes, sequencing runs, sequencing configuration types,\nand samples.\n\n

[Figure: schematic]

\n\nThe use of the TSV `--input` method is recommended when performing\nmore complex procedures such as lane or library merging. You do not need to\nspecify `--single_end`, `--bam`, `--colour_chemistry`, `-udg_type` etc. when\nusing TSV input - this is defined within the TSV file itself. You can only\nsupply a single TSV per run (i.e. `--input '*.tsv'` will not work).\n\nThis TSV should look like the following:\n\n| Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM |\n|-------------|------------|------|------------------|--------|----------|--------------|---------------|----|----|-----|\n| JK2782 | JK2782 | 1 | 4 | PE | Mammoth | double | full | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz) | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz) | NA |\n| JK2802 | JK2802 | 2 | 2 | SE | Mammoth | double | full | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz) | [https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz](https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz) | NA |\n\nA template can be taken 
from\n[here](https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/TSV_template.tsv).\n\n> :warning: Cells **must not** contain spaces before or after strings, as this\n> will make the TSV unreadable by nextflow. Strings containing spaces should be\n> wrapped in quotes.\n\nWhen using TSV_input, nf-core/eager will merge FASTQ files of libraries with the\nsame `Library_ID` but different `Lanes` values after adapter clipping (and\nmerging), assuming all other metadata columns are the same. If you have the same\n`Library_ID` but with different `SeqType`, this will be merged directly after\nmapping prior BAM filtering. Finally, it will also merge BAM files with the same\n`Sample_ID` but different `Library_ID` after duplicate removal, but prior to\ngenotyping. Please see caveats to this below.\n\nColumn descriptions are as follows:\n\n- **Sample_Name:** A text string containing the name of a given sample of which\n there can be multiple libraries. All libraries with the same sample name and\n same SeqType will be merged after deduplication.\n- **Library_ID:** A text string containing a given library, which there can be\n multiple sequencing lanes (with the same SeqType).\n- **Lane:** A number indicating which lane the library was sequenced on. Files\n from the libraries sequenced on different lanes (and different SeqType) will\n be concatenated after read clipping and merging.\n- **Colour Chemistry** A number indicating whether the Illumina sequencer the\n library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour\n chemistry machine. This informs whether poly-G trimming (if turned on) should\n be performed.\n- **SeqType:** A text string of either 'PE' or 'SE', specifying paired end (with\n both an R1 [or forward] and R2 [or reverse]) and single end data (only R1\n [forward], or BAM). This will affect lane merging if different per library.\n- **Organism:** A text string of the organism name of the sample or 'NA'. 
This\n currently has no functionality and can be set to 'NA', but will affect\n lane/library merging if different per library\n- **Strandedness:** A text string indicating whether the library type is\n 'single' or 'double'. This will affect lane/library merging if different per\n library.\n- **UDG_Treatment:** A text string indicating whether the library was generated\n with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library\n merging if different per library.\n- **R1:** A text string of a file path pointing to a forward or R1 FASTQ file.\n This can be used with the R2 column. File names **must be unique**, even if\n they are in different directories.\n- **R2:** A text string of a file path pointing to a reverse or R2 FASTQ file,\n or 'NA' when single end data. This can be used with the R1 column. File names\n **must be unique**, even if they are in different directories.\n- **BAM:** A text string of a file path pointing to a BAM file, or 'NA'. Cannot\n be specified at the same time as R1 or R2, both of which should be set to 'NA'\n\nFor example, the following TSV table:\n\n| Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM |\n|-------------|------------|------|------------------|---------|----------|--------------|---------------|----------------------------------------------------------------|----------------------------------------------------------------|-----|\n| JK2782 | JK2782 | 7 | 4 | PE | Mammoth | double | full | data/JK2782_TGGCCGATCAACGA_L007_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA |\n| JK2782 | JK2782 | 8 | 4 | PE | Mammoth | double | full | data/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz | NA |\n| JK2802 | JK2802 | 7 | 4 | PE | Mammoth | double | full | data/JK2802_AGAATAACCTACCA_L007_R1_001.fastq.gz.tengrand.fq.gz | 
data/JK2802_AGAATAACCTACCA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA |\n| JK2802 | JK2802 | 8 | 4 | SE | Mammoth | double | full | data/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz | NA | NA |\n\nwill have the following effects:\n\n- After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8\n _with the same `SeqType`_ (and all other _metadata_ columns) will be\n concatenated together for each **Library**.\n- After mapping, and prior BAM filtering, BAM files with different\n `SeqType` (but with all other metadata columns the same) will be merged\n together for each **Library**.\n- After duplicate removal, BAM files with different `Library_ID`s but with the same\n `Sample_Name` and the same `UDG_Treatment` will be merged together.\n- If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and\n half-UDG ) will be merged with UDG-treated (untreated) BAMs, if they have the\n same `Sample_Name`.\n\nNote the following important points and limitations for setting up:\n\n- The TSV must use actual tabs (not spaces) between cells.\n- *File* names must be unique regardless of file path, due to risk of\n over-writing (see:\n [https://github.com/nextflow-io/nextflow/issues/470](https://github.com/nextflow-io/nextflow/issues/470)).\n - If it is 'too late' and you already have duplicate file names, a workaround is\n to concatenate the FASTQ files together and supply this to a nf-core/eager\n run. The only downside is that you will not get independent FASTQC results\n for each file.\n- Lane IDs must be unique for each sequencing of each library.\n - If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can\n give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will\n still be processed correctly.\n - This also applies to the SeqType column, i.e. 
with the example above, if one\n run is PE and one run is SE, you need to give fake lane IDs to one of the\n runs as well.\n- All _BAM_ files must be specified as `SE` under `SeqType`.\n - You should provide a small decoy reference genome with pre-made indices, e.g.\n the human mtDNA or phiX genome, for the mandatory parameter `--fasta` in\n order to avoid long computational time for generating the index files of the\n reference genome, even if you do not actually need a reference genome for any\n downstream analyses.\n- nf-core/eager will only merge multiple _lanes_ of sequencing runs with the\n same single-end or paired-end configuration\n- Accordingly nf-core/eager will not merge _lanes_ of FASTQs with BAM files\n (unless you use `--run_convertbam`), as only FASTQ files are lane-merged\n together.\n- Same libraries that are sequenced on different sequencing configurations (i.e\n single- and paired-end data), will be merged after mapping and will _always_\n be considered 'paired-end' during downstream processes\n - **Important** running DeDup in this context is _not_ recommended, as PE and\n SE data at the same position will _not_ be evaluated as duplicates.\n Therefore not all duplicates will be removed.\n - When you wish to run PE/SE data together `-dedupper markduplicates` is\n therefore preferred.\n - An error will be thrown if you try to merge both PE and SE and also supply\n `--skip_merging`.\n - If you truly want to mix SE data and PE data but using mate-pair info for PE\n mapping, please run FASTQ preprocessing mapping manually and supply BAM\n files for downstream processing by nf-core/eager\n - If you _regularly_ want to run the situation above, please leave a feature\n request on github.\n- DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on\n each unique library separately after deduplication (but prior same-treated\n library merging).\n- nf-core/eager functionality such as `--run_trim_bam` will be applied to only\n non-UDG 
(UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries.\n- Qualimap is run on each sample, after merging of libraries (i.e. your values\n will reflect the values of all libraries combined - after being damage trimmed\n etc.).\n- Genotyping will be typically performed on each `sample` independently, as\n normally all libraries will have been merged together. However, if you have a\n mixture of single-stranded and double-stranded libraries, you will normally\n need to genotype separately. In this case you **must** give each the SS and DS\n libraries _distinct_ `Sample_IDs`; otherwise you will receive a `file\n collision` error in steps such as `sexdeterrmine`, and then you will need to\n merge these yourself. We will consider changing this behaviour in the future\n if there is enough interest." }, "udg_type": { "type": "string", "default": "none", "description": "Specifies whether you have UDG treated libraries. Set to 'half' for partial treatment, or 'full' for UDG. If not set, libraries are assumed to have no UDG treatment ('none'). Not required for TSV input.", "fa_icon": "fas fa-vial", - "help_text": "Defines whether Uracil-DNA glycosylase (UDG) treatment was used to remove DNA\ndamage on the sequencing libraries.\n\nSpecify `'none'` if no treatment was performed. If you have partial UDG treated\ndata ([Rohland et al 2016](http://dx.doi.org/10.1098/rstb.2013.0624)), specify\n`'half'`. If you have complete UDG treated data ([Briggs et al.\n2010](https://doi.org/10.1093/nar/gkp1163)), specify `'full'`. \n\nWhen also using PMDtools specifying `'half'` will use a different model for DNA\ndamage assessment in PMDTools (PMDtools: `--UDGhalf`). 
Specify `'full'` and the\nPMDtools DNA damage assessment will use CpG context only (PMDtools: `--CpG`).\nDefault: `'none'`.\n\n> **Tip**: You should provide a small decoy reference genome with pre-made indices, e.g.\n> the human mtDNA genome, for the mandatory parameter `--fasta` in order to\n> avoid long computational time for generating the index files of the reference\n> genome, even if you do not actual need a reference genome for any downstream\n> analyses.", + "help_text": "Defines whether Uracil-DNA glycosylase (UDG) treatment was used to remove DNA\ndamage on the sequencing libraries.\n\nSpecify `'none'` if no treatment was performed. If you have partial UDG treated\ndata ([Rohland et al 2016](http://dx.doi.org/10.1098/rstb.2013.0624)), specify\n`'half'`. If you have complete UDG treated data ([Briggs et al.\n2010](https://doi.org/10.1093/nar/gkp1163)), specify `'full'`. \n\nWhen also using PMDtools specifying `'half'` will use a different model for DNA\ndamage assessment in PMDTools (PMDtools: `--UDGhalf`). Specify `'full'` and the\nPMDtools DNA damage assessment will use CpG context only (PMDtools: `--CpG`).\nDefault: `'none'`.\n\n> **Tip**: You should provide a small decoy reference genome with pre-made indices, e.g.\n> the human mtDNA genome, for the mandatory parameter `--fasta` in order to\n> avoid long computational time for generating the index files of the reference\n> genome, even if you do not actually need a reference genome for any downstream\n> analyses.", "enum": [ "none", "half",