nf-core/eager: Usage

nf-core/eager: Usage

General Nextflow info

Nextflow handles job submissions on SLURM or other environments, and supervises running the jobs. Thus the Nextflow process must run until the pipeline is finished. We recommend that you put the process running in the background through screen / tmux or similar tool. Alternatively you can run nextflow within a cluster job submitted your job scheduler.

To create a screen session:

screen -R eager2

To disconnect, press ctrl+a then d.

To reconnect, type :

screen -r eager2

to end the screen session while in it type exit.

Automatic Resubmission

By default, if a pipeline step fails, EAGER2 will resubmit the job with twice the amount of CPU and memory. This will occur two times before failing.

Help Message

To access the nextflow help message run: nextflow run -help

Running the pipeline

Before you start you should change into the output directory you wish your results to go in. When you start the nextflow job, it will place all the log files and 'working' folders in the current directory and NOT necessarily the directory the output files will be in.

The typical command for running the pipeline is as follows:

nextflow run nf-core/eager --input '*_R{1,2}.fastq.gz' --fasta 'some.fasta' -profile standard,docker

where the reads are from libraries of the same pairing.

This will launch the pipeline with the docker configuration profile. See below for more information about profiles.

Note that the pipeline will create the following files in your working directory:

work            # Directory containing the nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
               \# Other nextflow hidden files, eg. history of pipeline runs and old logs.

To see the the EAGER pipeline help message run: nextflow run nf-core/eager --help

By default, if a pipeline step fails, nf-core/eager will resubmit the job with twice the amount of CPU and memory. This will occur two times before failing.

If you want to configure your pipeline interactively using a graphical user interface, please visit https://nf-co.re/launch, select the eager pipeline and the version you intend to run and follow the on-screen instructions to create a config for your pipeline run.

Updating the pipeline

When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:

nextflow pull nf-core/eager

See below for more details about EAGER2 versioning.

Mandatory Arguments

`-profile`

Use this parameter to choose a configuration profile. Profiles can give configuration presets for different computing environments (e.g. schedulers, software environments, memory limits etc). Note that multiple profiles can be loaded, for example: -profile standard,docker - the order of arguments is important! The first entry takes precendence over the others, e.g. if a setting is set by both the first and second profile, the first entry will be used and the second entry ignored.

Important: If running EAGER2 on a cluster - ask your system administrator what profile to use.

For more details on how to set up your own private profile, please see installation.

Basic profiles These are basic profiles which primarily define where you derive the pipeline's software packages from. These are typically the profiles you would use if you are running the pipeline on your own PC (vs. a HPC cluster - see below).

awsbatch
- A generic configuration profile to be used with AWS Batch.
conda
- Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker or Singularity.
- A generic configuration profile to be used with conda
- Pulls most software from Bioconda
docker
- A generic configuration profile to be used with Docker
- Pulls software from dockerhub: nfcore/eager
singularity
- A generic configuration profile to be used with Singularity
- Pulls software from singularity-hub
test
- A profile with a complete configuration for automated testing
- Includes links to test data so needs no other parameters
none
- No configuration at all. Useful if you want to build your own config from scratch and want to avoid loading in the default base config profile (not recommended).

Institution Specific Profiles These are profiles specific to certain HPC clusters, and are centrally maintained at nf-core/configs. Those listed below are regular users of EAGER2, if you don't see your own institution here check the nf-core/configs repository.

uzh
- A profile for the University of Zurich Research Cloud
- Loads Singularity and defines appropriate resources for running the pipeline.
binac
- A profile for the BinAC cluster at the University of Tuebingen 0 Loads Singularity and defines appropriate resources for running the pipeline
shh
- A profiler for the S/CDAG cluster at the Department of Archaeogenetics of the Max Planck Institute for the Science of Human History
- Loads Singularity and defines appropriate resources for running the pipeline

Pipeline Specific Institution Profiles There are also pipeline-specific institution profiles. I.e., we can also offer a profile which sets special resource settings to specific steps of the pipeline, which may not apply to all pipelines. This can be seen at nf-core/configs under conf/pipelines/eager/.

We currently offer a EAGER specific profile for

shh
- A profiler for the S/CDAG cluster at the Department of Archaeogenetics of the Max Planck Institute for the Science of Human History
- In addition to the nf-core wide profile, this also sets the MALT resources to match our commonly used databases

Further institutions can be added at nf-core/configs. Please ask the eager developers to add your institution to the list above, if you add one!

`--input`

There are two possible ways of supplying input sequencing data to nf-core/eager. The most efficient but more simplistic is supplying direct paths (with wildcards) to your FASTQ or BAM files, with each file or pair being considered a single library and each one run independently. TSV input requires creation of an extra file by the user and extra metadta, but allows more powerful lane and library merging.

Direct Input Method

This method is where you specify with --input, the path locations of FASTQ (optionally gzipped) or BAM file(s). This option is mutually exclusive to the TSV input method which is used for more complex input configurations such as lane and library merging.

When using the direct method of --input you can specify one or multiple samples in one or more directories files. File names must be unique, even if in different directories.

By default, the pipeline assumes you have paired-end data. If you want to run single-end data you must specify --single_end

For example, for a single set of FASTQs, or multiple files paired-end FASTQ files in one directory, you can specify:

--input 'path/to/data/sample_*_{1,2}.fastq.gz'

If you have multiple files in different directories, you can use additional wildcards (*) e.g.:

--input 'path/to/data/*/sample_*_{1,2}.fastq.gz'

⚠️ It is not possible to run a mixture of single-end and paired-end files in one run with the paths --input method! Please see the TSV input method for possibilities.

Please note the following requirements:

Valid file extensions: .fastq.gz, .fastq, .fq.gz, .fq, .bam.
The path must be enclosed in quotes
The path must have at least one * wildcard character
When using the pipeline with paired end data, the path must use {1,2} notation to specify read pairs.
Files names must be unique, having files with the same name, but in different directories is not sufficient
- This can happen when a library has been sequenced across two sequencers on the same lane. Either rename the file, try a symlink with a unique name, or merge the two FASTQ files prior input.

TSV Input Method

Alternatively to the direct input method, you can supply to --input a path to a TSV file that contains paths to FASTQ/BAM files and additional metadata. This allows for more complex procedures such as merging of sequencing data across lanes, sequencing runs, sequencing configuration types, and samples.

The use of the TSV --input method is recommended to be used when performing more complex procedures such as lane or library merging. You do not need to specify --single_end, --bam, --colour_chemistry, -udg_type etc. when using TSV input - this is defined within the TSV file itself.

This TSV should look like the following:

Sample_Name	Library_ID	Lane	Colour_Chemistry	SeqType	Organism	Strandedness	UDG_Treatment	R1	R2	BAM
JK2782	JK2782	1	4	PE	Mammoth	double	full	https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz	https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz	NA
JK2802	JK2802	2	2	SE	Mammoth	double	full	https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz	https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz	NA

A template can be taken from here.

⚠️ Cells must not contain spaces before or after strings, as this will make the TSV unreadable by nextflow. Strings containing spaces should be wrapped in quotes.

When using TSV_input, nf-core/eager will merge FASTQ files of libraries with the same Library_ID but different Lanes after adapter clipping (and merging), assuming all other metadata columns are the same. If you have the same Library_ID but with different SeqType, this will be merged directly after mapping prior BAM filtering. Finally, it will also merge BAM files with the same Sample_ID but different Library_ID after duplicate removal, but prior to genotyping. Please see caveats to this below.

Column descriptions are as follows:

Sample_Name: A text string containing the name of a given sample of which there can be multiple libraries. All libraries with the same sample name and same SeqType will be merged after deduplication.
Library_ID: A text string containing a given library, which there can be multiple sequencing lanes (with the same SeqType).
Lane: A number indicating which lane the library was sequenced on. Files from the libraries sequenced on different lanes (and different SeqType) will be concatenated after read clipping and merging.
Colour Chemistry A number indicating whether the Illumina sequencing machine the library was sequenced on was 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq). This informs whether poly-G trimming (if turned on) should be performed.
SeqType: A text string of either 'PE' or 'SE', specifying paired end (with both an R1 [or forward] and R2 [or reverse]) and single end data (only R1 [forward], or BAM). This will affect lane merging if different per library.
Organism: A text string of the organism name of the sample or 'NA'. This currently has no functionality and can be set to 'NA', but will affect lane/library merging if different per library
Strandedness: A text string indicating whether the library type is 'single' or 'double'. This currently has no functionality, but will affect lane/library merging if different per library.
UDG_Treatment: A text string indicating whether the library was generated with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library merging if different per library.
R1: A text string of a file path pointing to a forward or R1 FASTQ file. This can be used with the R2 column. File names must be unique, even if in different directories.
R2: A text string of a file path pointing to a reverse or R2 FASTQ file, or 'NA' when single end data. This can be used with the R1 column. File names must be unique, even if in different directories.
BAM: A text string of a file path pointing to a BAM file, or 'NA'. Cannot be specified at the same time as R1 or R2, both of which should be set to 'NA'

For example, with the following:

Sample_Name	Library_ID	Lane	Colour_Chemistry	SeqType	Organism	Strandedness	UDG_Treatment	R1	R2	BAM
JK2782	JK2782	7	4	PE	Mammoth	double	full	data/JK2782_TGGCCGATCAACGA_L007_R1_001.fastq.gz.tengrand.fq.gz	data/JK2782_TGGCCGATCAACGA_L007_R2_001.fastq.gz.tengrand.fq.gz	NA
JK2782	JK2782	8	4	PE	Mammoth	double	full	data/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz	data/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz	NA
JK2802	JK2802	7	4	PE	Mammoth	double	full	data/JK2802_AGAATAACCTACCA_L007_R1_001.fastq.gz.tengrand.fq.gz	data/JK2802_AGAATAACCTACCA_L007_R2_001.fastq.gz.tengrand.fq.gz	NA
JK2802	JK2802	8	4	SE	Mammoth	double	full	data/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz	NA	NA

In this context the following will occur:

After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8 with the same SeqType (and all other metadata columns) will be concatenated together for each Library.
After mapping, and prior BAM filtering, BAM files with the same with different SeqType (but with all other metadata columns th esame) will be merged together for each Library.
After duplicate removal, BAM files with Library_IDs with the same Sample_Name and the same UDG_Treatment will be merged together.
If BAM trimming is turned, all post-trimming BAMs (i.e. non-UDG and half-UDG ) will be merged with UDG-treated (untreated) BAMs, if they have the same Sample_Name.

Note the following important points and limitations for setting up:

The TSV must use actual tabs (not spaces) between cells.
File names must be unique irregardless of file path, due to risk of over-writing (see: nextflow-io/nextflow#470).
- If it is 'too late' and already have duplicate file names, a work around is to concatenate the FASTQ files together and supply this to a nf-core/eager run. The only downside is that you will not get independent FASTQC results for each file.
Lane IDs must be unique for each sequencing of each library.
- If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will still be processed correctly.
- This also applies to the SeqType column, i.e. with the example above, if one run is PE and one run is SE, you need to give fake lane IDs to one of the runs as well.
All BAM files must be specified as SE under SeqType.
nf-core/eager will only merge multiple lanes of sequencing runs with the same single-end or paired-end configuration
Accordingly nf-core/eager will not merge lanes of FASTQs with BAM files (unless you use --run_convertbam), as only FASTQ files are lane-merged together.
Same libraries that are sequenced on different sequencing configurations (i.e single- and paired-end data), will be merged after mapping and will always be considered 'paired-end' during downstream processes
- Important running DeDup in this context is not recommended, as PE and SE data at the same position will not be evaluated as duplicates. Therefore not all duplicates will be removed.
- When you wish to run PE/SE data together -dedupper markduplicates is therefore prefered.
- An error will be thrown if you try to merge both PE and SE and also supply --skip_merging.
- If truly you want to mix SE data and PE data but using mate-pair info for PE mapping, please run FASTQ preprocessing mapping manually and supply BAM files for downstream processing by nf-core/eager
- If you regularly want to run the situation above, please leave an feature request on github.
DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on each unique library separately after deduplication (but prior same-treated library merging).
nf-core/eager functionality such as --run_trim_bam will be applied to only non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries.
Qualimap is run on each sample, after merging of libraries (i.e. your values will reflect the values of all libraries combined - after being damage trimmed etc.).
Genotyping will typically be performed on each sample independently as normally all libraries will have been merged together. However, if you have a mixture of single-stranded and double-stranded libraries, you will normally need to genotype separately. In this case you must give each the SS and DS libraries distinct Sample_IDs otherwise you will recieve a file collision error in steps such as sexdeterrmine, and merge these yourself. We will consider changing this behaviour in the future if there is enough interest.

`--bam`

Specifies the input file type to --input is in BAM format. This will automatically also apply --single_end.

Only required when using the 'Path' method of --input.

`--single_stranded`

Indicates libraries are single stranded.

Currently only affects MALTExtract, where it will switch on damage patterns calculation mode to single-stranded. Default: false.

`--colour_chemistry`

Specifies which Illumina colour chemistry was a library was sequenced with. This informs whether to perform poly-G trimming (if --complexity_filter_poly_g is also supplied). Only 2 colour chemistry sequencers (e.g. NextSeq or NovaSeq) can generate uncertain poly-G tails (due to 'G' being indicated via a no-colour detection). Default is '4' to indicate e.g. HiSeq or MiSeq platforms, which do not require poly-G trimming. Options: 2, 4. Default: 4

Only required when using the 'Path' method of --input.

`--fasta`

You specify the full path to your reference genome here. The FASTA file can have any file suffix, such as .fasta, .fna, .fa, .FastA etc. You may also supply a gzipped reference files, which will be unzipped automatically for you.

For example:

--fasta '/<path>/<to>/my_reference.fasta'

If you don't specify appropriate --bwa_index, --fasta_index parameters (see below), the pipeline will create these indices for you automatically. Note that you can save the indices created for you for later by giving the --save_reference flag. You must select either a --fasta or --genome

`--genome` (using iGenomes)

Alternatively, the pipeline config files come bundled with paths to the Illumina iGenomes reference index files. If running with docker or AWS, the configuration is set up to use the AWS-iGenomes resource.

There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the --genome flag.

You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:

Human
- --genome GRCh37
- --genome GRCh38
Mouse *
- --genome GRCm38
Drosophila *
- --genome BDGP6
S. cerevisiae *
- --genome 'R64-1-1'

* Not bundled with nf-core eager by default.

Note that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the iGenomes resource. See the Nextflow documentation for instructions on where to save such a file.

The syntax for this reference configuration is as follows:

params {
  genomes {
    'GRCh37' {
      fasta   = '<path to the iGenomes genome fasta file>'
    }
    // Any number of additional genomes, key is used with --genome
  }
}

You must select either a --fasta or --genome

Output Directories

`--outdir`

The output directory where the results will be saved.

`-w / -work-dir`

The output directory where intermediate files will be saved. It is highly recommended that this is the same path as --outdir, otherwise you may 'lose' your intermediate files if you need to re-run a pipeline. By default, if this flag is not given, the intermediate files will be saved in a work/ and .nextflow/ directory from wherever you have run EAGER from.

Optional Reference Options

We provide various options for indexing of different types of reference genomes. EAGER can index reference genomes for you (with options to save these for other analysis), but you can also supply your pre-made indices.

Supplying pre-made indices saves time in pipeline execution and is especially advised when running multiple times on the same cluster system for example. You can even add a resource specific profile that sets paths to pre-computed reference genomes, saving even time when specifying these.

`--large_ref`

This parameter is required to be set for large reference genomes. If your reference genome is larger than 3.5GB, the samtools index calls in the pipeline need to generate CSI indices instead of BAI indices to accompensate for the size of the reference genome. This parameter is not required for smaller references (including a human hg19 or grch37/grch38 reference), but >4GB genomes have been shown to need CSI indices. Default: off

`--save_reference`

Use this if you do not have pre-made reference FASTA indices for bwa, samtools and picard. If you turn this on, the indices EAGER2 generates for you will be stored in the <your_output_dir>/results/reference_genomes for you.

`--bwa_index`

If you want to use pre-existing bwa index indices, please supply the path and file to the FASTA you also specified in --fasta (see above). EAGER2 will automagically detect the index files by searching for the FASTA filename with the corresponding bwa index file suffixes.

For example:

nextflow run nf-core/eager \
-profile test,docker \
--input '*{R1,R2}*.fq.gz'
--fasta 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta' \
--bwa_index 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta'

bwa index does not give you an option to supply alternative suffixes/names for these indices. Thus, the file names generated by this command must not be changed, otherwise EAGER2 will not be able to find them.

`--seq_dict`

If you want to use a pre-existing picard CreateSequenceDictionary dictionary file, use this to specify the required .dict file for the selected reference genome.

For example:

--seq_dict 'Mammoth_MT_Krause.dict'

`--fasta_index`

If you want to use a pre-existing samtools faidx index, Use this to specify the required FASTA index file for the selected reference genome. This should be generated by samtools faidx and has a file suffix of .fai

For example:

--fasta_index 'Mammoth_MT_Krause.fasta.fai'

Other run specific parameters

`-r`

By default, EAGER2 will use the latest version of the pipeline that is downloaded on your system. However, it's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.

First, go to the nf-core/eager releases page and find the latest version number - numeric only (eg. 2.0). Then specify this when running the pipeline with -r (one hyphen) - eg. -r 2.0.

This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.

Additionally, EAGER pipeline releases are named after Swabian German Cities. The first release V2.0 is named "Kaufbeuren". Future releases are named after cities named in the Swabian league of Cities.

`--max_memory`

Use to set a top-limit for the default memory requirement for each process. Should be a string in the format integer-unit. eg. --max_memory '8.GB'. If not specified, will be taken from the configuration in the -profile flag.

`--max_time`

Use to set a top-limit for the default time requirement for each process. Should be a string in the format integer-unit. eg. --max_time '2.h'. If not specified, will be taken from the configuration in the -profile flag.

`--max_cpus`

When not using a instutite specific -profile, you can use this parameter to set a top-limit for the default CPU requirement for each process. This is not the maximum number of CPUs that can be used for the whole pipeline, but the maximum number of CPUs each program can use for each program submission (known as a process).

Do not set this higher than what is available on your workstation or computing node can provide. If you're unsure, ask your local IT administrator for details on compute node capabilities! Should be a string in the format integer-unit. eg. --max_cpus 1. If not specified, will be taken from the configuration in the -profile flag.

`--email`

Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config) then you don't need to specify this on the command line for every run.

Note that this functionality requires either mail or sendmail to be installed on your system.

`--plaintext_email`

Set to receive plain-text e-mails instead of HTML formatted.

`-name`

Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.

This is used in the MultiQC report (if not default) and in the summary HTML / e-mail (always).

NB: Single hyphen (core Nextflow option)

`-resume`

Specify this when restarting a pipeline. Nextflow will used cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.

You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.

NB: Single hyphen (core Nextflow option)

`-c`

Specify the path to a specific nextflow config file (this is a core NextFlow command).

NB: Single hyphen (core Nextflow option)

Note - you can use this to override pipeline defaults.

`--monochrome_logs`

Set to disable colourful command line output and live life in monochrome.

`--multiqc_config`

Specify a path to a custom MultiQC configuration file.

`--custom_config_version`

Provide git commit id for custom Institutional configs hosted at nf-core/configs. This was implemented for reproducibility purposes. Default is set to master.

\#\# Download and use config file with following git commid id
--custom_config_version d52db660777c4bf36546ddb188ec530c3ada1b96

Adjustable parameters for nf-core/eager

This part of the documentation contains a list of user-adjustable parameters in nf-core/eager. You can specify any of these parameters on the command line when calling the pipeline by simply prefixing the respective parameter with a double dash --.

Step skipping parameters

Some of the steps in the pipeline can be executed optionally. If you specify specific steps to be skipped, there won't be any output related to these modules.

`--skip_fastqc`

Turns off FastQC pre- and post-Adapter Removal, to speed up the pipeline. Use of this flag is most common when data has been previously pre-processed and the post-Adapter Removal mapped reads are being re-mapped to a new reference genome.

`--skip_adapterremoval`

Turns off adaptor trimming and paired-end read merging. Equivalent to setting both --skip_collapse and --skip_trim.

`--skip_preseq`

Turns off the computation of library complexity estimation.

`--skip_deduplication`

Turns off duplicate removal methods DeDup and MarkDuplicates respectively. No duplicates will be removed on any data in the pipeline.

`--skip_damage_calculation`

Turns off the DamageProfiler module to compute DNA damage profiles.

`--skip_qualimap`

Turns off QualiMap and thus does not compute coverage and other mapping metrics.

BAM Conversion Options

`--run_convertinputbam`

Allows you to convert a input BAM file back to FASTQ for downstream processing. Note this is required if you need to perform AdapterRemoval and/or polyG clipping.

If not turned on, BAMs will automatically be sent to post-mapping steps.

Complexity Filtering Options

More details on can be seen in the fastp documentation

If using TSV input, this is performed per lane separately.

`--complexity_filter_poly_g`

Performs a poly-G tail removal step in the beginning of the pipeline using fastp, if turned on. This can be useful for trimming ploy-G tails from short-fragments sequenced on two-colour Illumina chemistry such as NextSeqs (where no-fluorescence is read as a G on two-colour chemistry), which can inflate reported GC content values.

`--complexity_filter_poly_g_min`

This option can be used to define the minimum length of a poly-G tail to begin low complexity trimming. By default, this is set to a value of 10 unless the user has chosen something specifically using this option.

Adapter Clipping and Merging Options

These options handle various parts of adapter clipping and read merging steps.

More details can be seen in the AdapterRemoval documentation

If using TSV input, this is performed per lane separately.

`--clip_forward_adaptor`

Defines the adapter sequence to be used for the forward read. By default, this is set to 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'.

`--clip_reverse_adaptor`

Defines the adapter sequence to be used for the reverse read in paired end sequencing projects. This is set to 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA' by default.

`--clip_readlength`

Defines the minimum read length that is required for reads after merging to be considered for downstream analysis after read merging. Default is 30.

`--clip_min_read_quality`

Defines the minimum read quality per base that is required for a base to be kept. Individual bases at the ends of reads falling below this threshold will be clipped off. Default is set to 20.

`--clip_min_adap_overlap`

Sets the minimum overlap between two reads when read merging is performed. Default is set to 1 base overlap.

`--skip_collapse`

Turns off the paired-end read merging.

For example

--skip_collapse  --input '*_{R1,R2}_*.fastq'

It is important to use the paired-end wildcard globbing as --skip_collapse can only be used on paired-end data! :warning: If you run this and also with --clip_readlength set to something (as is by default), you may end up removing single reads from either the pair1 or pair2 file. These will be NOT be mapped when aligning with either bwa or bowtie, as both can only accept one (forward) or two (forward and reverse) FASTQs as input.

`--skip_trim`

Turns off adaptor and quality trimming.

For example:

--skip_trim  --input '*.fastq'

`--preserve5p`

Turns off quality based trimming at the 5p end of reads when any of the --trimns, --trimqualities, or --trimwindows options are used. Only 3p end of reads will be removed.

This also entirely disables quality based trimming of collapsed reads, since both ends of these are informative for PCR duplicate filtering. Described here.

`--mergedonly`

Specify that only merged reads are sent downstream for analysis.

Singletons (i.e. reads missing a pair), or un-merged reads (where there wasn't sufficient overlap) are discarded.

You may want to use this if you want ensure only the best quality reads for your analysis, but with the penalty of potentially losing still valid data (even if some reads have slightly lower quality). It is highly recommended when using --dedupper 'dedup' (see below).

Read Mapping Parameters

If using TSV input, mapping is performed library, i.e. after lane merging.

`--mapper`

Specify which mapping tool to use. Options are BWA aln ('bwaaln'), BWA mem ('bwamem'), circularmapper ('circularmapper'), or bowtie2 (bowtie2). bwa aln is the default and highly suited for short read ancient DNA. bwa mem can be quite useful for modern DNA, but is rarely used in projects for ancient DNA. CircularMapper enhances the mapping procedure to circular references, using the BWA algorithm but utilizing a extend-remap procedure (see Peltzer et al 2016, Genome Biology for details). Bowtie2 is similar to bwa aln, and has recently been suggested to provide slightly better results under certain conditions (Poullet and Orlando 2020), as well as providing extra functionality (such as FASTQ trimming). Default is 'bwaaln'

More documentation can be seen for each tool under:

BWA (default)

These parameters configure mapping algorithm parameters.

`--bwaalnn`

Configures the bwa aln -n parameter, defining how many mismatches are allowed in a read. By default set to 0.04 (following recommendations of Schubert et al. (2012 BMC Genomics)), if you're uncertain what to set check out this Shiny App for more information on how to set this parameter efficiently.

`--bwaalnk`

Configures the bwa aln -k parameter for the seeding phase in the mapping algorithm. Default is set to 2.

`--bwaalnl`

Configures the length of the seed used in bwa aln -l. Default is set to be 'turned off' at recommendation the of Schubert et al. (2012 BMC Genomics)) for ancient DNA with 1024.

Note: Despite being recommended, turning off seeding can result in long runtimes!

CircularMapper

`--circularextension`

The number of bases to extend the reference genome with. By default this is set to 500 if not specified otherwise.

`--circulartarget`

The chromosome in your FastA reference that you'd like to be treated as circular. By default this is set to MT but can be configured to match any other chromosome.

`--circularfilter`

If you want to filter out reads that don't map to a circular chromosome, turn this on. By default this option is turned off.

Bowtie2

`--bt2_alignmode`

The type of read alignment to use. Options are 'local' or 'end-to-end'. Local allows only partial alignment of read, with ends of reads possibly 'soft-clipped' (i.e. remain unaligned/ignored), if the soft-clipped alignment provides best alignment score. End-to-end requires all nucleotides to be aligned. Default is 'local', following Cahill et al (2018) and (Poullet and Orlando 2020).

`--bt2_sensitivity`

The Bowtie2 'preset' to use. Options: 'no-preset' 'very-fast', 'fast', 'sensitive', or 'very-sensitive'. These strings apply to both --bt2_alignmode options. See the Bowtie2 manual for actual settings. Default is 'sensitive' (following Poullet and Orlando (2020), when running damaged-data without UDG treatment)

`--bt2n`

The number of mismatches allowed in the seed during seed-and-extend procedure of Bowtie2. This will override any values set with --bt2_sensitivity. Can either be 0 or 1. Default: 0 (i.e. use--bt2_sensitivity defaults).

`--bt2l`

The length of the seed substring to use during seeding. This will override any values set with --bt2_sensitivity. Default: 0 (i.e. use--bt2_sensitivity defaults: 20 for local and 22 for end-to-end.

`-bt2_trim5`

Number of bases to trim of 5' (left) end of read prior alignment. Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0

`-bt2_trim3`

Number of bases to trim of 3' (right) end of read prior alignment. Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0.

Mapped Reads Stripping

These parameters are used for removing mapped reads from the original input FASTQ files, usually in the context of uploading the original FASTQ files to a public read archive (NCBI SRA/EBI ENA).

These flags will produce FASTQ files almost identical to your input files, except that reads with the same read ID as one found in the mapped bam file, are either removed or 'masked' (every base replaced with Ns).

This functionality allows you to provide other researchers who wish to re-use your data to apply their own adapter removal/read merging procedures, while maintaining anonyminity for sample donors - for example with microbiome research.

If using TSV input, stripping is performed library, i.e. after lane merging.

`--strip_input_fastq`

Create pre-Adapter Removal FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data)

`--strip_mode`

Read removal mode. Strip mapped reads completely ('strip') or just replace mapped reads sequence by N ('replace')

Read Filtering and Conversion Parameters

Users can configure to keep/discard/extract certain groups of reads efficiently in the nf-core/eager pipeline.

If using TSV input, filtering is performed library, i.e. after lane merging.

`--run_bam_filtering`

Turns on the bam filtering module for either mapping quality filtering or unmapped read treatment.

`--bam_unmapped_type`

Defines how to proceed with unmapped reads: 'discard' removes all unmapped reads, keep keeps both unmapped and mapped reads in the same BAM file, 'bam' keeps unmapped reads as BAM file, 'fastq' keeps unmapped reads as FastQ file, both keeps both BAM and FASTQ files. Default is discard. keep is what would happen if --run_bam_filtering was not supplied.

Note that in all cases, if --bam_mapping_quality_threshold is also supplied, mapping quality filtering will still occur on the mapped reads.

`--bam_mapping_quality_threshold`

Specify a mapping quality threshold for mapped reads to be kept for downstream analysis. By default keeps all reads and is therefore set to 0 (basically doesn't filter anything).

Read DeDuplication Parameters

If using TSV input, deduplication is performed library, i.e. after lane merging.

`--dedupper`

Sets the duplicate read removal tool. By default uses markduplicates from Picard. Alternatively an ancient DNA specific read deduplication tool 'dedup' (Pelter et al. 2016) is offered.

This utilises both ends of paired-end data to remove duplicates (i.e. true exact duplicates, as markduplicates will over-zealously deduplicate anything with the same starting position even if the ends are different). DeDup should only be used solely on paired-end data otherwise suboptimal deduplication can occur if applied to either single-end or a mix of single-end/paired-end data.

Note that if you run without the --mergedonly flag for AdapterRemoval, DeDup will likely fail. If you absolutely want to use both PE and SE data, you can supply the --dedup_all_merged flag to consider singletons to also be merged paired-end reads. This may result in over-zealous deduplication.

`--dedup_all_merged`

Sets DeDup to treat all reads as merged reads. This is useful if reads are for example not prefixed with M_ in all cases. Therefore, this can be used as a workaround when also using a mixture of paired-end and single-end data, however this is not recommended (see above).

Library Complexity Estimation Parameters

`--preseq_step_size`

Can be used to configure the step size of Preseqs c_curve method. Can be useful when only few and thus shallow sequencing results are used for extrapolation.

DNA Damage Assessment Parameters

More documentation can be seen in the follow links for:

If using TSV input, DamageProfiler is performed library, i.e. after lane merging. PMDtools and BAM Trimming is run after library merging of same-named library BAMs that have the same type of UDG treatment. BAM Trimming is only performed on non-UDG and half-UDG treated data.

`--udg_type`

Defines whether Uracil-DNA glycosylase (UDG) treatment was used to repair DNA damage on the sequencing libraries.

Specify 'none' if no treatment was performed. If you have partial UDG treated data (Rohland et al 2016), specify 'half'. If you have complete UDG treated data (Briggs et al. 2010), specify 'full'. When also using PMDtools 'half' will use a different model for DNA damage assessment in PMDTools. Specify the parameter with 'full' and the PMDtools DNA damage assessment will use CpG context only. Default: 'none'.

`--damageprofiler_length`

Specifies the length filter for DamageProfiler. By default set to 100.

`--damageprofiler_threshold`

Specifies the length of the read start and end to be considered for profile generation in DamageProfiler. By default set to 15 bases.

`--damageprofiler_yaxis`

Specifies what the maximum misincorporation frequency should be displayed as, in the DamageProfiler damage plot. This is set to 0.30 (i.e. 30%) by default as this matches the popular mapDamage2.0 program. However, the default behaviour of DamageProfiler is to 'autoscale' the y-axis maximum to zoom in on any possible damage that may occur (e.g. if the damage is about 10%, the highest value on the y-axis would be set to 0.12). This 'autoscale' behaviour can be turned on by specifying the number to 0. Default: 0.30.

`--run_pmdtools`

Specifies to run PMDTools for damage based read filtering and assessment of DNA damage in sequencing libraries. By default turned off.

`--pmdtools_range`

Specifies the range in which to consider DNA damage from the ends of reads. By default set to 10.

`--pmdtools_threshold`

Specifies the PMDScore threshold to use in the pipeline when filtering BAM files for DNA damage. Only reads which surpass this damage score are considered for downstream DNA analysis. By default set to 3 if not set specifically by the user.

`--pmdtools_reference_mask`

Can be used to set a path to a reference genome mask for PMDTools.

`--pmdtools_max_reads`

The maximum number of reads used for damage assessment in PMDtools. Can be used to significantly reduce the amount of time required for damage assessment in PMDTools. Note that a too low value can also obtain incorrect results.

BAM Trimming Parameters

For some library preparation protocols, users might want to clip off damaged bases before applying genotyping methods. This can be done in nf-core/eager automatically by turning on the --trim_bam parameter.

More documentation can be seen in the bamUtil documentation

`--run_trim_bam`

Turns on the BAM trimming method. Trims off [n] bases from reads in the deduplicated BAM file Damage assessment in PMDTools or DamageProfiler remains untouched, as data is routed through this independently. BAM trimming os typically performed to reduce errors during genotyping that can be caused by aDNA damage.

BAM trimming will only be performed on libraries indicated as --udg_type 'none' or --udg_type 'half'. Complete UDG treatment ('full') should have removed all damage. The amount of bases that will be trimmed off can be set separately for libraries with --udg_type 'none' and 'half' (see --bamutils_clip_half_udg_left / --bamutils_clip_half_udg_right / --bamutils_clip_none_udg_left / --bamutils_clip_none_udg_right).

Note: additional artefacts such as bar-codes or adapters that could potentially also be trimmed should be removed prior mapping.

`--bamutils_clip_half_udg_left` / `--bamutils_clip_half_udg_right`

Default set to 1 and clips off one base of the left or right side of reads from libraries whose UDG treatment is set to half. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).

`--bamutils_clip_none_udg_left` / `--bamutils_clip_none_udg_right`

Default set to 1 and clips off one base of the left or right side of reads from libraries whose UDG treatment is set to none. Note that reverse reads will automatically be clipped off at the reverse side with this (automatically reverses left and right for the reverse read).

`--bamutils_softclip`

By default, nf-core/eager uses hard clipping and sets clipped bases to N with quality ! in the BAM output. Turn this on to use soft-clipping instead, masking reads at the read ends respectively using the CIGAR string.

Captured Library Parameters

These parameters are required in some cases, e.g. when performing in-solution SNP capture protocols (390K,1240K, ...) for population genetics for example. Make sure to specify the required parameters in such cases.

`--snpcapture` false

This is by default set to false, but can be turned on to calculate on target metrics automatically for you. Note, that this requires setting --bedfile with the target SNPs simultaneously.

`--bedfile`

Can be used to set a path to a BED file (3/6 column format) to calculate capture target efficiency on the fly. Will not be used without --bedfile set as parameter.

Feature Annotation Statistics

If you're interested in looking at coverage stats for certain features on your reference such as genes, SNPs etc., you can use the following bedtools module for this purpose.

More documentation on bedtools can be seen in the bedtools documentation

If using TSV input, bedtools is run after library merging of same-named library BAMs that have the same type of UDG treatment.

`--run_bedtools_coverage`

Specifies to turn on the bedtools module, producing statistics for breadth (or percent coverage), and depth (or X fold) coverages.

`--anno_file`

Specify the path to a GFF/BED containing the feature coordinates (or any acceptable input for bedtools coverage). Must be in quotes.

Genotyping Parameters

There are options for different genotypers (or genotype likelihood calculators) to be used. We suggest you the documentation of each tool to find the ones that suit your needs.

Documentation for each tool:

If using TSV input, genotyping is performed per sample (i.e. after all types of libraries are merged), except pileupCaller which gathers all double-stranded and single-stranded (same-type merged) libraries respectively.

`--run_genotyping`

Turns on genotyping to run on all post-dedup and downstream BAMs. For example if --run_pmdtools and --trim_bam are both supplied, the genotyper will be run on all three BAM files i.e. post-deduplication, post-pmd and post-trimmed BAM files.

`--genotyping_tool`

Specifies which genotyper to use. Current options are: GATK (v3.5) UnifiedGenotyper or GATK Haplotype Caller (v4); and the FreeBayes Caller. Specify 'ug', 'hc', 'freebayes', 'pileupcaller' and 'angsd' respectively.

NB that while UnifiedGenotyper is more suitable for low-coverage ancient DNA (HaplotypeCaller does de novo assembly around each variant site), it is officially deprecated by the Broad Institute and is only accessible by an archived version not properly available on conda. Therefore if specifying 'ug', will need to supply a GATK 3.5 -jar to the parameter gatk_ug_jar. Note that this means the pipline is not fully reproducible in this configuration, unless you personally supply the .jar file.

`--genotyping_source`

Indicates which BAM file to use for genotyping, depending on what BAM processing modules you have turned on. Options are: 'raw' for mapped only, filtered, or DeDup BAMs (with priority right to left); 'trimmed' (for base clipped BAMs); 'pmd' (for pmdtools output). Default is: 'raw'.

`--gatk_ug_jar`

Specify a path to a local copy of a GATK 3.5 .jar file, preferably version '3.5-0-g36282e4'. The download location of this may be available from the GATK forums of the Broad Institute.

You must manually report your version of GATK 3.5 in publications/MultiQC as it is not included in our container.

`--gatk_call_conf`

If selected a GATK genotyper phred-scaled confidence threshold of a given SNP/INDEL call. Default: 30

`--gatk_ploidy`

If selected a GATK genotyper, what is the ploidy of your reference organism. E.g. do you want to allow heterozygous calls from >= diploid organisms. Default: 2

`--gatk_dbsnp`

(Optional)Specify VCF file for output VCF SNP annotation e.g. if you want annotate your VCF file with 'rs' SNP IDs. Check GATK documentation for more information. Gzip not accepted.

`--gatk_ug_out_mode`

If selected the GATK genotyper UnifiedGenotyper, what type of VCF to create, i.e. produce calls for every site or just confidence sites. Options: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_SITES'. Default: 'EMIT_VARIANTS_ONLY'

`--gatk_hc_out_mode`

If selected the GATK genotyper HaplotypeCaller, what type of VCF to create, i.e. produce calls for every site or just confidence sites. Options: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_ACTIVE_SITES'. Default: 'EMIT_VARIANTS_ONLY'

`--gatk_ug_genotype_model`

If selected GATK UnifiedGenotyper, which likelihood model to follow, i.e. whether to call use SNPs or INDELS etc. Options: 'SNP', 'INDEL', 'BOTH', 'GENERALPLOIDYSNP', 'GENERALPLOIDYINDEL'. Default: 'SNP'

`--gatk_hc_emitrefconf`

If selected GATK HaplotypeCaller, mode for emitting reference confidence calls. Options: 'NONE', 'BP_RESOLUTION', 'GVCF'. Default: 'GVCF'

`--gatk_ug_keep_realign_bam`

If provided, this will put into the output folder the BAMs that have realigned reads (with GATK's (v3) IndelRealigner) around possible variants for improved genotyping.

These BAMs will be stored in the same folder as the corresponding VCF files.

`--gatk_downsample`

Maximum depth coverage allowed for genotyping before down-sampling is turned on. Any position with a coverage higher than this value will be randomly down-sampled to 250 reads. Default: 250

`--gatk_ug_gatk_ug_defaultbasequalities`

Specify a value to set base quality scores, if reads are missing this information. Maybe useful if you have 'synthetically' generated reads (e.g. chopping up a reference genome). Default is set to -1 which is do not set any default quality (turned off). Default: -1

`--freebayes_C`

Specify minimum required supporting observations to consider a variant. Default: 1

`--freebayes_g`

Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than specified C. Not set by default.

`--freebayes_p`

Specify ploidy of sample in FreeBayes. Default is diploid. Default: 2

`--pileupcaller_bedfile`

Specify a SNP panel in the form of a bed file of sites at which to generate pileup for pileupCaller.

`--pileupcaller_snpfile`

Specify a SNP panel in EIGENSTRAT format, pileupCaller will call these sites.

`--pileupcaller_method`

Specify calling method to use. Options: randomHaploid, randomDiploid, majorityCall. Default: 'randomHaploid'

`--angsd_glmodel`

Specify which genotype likelihood model to use. Options: 'samtools, 'gatk', 'soapsnp', 'syk'. Default: 'samtools'

`--angsd_glformat`

Specifies what type of genotyping likelihood file format will be output. Options: 'text', 'binary', 'binary_three', 'beagle_binary'. Default: 'text'.

The options refer to the following descriptions respectively:

text: textoutput of all 10 log genotype likelihoods.
binary: binary all 10 log genotype likelihood
binary_three: binary 3 times likelihood
beagle_binary: beagle likelihood file

See the ANGSD documentation for more information on which to select for your downstream applications.

`--angsd_createfasta`

Turns on the ANGSD creation of a FASTA file from the BAM file.

`--angsd_fastamethod`

The type of base calling to be performed when creating the ANGSD FASTA file. Options: 'random' or 'common'. Will output the most common non-N base at each given positin, whereas 'random' will pick one at random. Default: 'random'.

`--pileupcaller_transitions_mode`

Specify if genotypes of transition SNPs should be called, set to missing, or excluded from the genotypes respectively. Options: AllSites, TransitionsMissing, SkipTransitions. Default: AllSites

Consensus Sequence Generation

If using TSV input, consensus eneration is performed per sample (i.e. after all types of libraries are merged).

`--run_vcf2genome`

Turn on concensus sequence genome creation via VCF2Genome. Only accepts GATK UnifiedGenotyper VCF files with the --gatk_ug_out_mode 'EMIT_ALL_SITES' and --gatk_ug_genotype_model 'SNP flags. Typically useful for small genomes such as mitochondria.

`--vcf2genome_outfile`

The name of your requested output FASTA file. Do not include .fasta suffix.

`--vcf2genome_header`

The name of the FASTA entry you would like in your FASTA file.

`--vcf2genome_minc`

Minimum depth coverage for a SNP to be called. Else, a SNP will be called as N. Default: 5

`--vcf2genome_minq`

Minimum genotyping quality of a call to be called. Else N will be called. Default: 30

`--vcf2genome_minfreq`

In the case of two possible alleles, the frequency of the majority allele required to be called. Else, a SNP will be called as N. Default: 0.8

Mitochondrial to Nuclear Ratio

If using TSV input, Mitochondrial to Nuclear Ratio calculation is calculated per deduplicated library (after lane merging)

`--run_mtnucratio`

Turn on the module to estimate the ratio of mitochondrial to nuclear reads.

`--mtnucratio_header`

Specify the FASTA entry in the reference file specified as --fasta, which acts as the mitochondrial 'chromosome' to base the ratio calculation from. The tool only accepts the first section of the header before the first space. The default chromosome name is based on hs37d5/GrCH37 human reference genome. Default: 'MT'

SNP Table Generation

SNP Table Generation here is performed by MultiVCFAnalyzer. The current version of MultiVCFAnalyzer version only accepts GATK UnifiedGenotyper 3.5 VCF files, and when the ploidy was set to 2 (this allows MultiVCFAnalyzer to look for report frequencies of polymorphic positions). A description of how the tool works can be seen in the Supplementary Information of Bos et al. (2014) under "SNP Calling and Phylogenetic Analysis".

More can be seen in the MultiVCFAnalyzer documentation.

If using TSV input, MultiVCFAnalyzer is performed on all samples gathered together.

`--run_multivcfanalyzer`

Turns on MultiVCFAnalyzer. Will only work when in combination with UnifiedGenotyper genotyping module.

`--write_allele_frequencies`

Specify whether to tell MultiVCFAnalyzer to write within the SNP table the frequencies of the allele at that position e.g. A (70%).

`--min_genotype_quality`

The minimal genotyping quality for a SNP to be considered for processing by MultiVCFAnalyzer. The default threshold is 30.

`--min_base_coverage`

The minimal number of reads covering a base for a SNP at that position to be considered for processing by MultiVCFAnalyzer. The default depth is 5.

`--min_allele_freq_hom`

The minimal frequency of a nucleotide for a 'homozygous' SNP to be called. In other words, e.g. 90% of the reads covering that position must have that SNP to be called. If the threshold is not reached, and the previous two parameters are matched, a reference call is made (displayed as . in the SNP table). If the above two parameters are not met, an 'N' is called. The default allele frequency is 0.9.

`--min_allele_freq_het`

The minimum frequency of a nucleotide for a 'heterozygous' SNP to be called. If this parameter is set to the same as --min_allele_freq_hom, then only homozygous calls are made. If this value is less than the previous parameter, then a SNP call will be made if it is between this and the previous parameter and displayed as a IUPAC uncertainty call. Default is 0.9.

`--additional_vcf_files`

If you wish to add to the table previously created VCF files, specify here a path with wildcards (in quotes). These VCF files must be created the same way as your settings for GATK UnifiedGenotyping module above.

`--reference_gff_annotations`

If you wish to report in the SNP table annotation information for the regions SNPs fall in. This must be in GFF format and the path must be in quotes.

`--reference_gff_exclude`

If you wish to exclude SNP regions from consideration by MultiVCFAnalyzer (such as for problematic regions), provide a file in GFF format (the path must be in quotes).

`--snp_eff_results`

If you wish to include results from SNPEff effect analysis, supply the output from SNPEff in txt format. The path must be in quotes.

Human Sex Determination

An optional process for human DNA. It can be used to calculate the relative coverage of X and Y chromosomes compared to the autosomes (X-/Y-rate). Standard errors for these measurements are also calculated, assuming a binomial distribution of reads across the SNPs.

If using TSV input, SexDetERRmine is performed on all samples gathered together.

`--run_sexdeterrmine`

Specify to run the optional process of sex determination.

`--sexdeterrmine_bedfile`

Specify an optional bedfile of the list of SNPs to be used for X-/Y-rate calculation. Running without this parameter will considerably increase runtime, and render the resulting error bars unstrustworthy. Theoretically, any set of SNPs that are distant enough that two SNPs are unlikely to be covered by the same read can be used here. The programme was coded with the 1240K panel in mind. The path must be in quotes.

Human Nuclear Contamination

`--run_nuclear_contamination`

Specify to run the optional processes for nuclear contamination.

`--contamination_chrom_name`

The name of the chromosome X in your bam. 'X' for hs37d5, 'chrX' for HG19. Defaults to 'X'.

Metagenomic Screening

An increasingly common line of analysis in high-throughput aDNA analysis today is simultaenously screening off target reads of the host for endogenous microbial signals - particularly of pathogens. Metagenomic screening is currently offered via MALT with aDNA specific verification via MaltExtract, or Kraken2.

Please note the following:

MALT database construction functionality is not included within the pipeline - this should be done independently, prior the EAGER run.
- To use malt-build from the same version as malt-run, load either the docker, singularity or conda environment.
MALT can often require very large computing resources depending on your database. We set a absolute minimum of 16 cores and 128GB of memory (which is 1/4 of the recommendation from the developer). Please leave an issue on the nf-core github if you would like to see this changed.

⚠️ Running MALT on a server with less than 128GB of memory should be performed at your own risk.

If using TSV input, metagenomic screening is performed on all samples gathered together.

`--run_metagenomic_screening`

Turn on the metagenomic screening module.

`--metagenomic_tool`

Specify which taxonomic classifier to use. There are two options available:

kraken with Kraken2
malt : more can be seen in the MALT documentation

⚠️ Important It is very important to run nextflow clean -f on your nextflow run directory once completed. RMA6 files are VERY large and are copied from a work/ directory into the results folder. You should clean the work directory with the command to ensure non-redundency and large HDD footprints!

`--metagenomic_min_support_reads`

Specify the minimum number of reads a given taxon is required to have to be retained as a positive 'hit'.
For malt, this only applies when --malt_min_support_mode is set to 'reads'. Default: 1 .

`--database`

Specify the path to the directory containing your taxonomic classifer's database (malt or kraken).

For Kraken2, it can be either the path to the directory or the path to the .tar.gz compressed directory of the Kraken2 database.

`--percent_identity`

Specify the minimum percent identity (or similarity) a squence must have to the reference for it to be retained. Default is 85

Only used when --metagenomic_tool malt is also supplied

`--malt_mode`

Use this to run the program in 'BlastN', 'BlastP', 'BlastX' modes to align DNA and DNA, protein and protein, or DNA reads against protein references respectively. respectively. Ensure your database matches the mode. Check the MALT manual for more details. Default: 'BlastN'