This is a fork of the CLAPAnalysis pipeline, modified to run on the UCL cluster Myriad. It performs the same workflow as the original pipeline (described below), except for the following two modifications:
- Addition of a `removeduplicates` option, allowing the user to remove (default) or keep duplicates in the Picard MarkDuplicates step.
- The repeat masking step was modified to run `bedtools intersect` instead of `FilterBlacklist.jar`, because the latter appears to remove duplicates as well:
# Before repeat masking, each sample has reads marked as duplicated
$ ls -ltr workup/alignments/*.merged.dedup.star.bam | cut -d " " -f 10 | xargs -I {} samtools flagstat {} | grep duplicates
487828 + 0 duplicates
6412520 + 0 duplicates
1254932 + 0 duplicates
165700 + 0 duplicates
371917 + 0 duplicates
3358908 + 0 duplicates
# After repeat masking, no reads are marked as duplicated (i.e., these have been removed)
$ ls -ltr workup/alignments/*.merged.dedup.star.masked.bam | cut -d " " -f 10 | xargs -I {} samtools flagstat {} | grep duplicates
0 + 0 duplicates
0 + 0 duplicates
0 + 0 duplicates
0 + 0 duplicates
[…]
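The modified masking step can be sketched as follows; the file names here are illustrative placeholders, and the actual command lives in the Snakemake rules:

```shell
# Sketch of the modified repeat-masking step (illustrative file names):
# drop reads overlapping blacklisted repeat regions while leaving the
# duplicate flags set by Picard untouched (-v keeps reads with no overlap).
bedtools intersect -v -abam sample.merged.dedup.star.bam \
  -b assets/repeats.RNA.combined.hg38.mm10.bed \
  > sample.merged.dedup.star.masked.bam
```

Because `bedtools intersect` only filters reads by genomic overlap, the duplicate flags set upstream survive, which is what allows the `removeduplicates: false` option to work.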
Note: The STAR and Bowtie2 indexes were missing from the original repository, so these were created and added to the CLAPAnalysis_sge repository. For Bowtie2, the index was based on the same repeat files (human and mouse combined) used in the nf-core/clipseq pipeline.
To use the pipeline on Myriad:
As the authors mention: "For reproducibility, we recommend keeping the pipeline, input, and output directories together. In other words, the complete directory should look like this GitHub repository with an extra workup subdirectory created upon running this pipeline."

1. Clone the repository:

git clone https://github.com/lconde-ucl/CLAPAnalysis_sge.git
2. Navigate to the main folder and bring in your FASTQ files (file names must follow the `_R{1,2}.fastq.gz` naming convention)
E.g.:
cd CLAPAnalysis_sge/clap_pipeline
mkdir fastqs
cd fastqs/
rsync -arv rds2:/rdss/rd01/ritd-ag-project-rd011h-shend55/PRC2_TEMP/SAFA_MinusTag_CLIP_1.fastq.gz .
rsync -arv rds2:/rdss/rd01/ritd-ag-project-rd011h-shend55/PRC2_TEMP/SAFA_MinusTag_CLIP_2.fastq.gz .
ln -s SAFA_MinusTag_CLIP_1.fastq.gz SAFA_MinusTag_CLIP_R1.fastq.gz
ln -s SAFA_MinusTag_CLIP_2.fastq.gz SAFA_MinusTag_CLIP_R2.fastq.gz
3. Generate the samples.json file:

./fastq2json.py --fastq_dir fastqs/
4. Load the Snakemake module:

module load blic-modules
module load snakemake/7.32.4
activate_snakemake
5. Adjust the options in the config.yaml file if needed:

- Email address: add an email address if you want to receive an email if the pipeline fails (default: none)
- Assembly: specify the assembly, hg38, mm10, or mixed (default: mixed)
- removeduplicates: if you do not want to remove duplicates in the Picard MarkDuplicates step, change this to false (default: true)

email: ""
assembly: "mixed"
removeduplicates: "true"
6. Run the pipeline:

./run_pipeline.sh

The results are written to a workup subdirectory, e.g.:
./CLAPAnalysis_sge/clap_pipeline/workup/
├── [ 68K] alignments
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_000._STARgenome/
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_001._STARgenome/
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_002._STARgenome/
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_003._STARgenome/
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_004._STARgenome/
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_005._STARgenome/
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_006._STARgenome/
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_007._STARgenome/
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_008._STARgenome/
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_009._STARgenome/
│ ├── [4.0K] SAFA_MinusTag_CLIP.part_010._STARgenome/
│ └── [4.0K] SAFA_MinusTag_CLIP.part_011._STARgenome
├── [4.0K] fastqs/
├── [ 16K] logs/
│ └── [ 12K] cluster/
├── [4.0K] splitfq/
└── [ 16K] trimmed/
112K used in 18 directories
The documentation from the original pipeline is below.
Contents
This pipeline performs a tiered RNA alignment first to repetitive and structural RNAs (rRNAs, snRNAs, snoRNAs, tRNAs), followed by unique alignment to an appropriate reference genome.
This pipeline assumes an existing conda installation and is written as a Snakemake workflow. To install Snakemake with conda, run
conda env create -f envs/snakemake.yaml
conda activate snakemake
to create and activate a conda environment named snakemake. Once all
the input files are ready, run the pipeline on a SLURM
server environment with
./run_pipeline.sh
After the pipeline finishes, you can explore mapped alignments in
`workup/alignments` and calculate enrichments using
`scripts/Enrichment.jar` by passing the arguments:
java -jar Enrichment.jar <sample.bam> <input.bam> <genefile.bed> <savename.windows> <sample read count> <input read count>
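A single invocation might look like the following; every file name and read count below is an illustrative placeholder, not a value from a real run:

```shell
# Illustrative invocation of Enrichment.jar; all file names and
# read counts are placeholders for the positional arguments above.
java -jar scripts/Enrichment.jar \
  workup/alignments/sample.merged.mapped.bam \
  workup/alignments/input.merged.mapped.bam \
  assets/genes.bed \
  sample.windows \
  1000000 2000000
```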
Other common usage notes:

- To run the pipeline for input RNA samples, replace `Snakefile` with `Snakefile_for_input` under `--snakefile` in `./run_pipeline.sh`.
- To run the pipeline on a local computer (e.g., laptop), comment out or remove the `--cluster-config cluster.yaml` and `--cluster "sbatch ..."` arguments within `./run_pipeline.sh`, and set the number of jobs `-j <#>` to the number of local processors available.
- `run_pipeline.sh` passes any additional arguments to Snakemake. For example, run `./run_pipeline.sh --dry-run` to perform a dry run, or `./run_pipeline.sh --forceall` to force (re)execution of all rules regardless of past output.
The pipeline relies on scripts written in Java, Bash, and Python.
Versions of Python are specified in conda environments described in
envs/, along with other third-party programs and packages that this
pipeline depends on.
Workflow
- Define samples and paths to FASTQ files (`fastq2json.py` or manually generate `samples.json`)
- Split FASTQ files into chunks for parallel processing (set `num_chunks` in `config.yaml`)
- Adaptor trimming (Trim Galore!)
- Tiered RNA alignment workflow:
- Alignment to repetitive and structural RNAs (Bowtie2)
- Convert unmapped reads to FASTQ files (samtools)
- Alignment to unique RNAs in reference genome (STAR)
- Chromosome relabeling (add "chr") and filtering (removing non-canonical chromosomes)
- PCR deduplication (Picard)
- Repeat masking (based on UCSC blacklists)
- Merge all BAMs from initial chunking (samtools)
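The core of the tiered alignment can be sketched as shell commands; index paths and file names here are illustrative placeholders, and the real commands are defined in the Snakefile rules:

```shell
# 1. Align to repetitive/structural RNAs with Bowtie2, name-sorting the
#    output so mates stay paired (illustrative index path and file names)
bowtie2 -x assets/index_mixed/repeats \
  -1 trimmed_R1.fastq.gz -2 trimmed_R2.fastq.gz | \
  samtools sort -n -o repeats.bam -

# 2. Convert reads that did not map to the repeat index back to FASTQ
#    (flag 4 = read unmapped) for the genome-level alignment
samtools fastq -f 4 -1 unmapped_R1.fastq.gz -2 unmapped_R2.fastq.gz repeats.bam

# 3. Align the remaining reads to the reference genome with STAR
STAR --genomeDir assets/index_mixed_star \
  --readFilesIn unmapped_R1.fastq.gz unmapped_R2.fastq.gz \
  --readFilesCommand zcat --outSAMtype BAM Unsorted
```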
We will refer to 4 directories:
- Working directory: We follow the Snakemake documentation in using the term "working directory".
  - This is also where Snakemake creates a `.snakemake` directory within which it installs conda environments and keeps track of metadata regarding the pipeline.
- Pipeline directory: where the software (including the `Snakefile` and scripts) resides
  - `envs/`
  - `scripts/`
  - `fastq2json.py`
  - `Snakefile`
- Input directory: where configuration and data files reside
  - `assets/`
  - `data/`
  - `cluster.yaml`
  - `config.yaml`: paths are specified relative to the working directory
  - `samples.json`: paths are specified relative to the working directory
  - `run_pipeline.sh`: the paths in the arguments `--snakefile <path to Snakefile>`, `--cluster-config <path to cluster.yaml>`, and `--configfile <path to config.yaml>` are relative to where you run `run_pipeline.sh`
- Output or workup directory (`workup/`): where to place this `workup` directory can be changed in `config.yaml`
  - `alignments/`
  - `fastqs/`
  - `logs/`
  - `splitfq/`
  - `trimmed/`
For reproducibility, we recommend keeping the pipeline, input, and
output directories together. In other words, the complete directory
should look like this GitHub repository with an extra workup
subdirectory created upon running this pipeline.
- `config.yaml`: YAML file containing the processing settings and paths of required input files. As noted above, paths are specified relative to the working directory.
  - `output_dir`: path in which to create the output directory `<output_dir>/workup`, within which all intermediate and output files are placed
  - `temp_dir`: path to a temporary directory, such as used by the `-T` option of GNU sort
  - `samples`: path to `samples.json` file
  - `repeat_bed`:
    - `mm10`: path to mm10 genomic regions to ignore, such as UCSC blacklist regions; reads mapping to these regions are masked
    - `hg38`: path to hg38 genomic regions to ignore, such as UCSC blacklist regions; reads mapping to these regions are masked
    - `mixed`: path to mixed hg38+mm10 genomic regions to ignore, such as UCSC blacklist regions; reads mapping to these regions are masked
  - `bowtie2_index`:
    - `mm10`: path to Bowtie 2 genome index for the GRCm38 (mm10) build
    - `hg38`: path to Bowtie 2 genome index for the GRCh38 (hg38) build
    - `mixed`: path to Bowtie 2 genome index for the combined GRCh38 (hg38) and GRCm38 (mm10) build
  - `assembly`: currently supports `"mm10"`, `"hg38"`, or `"mixed"`
  - `star_index`:
    - `mm10`: path to STAR genome index for the GRCm38 (mm10) build
    - `hg38`: path to STAR genome index for the GRCh38 (hg38) build
    - `mixed`: path to STAR genome index for the combined GRCh38 (hg38) and GRCm38 (mm10) build
  - `num_chunks`: integer giving the number of chunks to split FASTQ files from each sample into for parallel processing
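Putting these keys together, a minimal configuration might look like the following excerpt (every path below is an illustrative placeholder, not a required value):

```yaml
# Illustrative config.yaml excerpt; all paths are placeholders
output_dir: "."
temp_dir: "/tmp"
samples: "samples.json"
num_chunks: 2
assembly: "mixed"
repeat_bed:
  mixed: "assets/repeats.RNA.combined.hg38.mm10.bed"
bowtie2_index:
  mixed: "assets/index_mixed/repeats"
star_index:
  mixed: "assets/index_mixed"
```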
- `samples.json`: JSON file with the location of FASTQ files (read 1, read 2) to process.
  - `config.yaml` key to specify the path to this file: `samples`
  - This can be prepared using `fastq2json.py --fastq_dir <path_to_directory_of_FASTQs>` or manually formatted as follows:

    ```json
    {
      "sample1": {
        "R1": ["<path_to_data>/sample1_R1.fastq.gz"],
        "R2": ["<path_to_data>/sample1_R2.fastq.gz"]
      },
      "sample2": {
        "R1": ["<path_to_data>/sample2_R1.fastq.gz"],
        "R2": ["<path_to_data>/sample2_R2.fastq.gz"]
      },
      ...
    }
    ```

  - The pipeline (in particular, the script `scripts/bash/split_fastq.sh`) currently only supports one read 1 (R1) and one read 2 (R2) FASTQ file per sample.
    - If there are multiple FASTQ files per read orientation per sample (for example, if the same sample was sequenced multiple times, or it was split across multiple lanes during sequencing), the FASTQ files will first need to be concatenated together, and the paths to the concatenated FASTQ files should be supplied in the JSON file.
  - Each sample is processed independently, generating independent BAM files.
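Because concatenated gzip files form a valid gzip stream, lane-level FASTQs can be combined with plain `cat` before generating `samples.json` (the file names below are illustrative):

```shell
# Combine per-lane FASTQs into one file per read orientation
# (illustrative names; gzip streams concatenate safely with cat)
cat sample1_L001_R1.fastq.gz sample1_L002_R1.fastq.gz > sample1_R1.fastq.gz
cat sample1_L001_R2.fastq.gz sample1_L002_R2.fastq.gz > sample1_R2.fastq.gz
```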
- `assets/repeats.RNA.mm10.bed`, `assets/repeats.RNA.hg38.bed`, `assets/repeats.RNA.combined.hg38.mm10.bed`: blacklisted repetitive genomic regions (i.e., poor alignments to rRNA regions) for CLIP and CLAP data.
- `assets/index_mm10/*.bt2`, `assets/index_hg38/*.bt2`, `assets/index_mixed/*.bt2`: Bowtie 2 genome index
  - `config.yaml` key to specify the path to the index: `bowtie2_index: {'mm10': <mm10_index_prefix>, 'hg38': <hg38_index_prefix>, 'mixed': <mixed_index_prefix>}`
  - Bowtie 2 indexes for repetitive and structural RNAs may be custom generated from a FASTA file containing desired annotations.
- `assets/index_mm10/*.star`, `assets/index_hg38/*.star`, `assets/index_mixed/*.star`: STAR genome index
  - `config.yaml` key to specify the path to the index: `star_index: {'mm10': <mm10_index_prefix>, 'hg38': <hg38_index_prefix>, 'mixed': <mixed_index_prefix>}`
  - The combined (`mixed`) genome build can be concatenated from standard GRCm38 (mm10) and GRCh38 (hg38) builds.
- Merged mapped BAM files for individual proteins (`workup/alignments/*.merged.mapped.bam`)
- Window enrichment tables computed from CLIP/CLAP sample BAMs and input BAMs
  - These are generated independently of the Snakemake pipeline.
  - Example BED files used to survey enrichments are provided in the `assets` folder.
Adapted from the SPRITE, RNA-DNA SPRITE, and ChIP-DIP pipelines by Isabel Goronzy (@igoronzy).
Other contributors
- Jimmy Guo (@jk-guo)
- Mitchell Guttman (@mitchguttman)