Created by Erin Harvey, Carla Mavian, Nokuzotha Nkiwane, TJ Sanko, Eduan Wilkinson (in alphabetical order).
In this tutorial we will learn how to taxonomically classify and visualize metagenomic reads obtained with Illumina sequencing, using the following programs. These programs can be installed in a conda environment:
- Install Miniconda
- Create conda environment:
conda create -n viral_discovery -c bioconda -c conda-forge python=3.12 rsem trinity multiqc samtools=1.21 bowtie2 trimmomatic fastqc megahit diamond blast seqkit
- Activate the environment in the scripts:
ON='conda activate viral_discovery' && eval ${ON}
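Every script below stores its activation command in a variable and executes it with eval, as on the line above. A minimal stand-alone illustration of the pattern (using echo instead of conda so it runs anywhere):

```shell
# Store a command as a string, then execute it with eval
ON='echo environment activated'
eval ${ON}    # prints: environment activated
# the same pattern is used later for deactivation
OFF='echo environment deactivated'
eval ${OFF}
```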
MacBooks and Linux computers come with a terminal as part of the system and do not require any additional programs to connect to a server. On Windows computers, you have to emulate a terminal; multiple free programs provide that function, e.g. MobaXterm, PuTTY, Git Bash, Termius, etc. Using one of these emulators, open a terminal and connect to the host by typing:
ssh username@ceri.samrc.ac.za
The list of usernames and passwords is available under USERNAME LIST. The passwords are only valid for the next 7 days.
The first time you connect, you will be asked whether you trust the connection.
You have to type 'yes' and press enter.
Then you will be asked to provide a password.
You will be provided one together with your username at the beginning of the workshop.
Passwords are case sensitive and invisible when typing!
Therefore, it is advisable to copy & paste the password.
Successful connection will print the server's logo on your terminal.
Typically, a server gives you only 2-5 tries to type the password, after which you will be blocked for at least 30 minutes.
After a second failed password attempt, it is safer to close the terminal, open it again, and retry.
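If you will reconnect repeatedly during the workshop, an OpenSSH client-side alias saves retyping the full address. This is an optional sketch; the alias `ceri` and the username `vdw00` are examples only — substitute your own:

```shell
# Append a host alias to the SSH client configuration (one-time setup)
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host ceri
    HostName ceri.samrc.ac.za
    User vdw00
EOF
# afterwards, connecting is just: ssh ceri
```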
- Commands used for orientation on the server
- Moving around on the server
- Absolute vs relative path
- First, we will create a working folder in the workspace folder and move into the freshly created directory
- Next, we will create other folders which will help keep the analysis sorted
- Let's copy our raw data files
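The absolute-vs-relative path distinction from the list above can be tried safely in a throwaway directory under /tmp (nothing here touches your workspace):

```shell
# Absolute paths start with /; relative paths are resolved against the current directory
base=$(mktemp -d)                 # throwaway directory, absolute path
mkdir -p ${base}/metagenomics/data
cd ${base}/metagenomics           # absolute path: works from anywhere
cd data                           # relative path: only works from ${base}/metagenomics
pwd                               # now ${base}/metagenomics/data
cd ../..                          # relative path: climb two levels back up to ${base}
pwd
rm -rf ${base}                    # clean up
```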
pwd
ls
ls -Fal *
ll *
cd
cd ..
cd ../..
cd ~
cd workspace
cd /home/vdw00
cd ../
cd ./vdw00
cd ~/workspace
mkdir -p metagenomics && cd metagenomics
pwd
mkdir -p data results scripts
chmod -R a=rwx *
ll *
Usually, data are located online, either in a cloud account or in a database accessible via a web address. To download files from a website you use the command wget:
wget https://
Unfortunately, this command does not work on Google Drive or OneDrive links. Data from GitHub can be downloaded by cloning a repository:
git clone github.com
Our data are already on the server, and you can copy them to the designated location:
cp -R /analyses/vdworkshop/data/* ./data
chmod a=rwx ./data/*
ll ./data/*
In this section we will create the scripts to be executed at each step. The steps will be run from the ~/workspace/metagenomics location, while the scripts will be located in the ~/workspace/metagenomics/scripts directory.
cd ~/workspace/metagenomics
(if you get lost, you can use the absolute path: /analyses/vdworkshop/${USER}/metagenomics)
We will use the nano text editor to create all scripts. You can create an empty file and open it for editing at the same time:
nano scripts/[script_name].sh
You can copy & paste the script text directly into the open document.
To save any of your scripts, press Ctrl+X, then Y, then ENTER. This will overwrite [script_name].sh (which was empty on opening).
You can change the name after pressing 'Y'.
Before running the script you have to change its permissions:
chmod a=rwx scripts/[script_name].sh
To run the script, type:
bash scripts/[script_name].sh
nano scripts/01.fastqc_pretrim.sh
#!/usr/bin/env bash
THR=5
### enabling conda environment and fastqc program
ON="module miniconda && conda activate fastqc"
eval $ON
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/data/fastq'
output=${workdir}'/results/01.fastqc_pretrim'
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
echo '[INFO] Running FastQC pre-trim'
### run fastqc
fastqc -t ${THR} -o ${output}/ ${input}/*.f*q*
echo -e "\n[DONE] Pretrim FastQC complete"
### deactivating fastqc program
OFF='conda deactivate'
eval ${OFF}
chmod -R a=rwx ${output}
exit 0;
chmod a=rwx scripts/01.fastqc_pretrim.sh
bash scripts/01.fastqc_pretrim.sh
nano scripts/02.trimmomatic.sh
#!/usr/bin/env bash
THR=5
### activating program
ON='module trimmomatic 1>/dev/null 2>/dev/null'
eval ${ON}
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/data/fastq'
output=${workdir}'/results/02.trim_output'
ADAPTERS=${workdir}'/data/adapters.fa'
### make output directory if it doesn't exist
if [[ ! -e ${output} ]]; then mkdir -p ${output}; fi
### getting the SRA IDs and creating output files
for FOW in $(ls ${input}/*.f*q* | grep -Ei "_r?1"); do
REV=`echo ${FOW} | sed -r 's/\_(r|R)?1/\_\12/'`;
ID=`basename ${FOW} | cut -d '_' -f1`
P1=${output}/${ID}'_1.P.fq.gz'
U1=${output}/${ID}'_1.U.fq.gz'
P2=${output}/${ID}'_2.P.fq.gz'
U2=${output}/${ID}'_2.U.fq.gz'
### running program
echo "[INFO] trimming: ${ID}"
trimmomatic PE -threads ${THR} -phred33 -summary ${output}/${ID}'_statsSummary.txt' \
${FOW} ${REV} ${P1} ${U1} ${P2} ${U2} \
ILLUMINACLIP:"${ADAPTERS}":2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:5 MINLEN:25 \
>> ${output}/${ID}.log 2>&1
echo "[DONE] Trimming for ${ID} complete"
done
chmod -R a=rwx ${output}
exit 0;
chmod a=rwx scripts/02.trimmomatic.sh
bash scripts/02.trimmomatic.sh
nano scripts/03.fastqc_posttrim.sh
#!/usr/bin/env bash
THR=5
### enabling conda environment and fastqc program
ON="module miniconda && conda activate fastqc 1>/dev/null 2>/dev/null"
eval $ON
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/results/02.trim_output'
output=${workdir}'/results/03.fastqc_posttrim'
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
echo '[INFO] Running FastQC post-trim'
### run fastqc
fastqc -t ${THR} -o ${output}/ ${input}/*.P.fq.gz
echo -e "\n[DONE] Post-trim FastQC complete\n"
### deactivating fastqc program
OFF='conda deactivate'
eval ${OFF}
chmod -R a=rwx ${output}
exit 0;
chmod a=rwx scripts/03.fastqc_posttrim.sh
bash scripts/03.fastqc_posttrim.sh
nano scripts/04.multiqc.sh
#!/usr/bin/env bash
THR=5
### enabling conda environment and fastqc program
ON="module miniconda && conda activate fastqc 1>/dev/null 2>/dev/null"
eval $ON
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input_pre=${workdir}'/results/01.fastqc_pretrim'
input_post=${workdir}'/results/03.fastqc_posttrim'
output=${workdir}'/results/04.multiqc'
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
echo '[INFO] Running MultiQC'
### to join both results
#multiqc ${input_pre} ${input_post} -o ${output}
multiqc ${input_pre} -o ${output}/multiqc_pretrim
multiqc ${input_post} -o ${output}/multiqc_posttrim
### deactivating multiqc program
OFF='conda deactivate'
eval ${OFF}
echo -e "\n[INFO] Compressing the results\n"
zip -9r ${output}/multiqc_pretrim.zip ${output}/multiqc_pretrim
zip -9r ${output}/multiqc_posttrim.zip ${output}/multiqc_posttrim
chmod -R a=rwx ${output}
exit 0;
chmod a=rwx scripts/04.multiqc.sh
bash scripts/04.multiqc.sh
nano scripts/05.megahit.sh
#!/usr/bin/env bash
THR=5
### enabling conda environment and megahit program
ON="module miniconda && conda activate megahit 1>/dev/null 2>/dev/null"
eval $ON
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/results/02.trim_output'
output=${workdir}'/results/05.megahit'
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
### run megahit
for FOW in $(ls ${input}/*_1.P.fq.gz); do
REV=`echo ${FOW} | sed -r 's/\_(r|R)?1/\_\12/'`;
ID=`basename ${FOW} | cut -d '_' -f1`
LOG=${output}/${ID}.log
echo "[INFO] assembling: ${ID}"
megahit -t ${THR} -1 ${FOW} -2 ${REV} -o ${output}/${ID} > ${LOG} 2>&1
### after megahit run, prepend sample name to contigs
### (allows easier tracking of which contigs came from which sample in downstream analysis)
awk -v prefix="${ID}_" '/^>/ {$0=">" prefix substr($0,2)} {print}
' ${output}/${ID}/final.contigs.fa | tr ' ' '_' > ${output}/${ID}/${ID}.contigs.fasta
echo '[INFO] Cleaning results' && sleep 1s
rm -Rf ${output}/${ID}/intermediate_contigs 2>/dev/null
ln -s ${output}/${ID}/${ID}.contigs.fasta ${output}/${ID}.contigs.fasta
echo "[DONE] Megahit assembly completed successfully and contigs named with sample name: ${ID}" && sleep 1s
done
### deactivating megahit program
OFF='conda deactivate'
eval ${OFF}
chmod -R a=rwx ${output}
exit 0;
chmod a=rwx scripts/05.megahit.sh
bash scripts/05.megahit.sh
nano scripts/06.1.diamond_rvdb.sh
#!/usr/bin/env bash
THR=5
### enabling diamond program
ON="module diamond 1>/dev/null 2>/dev/null"
eval $ON
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/results/05.megahit'
output=${workdir}'/results/06.1.diamond_rvdb'
input_db=${workdir}'/data/database'
### DB variables and directories
FASTA="${input_db}/U-RVDBv30.0-prot.fasta"
DB='RVDB'
if [[ ! -e ${FASTA} && -e ${FASTA}.xz ]]; then xz -dk -T ${THR} ${FASTA}.xz; fi
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
### make diamond protein database
echo '[INFO] Reading database file '${DB}
if [[ ! -e ${input_db}/${DB}.dmnd ]]; then
diamond makedb --in ${FASTA} --threads ${THR} -d ${input_db}/${DB}; fi
chmod a=rwx ${input_db}/${DB}.dmnd 2>/dev/null
### loop through each of the files created in megahit output directory
### to find final.contigs.fa files and run diamond
function DBLX {
FOLDER=$1; OUT=$2; DB=$3; T=$4; DBDIR=$5;
ID=$(basename $1)
CONTIGS=${FOLDER}/${ID}.contigs.fasta
LOG=${OUT}/${ID}.log
if [[ -f ${CONTIGS} ]]; then
echo "[INFO] Analyzing sample ${ID}"
sample_out=${OUT}/${ID}_rvdb.blastx
diamond blastx -d ${DBDIR}/${DB}.dmnd \
-q ${CONTIGS} \
--out ${sample_out} \
--threads ${T} \
--evalue 1e-5 \
--outfmt 6 qseqid qlen sseqid stitle pident length evalue bitscore \
--strand both \
--unal 0 \
--more-sensitive \
--mp-init > ${LOG} 2>&1
else
echo "[WARNING] Contigs file for ${ID} not found."
fi
}
export -f DBLX
### check and run in parallel
PARALLEL_VER=(`parallel --version | grep 'GNU parallel'`)
if [[ ${PARALLEL_VER[2]} =~ [0-9]+ ]]; then
echo '[INFO] Starting parallel mode' && sleep 1s
ls -dl ${input}/* | grep ^d | awk '{print $9}' | parallel -j ${THR} -n1 -I% "DBLX %" ${output} ${DB} 1 ${input_db}
else
echo '[INFO] Starting non-parallel mode' && sleep 1s
for FOLDER in $(ls -dl ${input}/* | grep ^d | awk '{print $9}'); do
DBLX ${FOLDER} ${output} ${DB} 1 ${input_db}
done
fi
chmod -R a=rwx ${output}
exit 0;
chmod a=rwx scripts/06.1.diamond_rvdb.sh
#bash scripts/06.1.diamond_rvdb.sh
Running this step would take a few to several hours; therefore, to save time, we will copy the results of this step from a backup location:
cp -R /analyses/vdworkshop/backup/results/06.1.diamond_rvdb ./results/
nano scripts/06.2.diamond_nrvir.sh
#!/usr/bin/env bash
THR=5
### enabling diamond program
ON="module diamond 1>/dev/null 2>/dev/null"
eval $ON
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/results/05.megahit'
output=${workdir}'/results/06.2.diamond_nrvir'
input_db=${workdir}'/data/database'
### DB variables and directories
FASTA="${input_db}/nr.faa"
### downsized NR database to only viral sequences
DB='nrvir'
if [[ ! -e ${FASTA} && -e ${FASTA}.gz ]]; then gzip -dkf ${FASTA}.gz; fi
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
### make diamond protein database
echo '[INFO] Reading database file '${DB}
if [[ ! -e ${input_db}/${DB}.dmnd ]]; then
diamond makedb --in ${FASTA} --threads ${THR} -d ${input_db}/${DB}; fi
chmod a=rwx ${input_db}/${DB}.dmnd 2>/dev/null
### loop through each of the files created in megahit output directory
### to find final.contigs.fa files and run diamond
function DBLX {
FOLDER=$1; OUT=$2; DB=$3; T=$4; DBDIR=$5;
ID=$(basename $1)
CONTIGS=${FOLDER}/${ID}.contigs.fasta
LOG=${OUT}/${ID}.log
if [[ -f ${CONTIGS} ]]; then
echo "[INFO] Analyzing sample ${ID}"
sample_out=${OUT}/${ID}_nrvir.blastx
diamond blastx -d ${DBDIR}/${DB}.dmnd \
-q ${CONTIGS} \
--out ${sample_out} \
--threads ${T} \
--evalue 1e-5 \
--outfmt 6 qseqid qlen sseqid stitle pident length evalue bitscore \
--strand both \
--more-sensitive \
--unal 0 \
--mp-init > ${LOG} 2>&1
else
echo "[WARNING] Contigs file for ${ID} not found."
fi
}
export -f DBLX
### check and run in parallel
PARALLEL_VER=(`parallel --version | grep 'GNU parallel'`)
if [[ ${PARALLEL_VER[2]} =~ [0-9]+ ]]; then
echo '[INFO] Starting parallel mode' && sleep 1s
ls -dl ${input}/* | grep ^d | awk '{print $9}' | parallel -j ${THR} -n1 -I% "DBLX %" ${output} ${DB} 1 ${input_db}
else
echo '[INFO] Starting non-parallel mode' && sleep 1s
for FOLDER in $(ls -dl ${input}/* | grep ^d | awk '{print $9}'); do
DBLX ${FOLDER} ${output} ${DB} 1 ${input_db}
done
fi
chmod -R a=rwx ${output}
exit 0;
chmod a=rwx scripts/06.2.diamond_nrvir.sh
#bash scripts/06.2.diamond_nrvir.sh
Running this step would take a few to several hours; therefore, to save time, we will copy the results of this step from a backup location:
cp -R /analyses/vdworkshop/backup/results/06.2.diamond_nrvir ./results/
- Check for interrupted stop codons
- Download the translated frames to get protein sequences. If the contig does not have complete ORFs, choose another contig to look at.
- Note that if long contigs don't have any viral domains, it may be a false positive.
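Step 07 below keeps hits whose titles contain virus/viridae/viroid name variants using an extended-regex alternation. The pattern can be sanity-checked on its own with a few mock titles (note that grep -E needs parentheses for alternation; square brackets would denote a single-character class instead):

```shell
# Mock DIAMOND hit titles to exercise the taxon-name filter
printf '%s\n' "Human papillomavirus L1" "family Papillomaviridae" \
              "Potato spindle tuber viroid" "bacterial chaperonin GroEL" \
  | grep -Ei "vir(us|idae|oid)"
# keeps the first three lines; the GroEL line does not match
```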
nano scripts/07.combine.sh
#!/usr/bin/env bash
THR=5
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/results/05.megahit'
rvdb_in=${workdir}'/results/06.1.diamond_rvdb'
nrdb_in=${workdir}'/results/06.2.diamond_nrvir'
contigs_in=${workdir}'/results/05.megahit'
combined_dir="${workdir}/results/07.combined"
### make output directory if it doesn't exist
if ! [[ -d ${combined_dir} ]]; then mkdir -p -m a=rwx ${combined_dir}; fi
rvdb_out=${combined_dir}'/combined_rvdb.blastx'
nrdb_out=${combined_dir}'/combined_nrdb.blastx'
rvdb_virus=${combined_dir}'/combined_rvdb.virus_hits.blastx'
nrdb_virus=${combined_dir}'/combined_nrdb.virus_hits.blastx'
unique=${combined_dir}'/unique_viral_contigs.txt'
combined_contigs=${combined_dir}'/all_contigs.fasta'
viral_fasta=${combined_dir}'/viral_contigs.fasta'
echo "[INFO] Concatenating all contigs..."
cat ${contigs_in}/*.contigs.fasta | tr ' ' '_' > ${combined_contigs}
echo "[INFO] Concatenating RVDB blastx results..."
cat ${rvdb_in}/*_rvdb.blastx > ${rvdb_out}
echo "[INFO] Concatenating VP blastx results..."
cat ${nrdb_in}/*_nrvir.blastx > ${nrdb_out}
echo "[INFO] Filtering for viral hits..."
grep -Ei "vir(us|idae|oid)" ${rvdb_out} > ${rvdb_virus}
grep -Ei "vir(us|idae|oid)" ${nrdb_out} > ${nrdb_virus}
echo "[INFO] Extracting unique contig names..."
awk '{print $1}' ${rvdb_virus} ${nrdb_virus} | sort -u > ${unique}
### Activate seqkit environment
ON="module miniconda && conda activate seqkit 1>/dev/null 2>/dev/null"
eval ${ON}
echo "[INFO] Extracting viral contigs using seqkit..."
seqkit grep -f ${unique} ${combined_contigs} > ${viral_fasta}
OFF='conda deactivate'
eval ${OFF}
echo -en "
[DONE]
Combined RVDB blastx : ${rvdb_out}
Combined NRDB blastx : ${nrdb_out}
RVDB viral hits : ${rvdb_virus}
NRDB viral hits : ${nrdb_virus}
Unique viral contigs list: ${unique}
Viral contigs FASTA : ${viral_fasta}
"
echo "[COUNT] Number of viral contigs extracted: "`grep -Ec "^>" ${viral_fasta}`
exit 0;
chmod a=rwx scripts/07.combine.sh
bash scripts/07.combine.sh
nano scripts/08.ncbi_ntdb.sh
#!/usr/bin/env bash
THR=5
###enabling BLAST program
ON='module ncbi'
eval ${ON}
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/results/07.combined/viral_contigs.fasta'
output=${workdir}'/results/08.ncbi_ntdb'
input_db=${workdir}'/data/database'
### DB variables and directories
DB='ntdb'
if [[ ! -e ${input_db}/${DB}/${DB}.nal ]]; then
echo '[INFO] Indexing nucleotide NCBI database'
makeblastdb -out ${input_db}/${DB}/${DB} -dbtype 'nucl' -parse_seqids \
-in ${input_db}/${DB}/${DB}.fasta -input_type 'fasta';
fi
###make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
echo -en "
Running BLASTN against nt database...
Query: ${input}
DB: ${DB}
"
blastn \
-db ${input_db}/${DB}/${DB} \
-query ${input} \
-out ${output}'/viral_contigs_vs_nt.blastn' \
-num_threads ${THR} \
-evalue 1e-10 \
-max_target_seqs 25 \
-outfmt "6 qseqid qlen sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore" \
> ${output}/blastn.log 2>&1
chmod -R a+rwx ${output}
echo -en "
BLASTN complete.
Results written to: ${output}/viral_contigs_vs_nt.blastn
Log : ${output}/blastn.log
"
exit 0;
chmod a=rwx scripts/08.ncbi_ntdb.sh
#bash scripts/08.ncbi_ntdb.sh
Running this step would take a few to several hours; therefore, to save time, we will copy the results of this step from a backup location:
cp -R /analyses/vdworkshop/backup/results/08.ncbi_ntdb ./results/
nano scripts/09.blastn_filtered.sh
#!/usr/bin/env bash
THR=5
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
blastn_in=${workdir}'/results/08.ncbi_ntdb'
output=${workdir}'/results/09.blastn_filtered'
combined_contigs=${workdir}'/results/07.combined/all_contigs.fasta'
virus_hits_list=${output}'/combined_blastn.virus_hits.blastn'
unique_ids=${output}'/unique_ncbi_ntdb_contigs.txt'
viral_fasta=${output}'/viral_contigs_blastn.fasta'
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
echo "[INFO] Filtering BLASTn outputs for viral hits..."
### Combine and filter
grep -iv "virus" ${blastn_in}/*.blastn | awk '{print $1}' | sort -u > ${blastn_in}/blastn_not-viral_IDs.txt
grep -v -f ${blastn_in}/blastn_not-viral_IDs.txt ${workdir}/results/07.combined/unique_viral_contigs.txt > ${virus_hits_list}
echo "[INFO] Extracting unique contig IDs..."
awk '{print $1}' ${virus_hits_list} | sort -u > ${unique_ids}
echo "[COUNT] Number of unique viral contigs: "`cat ${unique_ids} | wc -l`
### Activate seqkit
ON="module miniconda && conda activate seqkit 1>/dev/null 2>/dev/null"
eval "${ON}"
echo "[INFO] Extracting viral contigs using seqkit..."
seqkit grep -j ${THR} -f ${unique_ids} ${combined_contigs} > ${viral_fasta}
OFF='conda deactivate'
eval ${OFF}
echo -en "
[DONE]
Viral hits table : ${virus_hits_list}
Unique contig list : ${unique_ids}
Viral contigs FASTA: ${viral_fasta}
Output directory : ${output}
"
exit 0;
chmod a=rwx scripts/09.blastn_filtered.sh
bash scripts/09.blastn_filtered.sh
nano scripts/10.diamond_blastx_blastn_contigs.sh
#!/usr/bin/env bash
THR=5
### enabling diamond program
ON="module diamond 1>/dev/null 2>/dev/null"
eval $ON
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/results/09.blastn_filtered/viral_contigs_blastn.fasta'
output=${workdir}'/results/10.diamond_blastx_blastn_contigs'
assembly=${workdir}'/results/05.megahit'
input_db=${workdir}'/data/database'
unique=${output}'/unique_viral_contigs.txt'
### DB variables and directories
FASTA="${input_db}/nr.faa"
DB='NRDB'
if [[ ! -e ${FASTA} && -e ${FASTA}.gz ]]; then gzip -dkf ${FASTA}.gz; fi
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
### make diamond protein database
if [[ ! -e ${input_db}/${DB}.dmnd ]]; then
echo '[INFO] Indexing protein diamond nrdb database'
diamond makedb --in ${FASTA} --threads ${THR} -d ${input_db}/${DB}; fi
chmod a=rwx ${input_db}/${DB}.dmnd
echo "[INFO] Using existing DIAMOND DB: ${DB}"
echo "[INFO] Running DIAMOND blastx..."
diamond blastx \
-d ${input_db}/${DB} \
-q ${input} \
--out ${output}'/viral_contigs_nrdb.blastx' \
--threads ${THR} \
--evalue 1e-5 \
--max-target-seqs 1 \
--outfmt 6 qseqid qlen sseqid stitle pident length evalue bitscore \
--strand both \
--more-sensitive \
--unal 0 \
2> ${output}'/diamond.log'
echo '[INFO] extracting accession numbers'
cat ${output}/viral_contigs_nrdb.blastx | grep -Ei "vir(us|idae|oid)" | awk '{print $3}' > ${output}/viral_contigs_nrdb.txt
echo "[INFO] Extracting unique contig names..."
cat ${output}/viral_contigs_nrdb.blastx | grep -Ei "vir(us|idae|oid)" | awk '{print $1}' | sort -u > ${unique}
### Activate seqkit environment
ON="module miniconda && conda activate seqkit 1>/dev/null 2>/dev/null"
eval ${ON}
for FASTA in $(ls ${assembly}/*.contigs.fasta); do
SAMPLE=`basename -s .contigs.fasta ${FASTA}`
echo "[INFO] Extracting viral contigs using seqkit..."
seqkit grep -f ${unique} ${FASTA} > ${output}/${SAMPLE}_viral_contigs.fasta
done
chmod -R a=rwx ${output}/*
echo -en "
[DONE]
DIAMOND results: ${output}/viral_contigs_nrdb.blastx
ACCESSIONS : ${output}/viral_contigs_nrdb.txt
Log : ${output}/diamond.log
"
exit 0;
chmod a=rwx scripts/10.diamond_blastx_blastn_contigs.sh
bash scripts/10.diamond_blastx_blastn_contigs.sh
nano scripts/11.genbank_fetch.sh
#!/usr/bin/env bash
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/results/10.diamond_blastx_blastn_contigs'
output=${workdir}'/results/11.genbank_fetch'
gb_final="${output}/11.final_gb.gb"
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
echo '[INFO] Creating Perl script to fetch taxonomy';
echo '#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple;
use XML::Simple;
use Text::CSV;
### Usage:
### perl get_taxonomy_lineage.pl accessions.txt output_dir
my $input_file = $ARGV[0] or die "Usage: $0 accessions.txt output_dir\n";
my $output_dir = $ARGV[1] or die "Specify output directory\n";
open(my $in, "<", $input_file) or die "Cannot open $input_file: $!\n";
my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
open(my $out, ">$output_dir/taxonomy_lineage.csv") or die "Cannot write to output file: $!\n";
$csv->print($out, ["Accession", "Organism", "TaxID", "Lineage"]);
while (my $acc = <$in>) {
chomp $acc;
next unless $acc;
print "Processing $acc...\n";
### Step 1: Search for accession in protein DB
my $esearch_url =
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=$acc&retmode=xml";
my $xml_data = get($esearch_url);
unless ($xml_data) { warn " Failed to fetch ESearch for $acc\n"; next; }
my $xml = eval { XMLin($xml_data) };
if ($@ || !$xml->{IdList}->{Id}) { warn " No ID found for $acc\n"; next; }
my $id = ref($xml->{IdList}->{Id}) eq "ARRAY"
? $xml->{IdList}->{Id}[0]
: $xml->{IdList}->{Id};
### Step 2: Fetch ESummary to get TaxID
my $esummary_url =
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=protein&id=$id&retmode=xml";
my $sum_data = get($esummary_url);
unless ($sum_data) { warn " Failed to fetch ESummary for $acc\n"; next; }
my ($taxid) = $sum_data =~ /.*Name="TaxId"[^>]*>(\d+)<\/Item>/;
unless ($taxid) { warn " No TaxID found for $acc\n"; next; }
### Step 3: Get taxonomy lineage
my $efetch_url =
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=$taxid&retmode=xml";
my $tax_data = get($efetch_url);
unless ($tax_data) { warn " Failed to fetch taxonomy for TaxID $taxid ($acc)\n"; next; }
my $tax_xml = eval { XMLin($tax_data) };
if ($@ || !$tax_xml->{Taxon}) { warn " Failed to parse taxonomy XML for TaxID $taxid\n"; next; }
### Taxon may be array or hash
my $taxon = $tax_xml->{Taxon};
$taxon = $taxon->[0] if ref($taxon) eq "ARRAY";
my $scientific_name = $taxon->{ScientificName} || "N/A";
my $lineage = $taxon->{Lineage} || "N/A";
### Save to CSV
$csv->print($out, [$acc, $scientific_name, $taxid, $lineage]);
sleep 1;
}
close $in and close $out;
print "\nDone! Results saved to $output_dir/taxonomy_lineage.csv\n";
' > ${workdir}/scripts/get_taxonomy_lineage.pl
chmod a=rwx ${workdir}/scripts/get_taxonomy_lineage.pl
NOACC=`cat ${input}'/viral_contigs_nrdb.txt' | wc -l`
echo "[INFO] Downloading taxonomy lineage for ${NOACC} accessions"
perl ${workdir}/scripts/get_taxonomy_lineage.pl ${input}'/viral_contigs_nrdb.txt' ${output}
chmod -R a=rwx ${output}
exit 0;
chmod a=rwx scripts/11.genbank_fetch.sh
bash scripts/11.genbank_fetch.sh
nano scripts/12.rsem.sh
#!/usr/bin/env bash
THR=5
ON='module miniconda && conda activate rsem'
eval $ON
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/results/09.blastn_filtered/viral_contigs_blastn.fasta'
output=${workdir}'/results/12.rsem'
reads=${workdir}'/data/fastq'
assemblies=${workdir}'/results/05.megahit'
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
function RUN_RSEM {
FOW=$1; reads_dir=$2; assembly_dir=$3; out_dir=$4; threads=$5
REV=`echo ${FOW} | sed -r 's,_(r|R)?1,_\12,'`
sample=`basename ${FOW} | cut -d '_' -f1`
contigs="${assembly_dir}/${sample}.contigs.fasta"
out_prefix="${out_dir}/${sample}.RSEM"
if [[ ! -f ${contigs} ]]; then
echo "[WARN] Missing contigs for ${sample}: ${contigs}"
return
fi
if [[ ! -f ${FOW} || ! -f ${REV} ]]; then
echo "[WARN] Missing reads for ${sample}"
return
fi
echo "[INFO] Running RSEM for ${sample}" && sleep 1s
### Run RSEM using Trinity utility
align_and_estimate_abundance.pl \
--transcripts ${contigs} \
--seqType fq \
--left ${FOW} \
--right ${REV} \
--est_method RSEM \
--aln_method bowtie2 \
--output_dir ${out_prefix} \
--thread_count ${threads} \
--prep_reference \
> ${out_prefix}'.log' 2>&1
echo "[DONE] ${sample}"
}
export -f RUN_RSEM
echo "[INFO] Starting RSEM quantification for samples..."
### check and run in parallel
PARALLEL_VER=(`parallel --version | grep 'GNU parallel'`)
if [[ ${PARALLEL_VER[2]} =~ [0-9]+ ]]; then
echo '[INFO] Starting parallel mode' && sleep 1s
ls -dl ${reads}/*.gz | awk '{print $9}' | grep -Ei '_r?1' | parallel -j ${THR} -n1 -I% "RUN_RSEM %" ${reads} ${assemblies} ${output} 1
else
echo '[INFO] Starting non-parallel mode' && sleep 1s
for FILE in $(ls -dl ${reads}/*.gz | awk '{print $9}' | grep -Ei '_r?1'); do
RUN_RSEM ${FILE} ${reads} ${assemblies} ${output} 1
done
fi
OFF='conda deactivate'
eval ${OFF}
chmod -R a=rwx ${output}
exit 0;
chmod a=rwx scripts/12.rsem.sh
bash scripts/12.rsem.sh
nano scripts/13.combine_tables.sh
#!/usr/bin/env bash
THR=5
ON='module R'
eval ${ON}
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input_taxa=${workdir}'/results/11.genbank_fetch/taxonomy_lineage.csv'
input_rsem=${workdir}'/results/12.rsem'
input_bltx=${workdir}'/results/10.diamond_blastx_blastn_contigs/viral_contigs_nrdb.blastx'
output=${workdir}'/results/13.combine_tables'
### make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
### generating the R script
### You need to have (or install) the following R packages:
### "dplyr", "readr", "stringr", "tidyverse"
echo '#!/usr/bin/env Rscript
rm(list=ls())
args = commandArgs(trailingOnly=TRUE)
### test if there are at least three arguments: if not, return an error
if (length(args) < 3) { stop("Expected 3 tables supplied", call.=FALSE) }
### ---- Input files ----
taxonomy_file <- args[1]
blast_file <- args[2]
rsem_dir <- args[3]
### Library check and activation
library("dplyr")
library("readr")
library("stringr")
library("tidyverse")
### ---- Read taxonomy table ----
### Assumes first column = Accession
taxonomy <- read_csv(taxonomy_file, show_col_types = FALSE)
colnames(taxonomy) <- c("Accession","Organism","TaxID","Lineage")
taxonomy <- taxonomy %>% distinct(Accession, .keep_all = TRUE)
### ---- Read BLASTX table ----
### BLASTX files are usually tab-separated
blast <- read_tsv(blast_file, col_names = FALSE, show_col_types = FALSE)
colnames(blast) <- c("Contig","query.len","Accession","seq.title","pident","length","evalue","bitscore")
blast <- blast %>%
  arrange(Contig, desc(pident), desc(length), desc(bitscore)) %>%
  group_by(Contig) %>%
  slice(1) %>% # take the best row per Contig
  ungroup()
blast$SAMPLE <- str_extract(blast$Contig, "SRR[0-9]+")
### ---- Read RSEM tables ----
rsem <- tibble()
for (rsem_file in list.files(rsem_dir, pattern = ".RSEM$")) {
  rsem <- rbind(rsem, read.table(paste(rsem_dir, rsem_file, "RSEM.genes.results", sep="/"),
                                 header = TRUE, sep = "\t", stringsAsFactors = FALSE))
}
rsem <- rsem[-1]
colnames(rsem) <- c("Contig","length","effective_length","expected_count","TPM","FPKM")
rsem$SAMPLE <- str_extract(rsem$Contig, "SRR[0-9]+")
### ---- Merging tables ----
merged <- blast %>% full_join(taxonomy, by = "Accession")
merged <- merged %>% full_join(rsem, by = c("Contig","SAMPLE"))
### ---- Renaming columns ----
colnames(merged) <- c("Contig","query.len","Accession","seq.title","pident","aln.length","evalue","bitscore","Sample","Organism","TaxID","Lineage","length.y","effective_length","expected_count","TPM","FPKM")
### ---- Sorting columns ----
merged <- merged %>% select(c("Sample","Contig","query.len","Accession","seq.title","pident","aln.length","evalue","bitscore","Organism","TaxID","Lineage","length.y","effective_length","expected_count","TPM","FPKM"))
### ---- Output one file per SRR ----
unique_SRRs <- unique(merged$Sample)
for (srr in unique_SRRs) {
  subset_df <- merged %>% filter(Sample == srr)
  ### written into the step-13 results directory (the script is run from the metagenomics working directory)
  outfile <- paste0("results/13.combine_tables/", srr, "_summary.txt")
  write_tsv(subset_df, outfile)
  message("Created: ", outfile)
}
' > ${output}/join_tab.R
chmod a=rwx ${output}/join_tab.R
Rscript --vanilla ${output}/join_tab.R ${input_taxa} ${input_bltx} \
`realpath ${input_rsem} 2>/dev/null`
echo '[INFO] Merging of tables finished'
exit 0;
chmod a=rwx scripts/13.combine_tables.sh
bash scripts/13.combine_tables.sh
nano scripts/14.filter_sort.sh
#!/usr/bin/env bash
### input and output directories
workdir=`realpath $(pwd) 2>/dev/null`
input=${workdir}'/results/13.combine_tables'
output=${workdir}'/results/14.filter_sort'
blastx=${workdir}'/results/10.diamond_blastx_blastn_contigs'
### Library IDs to iterate through
#IDs=(SRR31521267 SRR13765816 SRR8048112)
IDs=(`ls ${input}/*.txt | xargs -n1 basename | cut -d '_' -f1`)
### Make output directory if it doesn't exist
if ! [[ -d ${output} ]]; then mkdir -p -m a=rwx ${output}; fi
ON="module miniconda && conda activate seqkit 1>/dev/null 2>/dev/null"
eval "${ON}"
### Iterate through library IDs, the different sample types
for ID in ${IDs[@]}; do
### Sort the result table by abundance and size of contig
sort -t$'\t' -k 3rn -k 16rn ${input}/${ID}_summary.txt > ${output}/${ID}_summary_sorted.txt
### Filter out hits to phage (these hits are commonly discarded, but if phage is relevant to your aim, skip this step)
grep -v 'phage' ${output}/${ID}_summary_sorted.txt > ${output}/${ID}_summary_sorted_filtered.txt
### Get contigs for the filtered dataset (i.e. viral contigs without phage contigs)
awk '{print $2}' ${output}/${ID}_summary_sorted_filtered.txt > ${output}/${ID}_summary_sorted_filtered_ids.txt
seqkit grep -f ${output}/${ID}_summary_sorted_filtered_ids.txt ${blastx}/${ID}_viral_contigs.fasta > \
${output}/${ID}_summary_sorted_filtered.fasta
done
OFF='conda deactivate'
eval ${OFF}
chmod a=rwx scripts/14.filter_sort.sh
bash scripts/14.filter_sort.sh
The following are bash commands used to investigate the results and determine, for each hit, whether it represents a real virus (a true positive) or not.
Reminder: SRR8048112 is a fecal sample, SRR13765816 is a lip sample, and SRR31521267 is a negative control.
Check how many viral hits per library
wc -l ${input}/SRR*
Let's start by running the previous script. This will create new result files that collect all of the information and contigs we identified in previous steps, and sort the results by contig length and abundance values (from step 12, using RSEM).
We now have sorted tables to investigate the results of our pipeline. We will use the 'less' command to parse the results, the aim being to get a general idea of what is in there.
### variable for input dir
input=${workdir}'/results/14.filter_sort'
ID=SRR31521267 ### choose different SRR IDs to explore with the 'less' command
#ID=SRR13765816
#ID=SRR8048112
less ${input}/${ID}_summary_sorted_filtered.txt ### start with SRR31521267 - the negative control - what's in there?
### (Scroll with the arrow keys; press q to exit less)
One hit in SRR31521267 - the negative control - is to a human papillomavirus. Human viruses are common contaminants in virus discovery, and the presence of this hit in the negative control lets us know that hits for these viruses in the other samples can be discarded.
Let's check whether we can see the papillomavirus in the other libraries.
### check whether the potential contaminant identified in the negative control is also present in the other libraries
grep -i "papilloma" ${input}/SRR*_summary_sorted_filtered.txt
### consider whether this result would be real
### how similar is it to the blast hit? Is it likely a virus that makes sense in a Tasmanian devil library?
Proceed by examining the contigs: use cat, less or grep to view contigs, and copy and paste them into web tools. The following code shows how to extract a contig you are interested in.
cat ${input}/${ID}_summary_sorted_filtered.fasta
### grep a specific contig that you liked from the results summary table
contig="" ### add the contig name here
grep -A1 ${contig} ${input}/${ID}_summary_sorted_filtered.fasta
Once you have extracted a contig of interest, let's find ORFs. Expasy Translate: https://web.expasy.org/translate/
Next, if the ORFs look real for a contig, annotate the contig and determine which proteins are encoded. CDD search: https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
### Additional: if you find segmented viruses, try to find all the segments, e.g. Crimean-Congo hemorrhagic fever virus in SRR31521267.
### Use ICTV as a guide to determine the number of expected segments: https://ictv.global/report/genome
Search the blast results for all contigs from the relevant species, genus or family, e.g. Crimean-Congo hemorrhagic fever virus in SRR31521267.
### Crimean-Congo hemorrhagic fever virus species name
grep "Orthonairovirus haemorrhagiae" ${input}/SRR31521267_summary_sorted_filtered.txt
### Crimean-Congo hemorrhagic fever virus genus name
grep "Orthonairovirus" ${input}/SRR31521267_summary_sorted_filtered.txt
### Crimean-Congo hemorrhagic fever virus family name
grep "Nairoviridae" ${input}/SRR31521267_summary_sorted_filtered.txt
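Finally, to see at a glance which organisms dominate a summary table, you can tally the Organism column (column 10 in the step-13 column layout). A self-contained sketch on an invented three-row extract (the values are mock data, not workshop results):

```shell
# Count hits per organism from a tab-separated summary table (Organism = column 10)
tmp=$(mktemp)
printf 'S1\tc1\t500\tA1\tt\t90\t80\t1e-20\t200\tOrthonairovirus haemorrhagiae\n' >  "$tmp"
printf 'S1\tc2\t400\tA2\tt\t85\t70\t1e-10\t150\tOrthonairovirus haemorrhagiae\n' >> "$tmp"
printf 'S1\tc3\t300\tA3\tt\t80\t60\t1e-05\t100\tHuman papillomavirus\n'          >> "$tmp"
awk -F'\t' '{count[$10]++} END {for (o in count) print count[o], o}' "$tmp" | sort -rn
rm -f "$tmp"
```

On a real table, replace the mock file with ${input}/${ID}_summary_sorted_filtered.txt (skipping the header line if present).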
