Haplotype-resolved scaffolding with 3D-DNA, producing whole-genome or chromosome-level assemblies.
HapScafFlow is a Snakemake workflow that scaffolds individual haplotypes using 3D-DNA, assesses genome completeness with BUSCO, generates final whole-genome or chromosome-level assemblies, and searches individual chromosomes for telomeric repeats with tidk. The pipeline was designed to scaffold the output of Toulbar2, which assigns contigs to new haplotype numbers based on protein alignment. The original assembly comes from Hifiasm.
- First Submission: On its first run, the pipeline will crash after 3D-DNA scaffolding. This is expected, because the .review files needed by the later steps do not exist yet.
- Manual Curation: The user must open Juicebox and export the genome to create .review files.
- Resubmission: After review file generation, re-run the pipeline without changes. It will continue until completion.
- Iterative Manual Curation: After each review file update, the pipeline can be resubmitted.
Warning: The pipeline does not back up .review files or track curation rounds. Users should manually save previous versions if needed.
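Because nothing is backed up automatically, one option is to copy the review files into a timestamped folder before each resubmission. The function below is a minimal sketch, not part of HapScafFlow: the `*.review*` file pattern, the `review_backups` folder name, and the function name `backup_review_files` are all assumptions to adapt to your own layout.

```shell
#!/usr/bin/env bash
# Sketch only: back up review files before resubmitting the pipeline.
# ASSUMPTIONS (not defined by HapScafFlow): review files match "*.review*",
# and backups go to "<dir>/review_backups/round_<timestamp>/".
backup_review_files() {
    local src_dir="${1%/}"                       # strip a trailing slash
    local backup_root="${2:-$src_dir/review_backups}"
    local dest
    dest="$backup_root/round_$(date +%Y%m%d_%H%M%S)"
    mkdir -p "$dest"

    # -prune skips the backup folder so earlier backups are not copied again.
    find "$src_dir" -path "$backup_root" -prune -o \
         -type f -name '*.review*' -print0 |
    while IFS= read -r -d '' f; do
        # Keep the path relative to src_dir so files with the same name
        # from different haplotype folders do not overwrite each other.
        local rel="${f#"$src_dir"/}"
        mkdir -p "$dest/$(dirname "$rel")"
        cp -p "$f" "$dest/$rel"
    done

    echo "$dest"                                 # print where the copies went
}
```

Calling `backup_review_files /path/to/fasta/files` before `sbatch run_snake.sh` would then keep one folder per curation round.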
The following input files are required:
- hap_{n}.fasta: Haplotype-specific assembly files
- hap_{n}.fasta.length: Corresponding length files
- merged_nodups.txt: Hi-C reads aligned to the whole-genome assembly with Juicer
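A quick pre-flight check can catch a missing input before a job is queued. The sketch below assumes that all three kinds of input sit in the same directory, which is an assumption rather than something the pipeline documents. The function name `check_inputs` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: verify that the required inputs exist and are non-empty.
# ASSUMPTION: FASTA files, .length files and merged_nodups.txt share one directory.
check_inputs() {
    local dir="${1%/}"
    local missing=0 found=0 fasta

    for fasta in "$dir"/hap_*.fasta; do
        [ -e "$fasta" ] || continue              # glob matched nothing
        found=1
        if [ ! -s "$fasta.length" ]; then
            echo "missing or empty length file: $fasta.length" >&2
            missing=1
        fi
    done

    if [ "$found" -eq 0 ]; then
        echo "no hap_*.fasta files found in $dir" >&2
        missing=1
    fi

    if [ ! -s "$dir/merged_nodups.txt" ]; then
        echo "missing or empty: $dir/merged_nodups.txt" >&2
        missing=1
    fi

    return "$missing"                            # 0 = all inputs present
}
```

For example, `check_inputs /path/to/fasta/files && sbatch run_snake.sh` would submit the job only when every input is in place.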
Pipeline configuration variables are set in run_snake.sh.
Set the number of expected chromosomes:
export NUM_CHRS=8
The number of haplotypes is determined automatically from the highest haplotype number among the input FASTA file names:
export NUM_HAP=$(ls $PATH_TO_FASTA/hap_*.fasta 2>/dev/null | grep -oP '(?<=/hap_)\d+' | sort -nr | head -n1)
Set the lineage of your organism. By default, the pipeline uses the .odb10 version of the BUSCO dataset.
After the chromosome-level assembly is produced, the chromosomes are searched for telomeric repeats and a plot of the highest hits is produced. Set the telomeric repeat of your organism.
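To see what the NUM_HAP expression in run_snake.sh evaluates to, it can be tried on a scratch directory. The sketch below wraps the same command chain in a function. The function name `highest_haplotype_number` and the scratch files are illustrative only, and `grep -P` requires GNU grep.

```shell
#!/usr/bin/env bash
# Sketch only: the NUM_HAP command chain from run_snake.sh as a function.
# It returns the HIGHEST number found in hap_<n>.fasta file names, so a gap
# in the numbering (e.g. hap_1 and hap_3 only) still yields 3, not 2.
highest_haplotype_number() {
    local path_to_fasta="${1%/}"
    ls "$path_to_fasta"/hap_*.fasta 2>/dev/null |
        grep -oP '(?<=/hap_)\d+' |   # keep only the digits after "/hap_"
        sort -nr |                   # numeric sort, largest first
        head -n1
}

# Illustrative run on empty placeholder files.
demo_dir="$(mktemp -d)"
touch "$demo_dir/hap_1.fasta" "$demo_dir/hap_2.fasta" "$demo_dir/hap_10.fasta"
highest_haplotype_number "$demo_dir"   # prints 10
rm -rf "$demo_dir"
```

The numeric sort matters here: a plain lexical sort would rank hap_2 above hap_10.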
This pipeline is designed for Snakemake 7.20.0 and runs on a Slurm cluster (preconfigured in config.yaml).
The pipeline requires the following software modules, which are loaded within individual scripts using module load:
module load devel/Miniconda/Miniconda3
module load bioinfo/BUSCO/5.4.7
module load bioinfo/LASTZ/1.04.22 devel/python/Python-3.6.3
module load bioinfo/3D-DNA/529ccf4
module load bioinfo/Seqtk/1.3
module load bioinfo/bgzip/1.18
module load bioinfo/samtools/1.19
module load bioinfo/assemblathon2/d1f044b
module load bioinfo/tidk/0.2.63
To install and configure the pipeline:
- Navigate to the directory containing the FASTA files:
  cd /path/to/fasta/files
- Clone the repository:
  git clone https://github.com/adlnosk/HapScafFlow.git
  cd HapScafFlow
- Modify run_snake.sh according to your dataset.
To launch the Snakemake pipeline, ensure all required files are in place and execute:
sbatch run_snake.sh
Output files are stored next to the input FASTA files.
The pipeline generates:
- Scaffolds for each haplotype (in q0orq1_3D_DNA_HAP$n)
- BUSCO scores to assess completeness (in BUSCO/busco_summaries)
- Final whole-genome and chromosome-level assemblies (in FINALS/ and FINALS/chrs/)
- Assemblathon statistics (/FINALS/whole_genome.fasta.gz.assemblahon_stats)
- Telomeric repeats plot (/TELOMERES/)