HapScafFlow

Haplotype-resolved scaffolding with 3D-DNA, producing whole-genome or chromosome-level assemblies.

Overview

HapScafFlow is a Snakemake workflow that scaffolds individual haplotypes with 3D-DNA, assesses genome completeness with BUSCO, generates final whole-genome or chromosome-level assemblies, and searches individual chromosomes for telomeric repeats with tidk. The pipeline was designed to scaffold the output of Toulbar2, which assigns contigs to new haplotype numbers based on protein alignment. The original assembly comes from Hifiasm.

Workflow Execution and Manual Curation

First Submission: On its first submission, the pipeline crashes after 3D-DNA scaffolding. This is expected, because manual curation is required before it can continue.
Manual Curation: Open the scaffolded genome in Juicebox and export it to create the .review files.
Resubmission: Once the .review files exist, re-run the pipeline without changes. It will then run to completion.
Iterative Manual Curation: The pipeline can be resubmitted after each update of the .review files.

Warning: The pipeline does not back up .review files or track curation rounds. Users should manually save previous versions if needed.
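Since the pipeline keeps no history of curation rounds, one option is to copy the current review files into a timestamped folder before each resubmission. The sketch below is only an illustration and not part of HapScafFlow: the function name `backup_reviews`, the `review_backups` folder, and the assumption that review files match `*.review*` somewhere under the working directory are all hypothetical.

```shell
# Copy every file whose name contains ".review" into a timestamped backup folder.
# Assumptions (not from the pipeline): function name, backup location, file pattern.
backup_reviews() (
    src_dir="${1:-.}"
    src_dir="${src_dir%/}"
    dest="$src_dir/review_backups/$(date +%Y%m%d_%H%M%S)"
    mkdir -p "$dest" || exit 1
    # List review files relative to src_dir, skipping earlier backups.
    (cd "$src_dir" && find . -type f -name '*.review*' ! -path './review_backups/*') |
    while IFS= read -r f; do
        # Recreate the relative sub-folder so equally named files do not collide.
        mkdir -p "$dest/$(dirname "$f")" && cp -p "$src_dir/$f" "$dest/$f"
    done
    # Print the backup location so the caller can log it.
    echo "$dest"
)
```

Calling `backup_reviews /path/to/fasta/files` before each `sbatch run_snake.sh` would then keep one folder per curation round.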

Input Requirements

The following input files are required:

hap_{n}.fasta – Haplotype-specific assembly files
hap_{n}.fasta.length – Corresponding length files
merged_nodups.txt – Hi-C reads aligned to the whole-genome assembly with Juicer
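Before submitting, it can help to confirm that every listed input is present and non-empty. The following is a minimal sketch under stated assumptions: the function name `check_inputs` is hypothetical, haplotypes are assumed to be numbered from 1, and all three kinds of file are assumed to sit in the same directory, which the README does not state for merged_nodups.txt.

```shell
# Report which required inputs are missing or empty for haplotypes 1..N.
# Assumptions (not from the pipeline): function name, single input directory.
check_inputs() (
    dir="${1:-.}"
    num_hap="$2"
    missing=0
    i=1
    while [ "$i" -le "$num_hap" ]; do
        for f in "hap_${i}.fasta" "hap_${i}.fasta.length"; do
            if [ ! -s "$dir/$f" ]; then
                echo "missing: $f"
                missing=1
            fi
        done
        i=$((i + 1))
    done
    if [ ! -s "$dir/merged_nodups.txt" ]; then
        echo "missing: merged_nodups.txt"
        missing=1
    fi
    # Exit status is 0 only when every file was found.
    exit "$missing"
)
```

For example, `check_inputs /path/to/fasta/files 2` checks both haplotypes of a two-haplotype dataset and returns a non-zero status if anything is absent.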

Configuration

Pipeline configuration variables are set in run_snake.sh.

Defining Genome Structure

Set the number of expected chromosomes:

export NUM_CHRS=8

The number of haplotypes is determined automatically as the highest number found in the input FASTA file names:

export NUM_HAP=$(ls $PATH_TO_FASTA/hap_*.fasta 2>/dev/null | grep -oP '(?<=/hap_)\d+' | sort -nr | head -n1)
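Because the command above takes the highest index it finds, a missing file (for example hap_2.fasta absent while hap_3.fasta exists) would still yield NUM_HAP=3. A small check for such gaps could look like the sketch below; the function name `check_hap_numbering` is hypothetical and the check is not part of the pipeline.

```shell
# Print the highest haplotype index and warn about any index without a FASTA file.
# Assumption (not from the pipeline): haplotypes are numbered consecutively from 1.
check_hap_numbering() (
    dir="${1:-.}"
    max=0
    for f in "$dir"/hap_*.fasta; do
        [ -e "$f" ] || continue
        # Reduce ".../hap_3.fasta" to "3".
        n="${f##*/hap_}"
        n="${n%.fasta}"
        case "$n" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$n" -gt "$max" ]; then max="$n"; fi
    done
    status=0
    i=1
    while [ "$i" -le "$max" ]; do
        if [ ! -e "$dir/hap_${i}.fasta" ]; then
            echo "gap: hap_${i}.fasta not found" >&2
            status=1
        fi
        i=$((i + 1))
    done
    echo "$max"
    exit "$status"
)
```

The printed value should match NUM_HAP, and a non-zero exit status signals that the numbering has holes.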

Species-specific parameters

BUSCO lineage

Set the BUSCO lineage of your organism. By default, the pipeline uses the .odb10 version of the BUSCO dataset.

Telomere repeat motif

After the chromosome-level assembly is produced, the chromosomes are searched for telomeric repeats and a plot of the highest hits is generated. Set the telomeric repeat motif of your organism.
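A mistyped motif would only show up late, after scaffolding and curation, so a quick sanity check before submission may be worthwhile. The sketch below assumes the motif is given as a plain DNA string such as TTTAGGG (common in plants) or TTAGGG (vertebrates); the function name `is_valid_motif` is hypothetical, and the name of the variable holding the motif in run_snake.sh is not given here.

```shell
# Succeed only if the motif is a non-empty string made of A, C, G and T.
# Assumption (not from the pipeline): the motif is passed as a plain DNA string.
is_valid_motif() {
    case "$1" in
        ''|*[!ACGTacgt]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

For instance, `is_valid_motif TTTAGGG || echo "check the telomere motif"` stays silent for a well-formed motif.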

Snakemake and Dependencies

This pipeline is designed for Snakemake 7.20.0 and runs on a Slurm cluster (preconfigured in config.yaml).

Required Modules

The pipeline requires the following software modules, which are loaded within individual scripts using module load:

module load devel/Miniconda/Miniconda3
module load bioinfo/BUSCO/5.4.7
module load bioinfo/LASTZ/1.04.22 devel/python/Python-3.6.3
module load bioinfo/3D-DNA/529ccf4
module load bioinfo/Seqtk/1.3
module load bioinfo/bgzip/1.18
module load bioinfo/samtools/1.19
module load bioinfo/assemblathon2/d1f044b
module load bioinfo/tidk/0.2.63
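Module names and versions differ between clusters, so it may be useful to confirm that the expected executables are reachable once the modules are loaded. The helper below is a sketch, not part of the pipeline: `missing_tools` is a hypothetical name, and the executable names in the usage note are assumed to be the ones these modules provide on your system.

```shell
# Print each named executable that is not found on PATH; fail if any is missing.
# Assumption (not from the pipeline): function name and the tool names passed in.
missing_tools() (
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    exit "$status"
)
```

After the `module load` lines, a call such as `missing_tools busco lastz seqtk bgzip samtools tidk` would list whatever is still unavailable.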

Installing the Pipeline

To install and configure the pipeline:

Navigate to the directory containing the FASTA files:

cd /path/to/fasta/files

Clone the repository:

git clone https://github.com/adlnosk/HapScafFlow.git
cd HapScafFlow

Modify run_snake.sh according to your dataset.

Running the Workflow

To launch the Snakemake pipeline, ensure all required files are in place and execute:

sbatch run_snake.sh

Output

Output files are stored next to the input FASTA files.

The pipeline generates:

Scaffolds for each haplotype (in q0 or q1_3D_DNA_HAP$n)
BUSCO scores to assess completeness (in BUSCO/busco_summaries)
Final whole-genome and chromosome-level assemblies (in FINALS/ and FINALS/chrs/)
Assemblathon statistics (/FINALS/whole_genome.fasta.gz.assemblahon_stats)
Telomeric repeats plot (/TELOMERES/)
