HapScafFlow

Haplotype-resolved scaffolding with 3D-DNA, producing whole-genome or chromosome-level assemblies.

Overview

HapScafFlow is a Snakemake workflow that scaffolds individual haplotypes with 3D-DNA, assesses genome completeness with BUSCO, generates final whole-genome or chromosome-level assemblies, and searches individual chromosomes for telomeric repeats with tidk. The pipeline was designed to scaffold the output of Toulbar2, which assigns contigs to new haplotype numbers based on protein alignment. The original assembly comes from Hifiasm.

Workflow Execution and Manual Curation

  1. First Submission: On the first submission, the pipeline stops with an error after 3D-DNA scaffolding. This is expected: the run cannot continue until the .review files from step 2 exist.
  2. Manual Curation: Open the genome in Juicebox and export it to create the .review files.
  3. Resubmission: Once the .review files exist, re-run the pipeline without changes. It then continues until completion.
  4. Iterative Manual Curation: The pipeline can be resubmitted after each update to the .review files.

Warning: The pipeline does not back up .review files or track curation rounds. Users should manually save previous versions if needed.
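
Since the pipeline keeps no history of curation rounds, one option is to copy the review files into a timestamped folder before each resubmission. The sketch below is only an illustration: the function name, the `review_backups` folder, and the `*.review*` file pattern are assumptions, not part of HapScafFlow, so adjust them to where Juicebox actually saved your files.

```shell
# Sketch only. ASSUMPTIONS: curation files sit somewhere under the given
# directory and carry ".review" in their file name; "review_backups" is a
# hypothetical folder name, not something the pipeline creates or reads.
backup_review_files() (
    src_dir=$1
    backup_root=${2:-$src_dir/review_backups}
    dest=$backup_root/$(date +%Y%m%d_%H%M%S)
    mkdir -p "$dest"
    # Copy every matching file, keeping its sub-directory, and skip the
    # backup folder itself so earlier backups are not copied again.
    find "$src_dir" -path "$backup_root" -prune -o \
        -type f -name '*.review*' -exec sh -c '
            src_dir=$1; dest=$2; shift 2
            for f in "$@"; do
                rel=${f#"$src_dir"/}
                mkdir -p "$dest/$(dirname "$rel")"
                cp -p "$f" "$dest/$rel"
            done' sh "$src_dir" "$dest" {} +
    # Print the backup location so the caller can log it.
    echo "$dest"
)
```

A possible use is to call `backup_review_files /path/to/fasta/files` right before `sbatch run_snake.sh`, which leaves one dated snapshot per curation round.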

Input Requirements

The following input files are required:

  • hap_{n}.fasta – Haplotype-specific assembly files
  • hap_{n}.fasta.length – Corresponding length files
  • merged_nodups.txt – Hi-C reads aligned to the whole-genome assembly with Juicer
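
Before submitting, it can help to confirm that every haplotype FASTA has its length file and that the Hi-C alignment file is present. The check below is a minimal sketch: `check_inputs` is a hypothetical helper, and it assumes all three kinds of input sit in the same directory, which the README does not state explicitly.

```shell
# Sketch only. ASSUMPTION: hap_{n}.fasta, hap_{n}.fasta.length and
# merged_nodups.txt all live in the directory passed as the first argument.
check_inputs() (
    dir=$1
    status=0
    found=0
    for fasta in "$dir"/hap_*.fasta; do
        [ -e "$fasta" ] || continue          # the glob matched nothing
        found=1
        if [ ! -s "$fasta.length" ]; then
            echo "missing or empty: $fasta.length" >&2
            status=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no hap_{n}.fasta files found in $dir" >&2
        status=1
    fi
    if [ ! -s "$dir/merged_nodups.txt" ]; then
        echo "missing or empty: $dir/merged_nodups.txt" >&2
        status=1
    fi
    # Exit status 0 means every expected input was found.
    return "$status"
)
```

For example, `check_inputs /path/to/fasta/files && echo "inputs look complete"` prints the confirmation only when nothing is missing.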

Configuration

Pipeline configuration variables are set in run_snake.sh.

Defining Genome Structure

Set the number of expected chromosomes:

export NUM_CHRS=8

The number of haplotypes is determined automatically as the highest haplotype number found in the input FASTA file names:

export NUM_HAP=$(ls $PATH_TO_FASTA/hap_*.fasta 2>/dev/null | grep -oP '(?<=/hap_)\d+' | sort -nr | head -n1)
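
To see what this command returns, the sketch below runs it on a throwaway directory with made-up file names. It assumes GNU grep, since `-P` needs Perl-compatible regex support, and it quotes `$PATH_TO_FASTA` as a small safety measure. Otherwise it is the same pipeline.

```shell
# Demo in a temporary directory; the file names are illustrative only.
PATH_TO_FASTA="$(mktemp -d)"
touch "$PATH_TO_FASTA/hap_1.fasta" "$PATH_TO_FASTA/hap_2.fasta" "$PATH_TO_FASTA/hap_10.fasta"
touch "$PATH_TO_FASTA/hap_1.fasta.length"   # not matched by the hap_*.fasta glob

# Extract the number after "/hap_" from each path, sort numerically in
# descending order, and keep the first (largest) value.
NUM_HAP=$(ls "$PATH_TO_FASTA"/hap_*.fasta 2>/dev/null | grep -oP '(?<=/hap_)\d+' | sort -nr | head -n1)
export NUM_HAP
echo "NUM_HAP=$NUM_HAP"   # prints NUM_HAP=10
```

Note that the result is the largest number in the file names, not a count of files: the three FASTA files above yield 10. With haplotypes numbered consecutively from 1, the two coincide.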

Species-specific parameters

BUSCO lineage

Set the BUSCO lineage of your organism. By default, the pipeline uses the .odb10 version of the BUSCO dataset.

Telomere repeat motif

Once the chromosome-level assembly is built, the chromosomes are searched for telomeric repeats and the highest hits are plotted. Set the telomeric repeat motif of your organism.
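
When reading the telomere hits, keep in mind that on an assembled chromosome the repeat can appear in its reverse-complement orientation at one of the two ends. The helper below is a small sketch for deriving that second form from the motif. `TELOMERE_MOTIF` and `revcomp` are hypothetical names used for illustration (check run_snake.sh for the real variable), and `TTTAGGG` is just an example, the Arabidopsis-type plant repeat.

```shell
# Sketch only. HYPOTHETICAL names: TELOMERE_MOTIF and revcomp are not taken
# from the pipeline; the motif below is an example value.
TELOMERE_MOTIF="TTTAGGG"

# Reverse the string, then swap each base for its complement.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

echo "motif:              $TELOMERE_MOTIF"
echo "reverse complement: $(revcomp "$TELOMERE_MOTIF")"   # CCCTAAA
```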

Snakemake and Dependencies

This pipeline is designed for Snakemake 7.20.0 and runs on a Slurm cluster (preconfigured in config.yaml).

Required Modules

The pipeline requires the following software modules, which are loaded within individual scripts using module load:

module load devel/Miniconda/Miniconda3
module load bioinfo/BUSCO/5.4.7
module load bioinfo/LASTZ/1.04.22 devel/python/Python-3.6.3
module load bioinfo/3D-DNA/529ccf4
module load bioinfo/Seqtk/1.3
module load bioinfo/bgzip/1.18
module load bioinfo/samtools/1.19
module load bioinfo/assemblathon2/d1f044b
module load bioinfo/tidk/0.2.63
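
Module names and versions differ between clusters, so the lines above may need adapting. After loading your site's equivalents, a quick check that the main executables are reachable could look like the sketch below. It assumes each tool ends up on `PATH` under its usual command name. 3D-DNA and the assemblathon scripts are left out because their entry points depend on the installation. `missing_tools` is a hypothetical helper.

```shell
# Sketch only. ASSUMPTION: after `module load`, each tool is on PATH under
# the command name listed below.
missing_tools() {
    for tool in "$@"; do
        # Print the name of every command that cannot be found.
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools busco lastz seqtk bgzip samtools tidk)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing >&2
fi
```

Running this inside the same job environment as the pipeline scripts gives the most reliable answer, because modules loaded in a login shell are not necessarily available in a batch job.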

Installing the Pipeline

To install and configure the pipeline:

  1. Navigate to the directory containing the FASTA files:
cd /path/to/fasta/files
  2. Clone the repository:
git clone https://github.com/adlnosk/HapScafFlow.git
cd HapScafFlow
  3. Modify run_snake.sh according to your dataset.

Running the Workflow

To launch the Snakemake pipeline, ensure all required files are in place and execute:

sbatch run_snake.sh

Output

  • Output files are stored next to the input FASTA files.

The pipeline generates:

  • Scaffolds for each haplotype (in q0 or q1_3D_DNA_HAP$n)
  • BUSCO scores to assess completeness (in BUSCO/busco_summaries)
  • Final whole-genome and chromosome-level assemblies (in FINALS/ and FINALS/chrs/)
  • Assemblathon statistics (FINALS/whole_genome.fasta.gz.assemblahon_stats)
  • Telomeric repeats plot (in TELOMERES/)
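
After a run finishes, the listed locations can be checked in one pass. The sketch below uses the directory names from the list above and assumes they are relative to the directory that holds the input FASTA files. `report_outputs` is a hypothetical helper, and the haplotype scaffold folders are left out because their exact names depend on the run.

```shell
# Sketch only. ASSUMPTION: output directories are created directly inside
# the directory that holds the input FASTA files.
report_outputs() (
    base=$1
    status=0
    for path in BUSCO/busco_summaries FINALS FINALS/chrs TELOMERES; do
        if [ -e "$base/$path" ]; then
            echo "found    $path"
        else
            echo "missing  $path"
            status=1
        fi
    done
    # Exit status 0 means every listed location exists.
    return "$status"
)
```

For example, `report_outputs /path/to/fasta/files` prints one line per location. A missing FINALS/ after the first submission is consistent with the stop after 3D-DNA scaffolding described earlier.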

About

Haplotype-resolved de novo assembly of autopolyploid species from HiFi and Hi-C reads.
