GraphAllele

A graph-constrained pipeline for constructing standardized allele matrices in polyploid genomes.

Overview

GraphAllele integrates synteny-based graph clustering, tandem duplication filtering, orthogroup verification, and reference genome calibration into a unified, resumable pipeline for allele identification in complex polyploid genomes.

Chromosome naming convention required: Chr{NUM}{SUFFIX}, e.g., Chr1A, Chr1B, ..., Chr1K.

Features

Breakpoint-resumable: each step is checkpointed and skipped automatically on re-run; if the pipeline fails at any step, simply re-run the same command and it will resume from the last incomplete step.
Sequential processing: chromosome groups are processed one by one to protect HPC NFS I/O and ensure stable, readable logging.
Graph-constrained clustering: NetworkX-based extraction of syntenic connected components with configurable distance thresholds.
Tandem duplication filtering: self-BLASTP + graph-based tandem array blacklisting with adjustable gene distance.
Intra-group OrthoFinder with early termination: OrthoFinder is run independently per chromosome group; the process is killed immediately after Orthogroups.tsv is written to disk, skipping the MSA and tree-building phases to reduce runtime.
Reference calibration: TBLASTN-based anchoring to an external reference genome for cross-species annotation.
Standardized output: fixed-column allele matrix (A–N subgenomes) with globally unique cluster IDs.
Automated JCVI ortholog analysis: built-in shell script for pairwise synteny comparisons with auto-discovery of species from input files.
Global matrix normalization: post-processing module that produces both a cleaned global matrix and a k-mer–formatted cluster file for downstream analysis.

Pipeline Architecture

Input: Whole-genome GFF3 + FASTA + Reference CDS/GFF + Orthogroups.tsv (or --auto_og)
         |
         v
  Step 1: Prepare Data (01.prepare/)
          Split GFF/FASTA by chromosome group
          Extract CDS, PEP, BED per haplotype (gffread)
         |
         v
  Step 1.5 (optional): Intra-Group OrthoFinder (--auto_og)
          Run OrthoFinder per chromosome group
          Early termination after Orthogroups.tsv is written
         |
         v
  Step 2: Tandem Duplication Identification (02.tandem/)
          Self-BLASTP on merged PEP
          Graph-based tandem array detection (--tandem_dist)
         |
         v
  Step 3: JCVI Synteny Analysis (03.jcvi/)
          Pairwise LAST alignment via run_ortholog.sh
          MCScan anchor generation (.anchors files)
         |
         v
  Step 4: Graph Clustering (04.cluster.tsv)
          Build synteny graph from .anchors
          Filter tandem blacklist
          Extract connected components as candidate allele clusters
         |
         v
  Step 5: OrthoGroup Verification (05.verified.tsv)
          Validate clusters against OrthoFinder orthogroups
          Output verified and rejected tables separately
         |
         v
  Step 6: Sequence Homology Expansion (06.expanded/)
          BLASTP-based refinement of allele assignments
         |
         v
  Step 7: Reference Calibration (07.FINAL_ALLELE.tsv)
          TBLASTN anchoring to reference CDS
          Annotate each cluster with Ref_Gene and Ref_Locus
         |
         v
  Post-processing: Global Merge & Normalization
          Merge all chromosome groups into PolyAlleler_Global_Matrix.tsv
          Normalize into PolyAlleler_Global_Matrix_Cleaned.tsv (fixed A–N columns)
          Generate my_clusters.tsv (k-mer format)

Installation

Conda (Recommended)

git clone https://github.com/GengruiZhu/GraphAllele.git
cd GraphAllele

conda env create -f environment.yml
conda activate polyalleler

Manual Installation

See INSTALL.md for step-by-step instructions.

Quick Start

Basic run with a pre-computed Orthogroups file

cd GraphAllele/workflow

python GraphAllele.py \
  -g /path/to/genome.gff3 \
  -f /path/to/genome.fasta \
  -ref_g /path/to/reference.gff3 \
  -ref_f /path/to/reference.cds \
  -og /path/to/Orthogroups.tsv \
  -s 1 -e 5 \
  -t 20 \
  --sub_list A,B,C,D,E

Run with automatic OrthoFinder (no pre-computed file needed)

python GraphAllele.py \
  -g /path/to/genome.gff3 \
  -f /path/to/genome.fasta \
  -ref_g /path/to/reference.gff3 \
  -ref_f /path/to/reference.cds \
  --auto_og \
  -s 1 -e 10 \
  -t 30 \
  --min_c 2 \
  --cluster_dist 100 \
  --sub_list A,B,C,D,E,F,G,H,I,J,K,L,M,N

PBS Job Submission

# Edit workflow/run_GraphAllele.sh to set your paths and resource requirements, then:
qsub workflow/run_GraphAllele.sh

Example PBS script (workflow/run_GraphAllele.sh):

#!/bin/bash
#PBS -N GraphAllele_ZG
#PBS -l nodes=1:ppn=30
#PBS -q comput
#PBS -l mem=
#PBS -o Allele.log
#PBS -j oe

cd $PBS_O_WORKDIR
date -R
source ~/miniconda3/bin/activate
source activate polyalleler

python GraphAllele.py \
  -g ../data/ZG.gff3 \
  -f ../data/ZG.fasta \
  -ref_g ../data/Eru.gff3 \
  -ref_f ../data/Eru.cds \
  --auto_og \
  -s 1 -e 10 \
  -t 30 \
  -o ZG_100 \
  --min_c 2 \
  --cluster_dist 100 \
  --sub_list A,B,C,D,E,F,G,H,I,J,K,L,M,N

Usage

python GraphAllele.py [OPTIONS]

Required Arguments

Argument	Description
`-g / --gff`	Whole-genome GFF3 annotation file. Must use `Chr{NUM}{SUFFIX}` naming convention.
`-f / --fasta`	Whole-genome FASTA sequence file corresponding to the GFF3.
`-ref_g / --ref_gff`	Reference genome GFF3 annotation (used for calibration in Step 7).
`-ref_f / --ref_cds`	Reference genome CDS FASTA (used for TBLASTN in Step 7).

Optional Arguments

Argument	Default	Description
`-og / --orthogroups`	`None`	Path to a pre-computed OrthoFinder `Orthogroups.tsv`. If provided, the pipeline uses this file for orthogroup verification (Step 5). Mutually exclusive in spirit with `--auto_og`, though both can be set (pre-computed file takes priority if the auto-generated one does not exist).
`--auto_og`	`False`	Run OrthoFinder automatically per chromosome group using only the haplotype PEP files for that group. The OrthoFinder process is terminated immediately after `Orthogroups.tsv` is written, skipping MSA and tree-building. Recommended when no pre-computed orthogroups are available.
`-s / --start`	`1`	Start chromosome group number. For example, `-s 1` means beginning from `Group_Chr01`.
`-e / --end`	`10`	End chromosome group number. For example, `-e 10` means processing through `Group_Chr10`.
`-t / --threads`	`10`	Total CPU threads available. Used by BLAST, OrthoFinder, and other parallelizable steps.
`-o / --outdir`	`standardized_results`	Output directory. All intermediate and final results are stored here.
`--sub_list`	`A,B,...,N`	Comma-separated haplotype suffix list. Defines the column order in the output allele matrix. Must match the suffix letters in your chromosome names.
`--min_c`	`3`	Minimum number of distinct haplotype chromosomes a cluster must span to be retained. Lower values (e.g., 2) are more permissive and suitable for genomes with high gene loss.
`--tandem_dist`	`5`	Maximum gene index distance for tandem duplicate detection (MCScanX convention). Two genes within this distance on the same chromosome with high sequence similarity are flagged as tandem duplicates.
`--cluster_dist`	`30`	Maximum gene index distance for synteny graph clustering. Increasing this value (e.g., to 100) allows more distant syntenic gene pairs to be grouped together, which can be useful for genomes with many structural rearrangements.

Parameter Guidance

--sub_list: must exactly match the suffix letters used in your chromosome names. For example, if your chromosomes are named Chr1A through Chr1K, use --sub_list A,B,C,D,E,F,G,H,I,J,K. The order of suffixes determines column order in the output matrix.
--min_c: a cluster must span at least this many distinct haplotype chromosomes to be retained. For highly fragmented or recently diverged polyploids, consider lowering to 2.
--auto_og: when enabled, OrthoFinder is run once per chromosome group using only the haplotype PEP files for that group. The process is terminated automatically after Orthogroups.tsv is written, before MSA and tree steps. It is still recommended to supply a pre-computed -og file when one is available, as it tends to be more comprehensive.
--cluster_dist: the default value of 30 works well for most plant genomes. For genomes with extensive rearrangements (e.g., sugarcane), values of 50–100 may produce better results.
--tandem_dist: the default of 5 follows the MCScanX convention. Increase to 10 if you suspect many dispersed tandem arrays are being missed.

Output Description

standardized_results/
├── Group_Chr01/
│   ├── 01.prepare/                    # Per-haplotype GFF, FASTA, CDS, BED, PEP
│   ├── 01.5.OrthoFinder_Intra/        # Intra-group OrthoFinder results (--auto_og only)
│   ├── 02.tandem/                     # Merged GFF/PEP + tandem blacklist
│   ├── 03.jcvi/                       # Pairwise .anchors files
│   ├── 04.cluster.tsv                 # Raw synteny clusters
│   ├── 05.verified.tsv                # OrthoGroup-validated clusters
│   ├── 05.verified_rejected.tsv       # Clusters failing OG verification
│   ├── 06.expanded_expanded.tsv       # BLAST-refined allele table
│   └── 07.FINAL_ALLELE.tsv            # Final allele matrix with reference anchors
├── Group_Chr02/
│   └── ...
├── PolyAlleler_Global_Matrix.tsv       # Genome-wide merged allele matrix (raw)
├── PolyAlleler_Global_Matrix_Cleaned.tsv  # Normalized matrix with fixed A–N columns
└── my_clusters.tsv                     # K-mer format cluster file

`07.FINAL_ALLELE.tsv` column format

ClusterID	Ref_Gene	Ref_Locus	Chr01A	Chr01B	...
Global_Cluster_000001	Gene001	Chr1:1000-5000(+)	gene_A	gene_B	...

Each row is one allele group. - or NA indicates no gene assigned for that haplotype in this cluster.

`PolyAlleler_Global_Matrix_Cleaned.tsv`

A post-processed version of the global matrix with standardized columns (Allele_A through Allele_N) and globally re-numbered cluster IDs, produced by the normalization module.

`my_clusters.tsv`

A two-column file: ClusterID and a comma-separated list of all gene members in that cluster. Useful for downstream k-mer or expression analyses.

Dependencies

Python Packages

Package	Version
Python	>= 3.8
Biopython	>= 1.79
pandas	>= 1.3
networkx	>= 2.6

External Tools

Tool	Purpose
BLAST+	Sequence similarity search (BLASTP, TBLASTN, makeblastdb)
gffread	CDS/protein extraction from GFF
JCVI/MCScan	Synteny analysis
LAST	Pairwise sequence alignment
OrthoFinder (optional)	Orthogroup inference (required if using `--auto_og`)

Changelog

See CHANGELOG.md for a detailed version history.

v1.1 (2025-04-02)

Added automated JCVI ortholog shell script (bin/run_ortholog.sh) with species auto-discovery
Added global matrix normalization module (generate_clean_clusters_auto) producing cleaned matrix and k-mer cluster file
Improved global merge logic using pandas with re-numbered ClusterIDs
Added --protein flag to prepare_jcvi.py for protein sequence extraction
Enhanced error handling with [FATAL ERROR] / [ACTION] messages and resume guidance
Updated PBS example script with real-world parameters

v1.0

Initial release

Citation

Manuscript in preparation. Citation will be added upon publication.

License

This project is licensed under a Non-Commercial Research License. Free to use for academic and research purposes only. Commercial use is strictly prohibited. See the LICENSE file for details.

Contact

Developers: Gengrui Zhu, Yi Chen
Issues: please use the GitHub Issues page for bug reports and questions.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
bin		bin
data		data
workflow		workflow
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
INSTALL.md		INSTALL.md
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml

Folders and files

Latest commit

History

Repository files navigation

GraphAllele

Table of Contents

Overview

Features

Pipeline Architecture

Installation

Conda (Recommended)

Manual Installation

Quick Start

Basic run with a pre-computed Orthogroups file

Run with automatic OrthoFinder (no pre-computed file needed)

PBS Job Submission

Usage

Required Arguments

Optional Arguments

Parameter Guidance

Output Description

07.FINAL_ALLELE.tsv column format

PolyAlleler_Global_Matrix_Cleaned.tsv

my_clusters.tsv

Dependencies

Python Packages

External Tools

Changelog

v1.1 (2025-04-02)

v1.0

Citation

License

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`07.FINAL_ALLELE.tsv` column format

`PolyAlleler_Global_Matrix_Cleaned.tsv`

`my_clusters.tsv`

Packages