Skip to content

Latest commit

 

History

History
376 lines (133 loc) · 4.81 KB

File metadata and controls

376 lines (133 loc) · 4.81 KB

# Installation Guide

This document provides detailed installation instructions for GraphAllele and all its dependencies.

---

## System Requirements

| Component | Minimum | Recommended |

|---|---|---|

| OS | Linux (any major distro) | CentOS 7 / Ubuntu 20.04+ |

| CPU | 4 cores | 20+ cores |

| RAM | 32 GB | 128 GB+ (for large polyploid genomes) |

| Disk | 100 GB free | 500 GB+ |

| Python | 3.8 | 3.9 or 3.10 |

| Conda | Any | Miniconda3 or Anaconda3 |

**Note**: For highly polyploid genomes (e.g., sugarcane with 100+ chromosomes), RAM usage can exceed 64 GB during tandem identification and BLAST steps. High-memory compute nodes are strongly recommended.

---

## Step 1: Install Conda (if not already installed)

\# Download Miniconda3 installer

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86\_64.sh



\# Run the installer

bash Miniconda3-latest-Linux-x86\_64.sh



\# Reload shell

source \~/.bashrc

---

## Step 2: Clone GraphAllele

git clone https://github.com/GengruiZhu/GraphAllele.git

cd GraphAllele

---

## Step 3: Create the Conda Environment

### Option A: Use the provided environment file (Recommended)

conda env create -f environment.yml

conda activate polyalleler

### Option B: Manual step-by-step creation

\# Create environment

conda create -n polyalleler python=3.9 -y

conda activate polyalleler



\# Install Python packages

pip install biopython pandas networkx



\# Install bioinformatics tools via bioconda

conda install -c bioconda -c conda-forge blast gffread last jcvi -y



\# Optional: OrthoFinder (only needed if using --auto\_og)

conda install -c bioconda orthofinder -y

---

## Step 4: Verify Installation

Run the following to confirm all dependencies are available:

\# Activate environment first

conda activate polyalleler



\# Check Python packages

python -c "from Bio import SeqIO; import pandas; import networkx; print('Python packages: OK')"



\# Check external tools

blastp -version

makeblastdb -version

tblastn -version

gffread --version

python -m jcvi.compara.catalog --help 2>\&1 | head -3

If all commands print version info without errors, installation is complete.

---

## Step 5: Test with Example Data (Optional)

cd GraphAllele/workflow



\# Test run with provided example data (Chr5 only)

python GraphAllele.py \\

  -g ../data/ROC22.gff3 \\

  -f ../data/ROC22.fasta \\

  -ref\_g ../data/Eru.gff3 \\

  -ref\_f ../data/Eru.cds \\

  -og ./Orthogroups.tsv \\

  -s 5 -e 5 \\

  -t 4 \\

  --sub\_list A,B,C,D,E

---

## Troubleshooting

### ModuleNotFoundError: No module named 'Bio'

conda activate polyalleler

pip install biopython

### blastp: command not found

conda activate polyalleler

conda install -c bioconda blast -y

### gffread: command not found

conda activate polyalleler

conda install -c bioconda gffread -y

### JCVI/MCScan errors during Step 3

Ensure you are running from within the correct working directory. GraphAllele handles this automatically via cwd= in subprocess calls. If issues persist, check that .cds and .bed files are correctly linked in the 03.jcvi/ subdirectory.

### OrthoFinder not found (when using --auto\_og)

conda activate polyalleler

conda install -c bioconda orthofinder -y

---

## PBS Job Submission

Edit workflow/run\_GraphAllele.sh to set the correct mem= and ppn= values for your cluster:

\#PBS -l nodes=1:ppn=20

\#PBS -l mem=128gb

---

## HPC / Cluster Notes

- The pipeline processes chromosome groups **sequentially** (one by one) to protect HPC NFS I/O and ensure stable logging. Set -t to the number of threads available per node.

- For PBS clusters, submit via qsub workflow/run\_GraphAllele.sh.

- For SLURM clusters, adapt the job header accordingly:

\#!/bin/bash

\#SBATCH -J GraphAllele

\#SBATCH -n 20

\#SBATCH --mem=128G

\#SBATCH -o Allele.log



source activate polyalleler

cd /path/to/GraphAllele/workflow



python GraphAllele.py \\

  -g ../data/genome.gff3 \\

  -f ../data/genome.fasta \\

  -ref\_g ../data/reference.gff3 \\

  -ref\_f ../data/reference.cds \\

  --auto\_og \\

  -s 1 -e 10 \\

  -t 20 \\

  --sub\_list A,B,C,D,E,F,G,H,I,J,K,L,M,N