bio-ontology-research-group/gapsmith

gapsmith

Rust reimplementation of gapseq. For a detailed comparison with the original R/bash implementation, see COMPARISON.md.

What it does

gapsmith reconstructs genome-scale metabolic models from bacterial proteomes. Given a protein FASTA, it predicts metabolic pathways, detects transporters, assembles a draft stoichiometric model, infers a growth medium, and gap-fills the model so it can simulate growth.

gapsmith doall genome.faa.gz -f output/

This produces a gap-filled SBML model that loads directly in COBRApy, COBRAToolbox, or any SBML-compatible tool.
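
Before loading the model into a downstream tool, a quick text-level check can confirm that the SBML file actually contains species and reactions. The sketch below is illustrative only: `count_sbml_tags` is a hypothetical helper, not part of gapsmith, and it counts opening tags with grep instead of parsing the XML, so it assumes each opening tag is followed by a space or `>` on the same line.

```shell
# Count opening tags of one SBML element type (e.g. reaction, species).
# Hypothetical helper; a grep-based heuristic, not an XML parser.
# Closing tags (</reaction>) and longer names (<speciesReference ...>)
# are not counted because the name must be followed by a space or '>'.
count_sbml_tags() {
  tag=$1
  file=$2
  grep -o "<$tag[ >]" "$file" | wc -l | tr -d ' '
}
```

For example, `count_sbml_tags reaction output/genome-filled.xml` prints the number of reaction elements; the file name here is an assumption based on the `*-filled.xml` output pattern.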

Install

Prerequisites

An external sequence aligner (pick one):

| Tool | Install |
|------|---------|
| BLAST+ | apt install ncbi-blast+ or conda install -c bioconda blast |
| DIAMOND | apt install diamond or conda install -c bioconda diamond |
| MMseqs2 | apt install mmseqs2 or conda install -c bioconda mmseqs2 |

Plus a C++ toolchain (cmake, gcc/clang) for the bundled HiGHS LP solver — any Linux / macOS system with dev-tools installed will do.
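
To see which of the aligners listed above is already installed, a small check along these lines may help. It is a sketch, not part of gapsmith: `detect_aligners` is a hypothetical helper, and the binary names it probes (`blastp`, `diamond`, `mmseqs`) are the executables those tools normally install, which are not necessarily the values the `-A` flag accepts.

```shell
# Print the name of each aligner executable found on PATH, one per line.
# Hypothetical helper; prints nothing if none of the three is installed.
detect_aligners() {
  for bin in blastp diamond mmseqs; do
    if command -v "$bin" >/dev/null 2>&1; then
      printf '%s\n' "$bin"
    fi
  done
}
```

Running `detect_aligners` with no arguments lists whatever is available; an empty result means one of the tools above still needs to be installed.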

Option 1: pre-built binary (fastest)

# Pick the right target for your OS/arch; see Releases page.
TARGET=x86_64-unknown-linux-gnu
VER=$(curl -s https://api.github.com/repos/bio-ontology-research-group/gapsmith/releases/latest \
  | grep '"tag_name"' | head -1 | sed 's/.*"\(v[^"]*\)".*/\1/')
curl -L https://github.com/bio-ontology-research-group/gapsmith/releases/download/$VER/gapsmith-$VER-$TARGET.tar.gz | tar xz
cd gapsmith-$VER-$TARGET
./gapsmith --version

Each release tarball bundles the binary and the curated data tables. See the releases page for the available targets.
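
The `TARGET` value in the snippet above is chosen by hand. As a rough sketch, it could be derived from `uname` output instead. `guess_target` is a hypothetical helper, and the mapping is an assumption: the triples are standard Rust target names, but which of them actually have a release asset is not stated here, so check the releases page.

```shell
# Map an OS name and CPU name (as printed by `uname -s` and `uname -m`)
# to a Rust target triple. Hypothetical helper; whether a release asset
# exists for the resulting triple is an assumption.
guess_target() {
  case "$2" in
    x86_64|amd64)  cpu=x86_64 ;;
    arm64|aarch64) cpu=aarch64 ;;
    *) echo "unsupported architecture: $2" >&2; return 1 ;;
  esac
  case "$1" in
    Linux)  echo "$cpu-unknown-linux-gnu" ;;
    Darwin) echo "$cpu-apple-darwin" ;;
    *) echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Example use:
#   TARGET=$(guess_target "$(uname -s)" "$(uname -m)")
```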

Option 2: cargo install (directly from git)

cargo install --git https://github.com/bio-ontology-research-group/gapsmith.git gapsmith-cli

Installs gapsmith into ~/.cargo/bin/. You still need the data/ curation tables — clone the repo or grab them from a release tarball.

Option 3: build from source

git clone https://github.com/bio-ontology-research-group/gapsmith.git
cd gapsmith
cargo build --release
# Binary: target/release/gapsmith, curated data in ./data/

Reference data

Three parts, fetched independently:

  1. Curation tables (subex, medium rules, biomass templates, …), vendored in this repo under data/ (~1 MB). They are used automatically when running from a checkout and are bundled inside release tarballs.

  2. Large public reference tables (SEED reactions + metabolites, MNXref cross-refs, ~65 MB) — fetched on demand from upstream gapseq's GitHub mirror:

    gapsmith update-data -o path/to/dat

  3. Sequence database (per-reaction FASTAs, ~2 GB) — downloaded from Zenodo on demand:

    gapsmith update-sequences -D path/to/dat/seq -t Bacteria

After that you have a complete data directory and no longer need any upstream gapseq checkout. Point all subsequent invocations at it with --data-dir path/to/dat.

License-restricted data (MetaCyc pathways, KEGG, BiGG, BRENDA, VMH) is opt-in; a forthcoming --accept-license flag will gate loading it.

Quick start

# Full reconstruction pipeline (find → transport → draft → medium → fill)
gapsmith --data-dir path/to/dat doall genome.faa.gz -f output/ -A diamond

# Step by step
gapsmith --data-dir path/to/dat find -p all -A diamond -o output/ genome.faa
gapsmith --data-dir path/to/dat find-transport -A diamond -o output/ genome.faa
gapsmith --data-dir path/to/dat draft -r output/*-Reactions.tbl -t output/*-Transporter.tbl -o output/
gapsmith --data-dir path/to/dat medium -m output/*-draft.gmod.cbor -p output/*-Pathways.tbl
gapsmith --data-dir path/to/dat fill output/*-draft.gmod.cbor -n output/*-medium.csv -r output/*-Reactions.tbl -o output/
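
The step-by-step commands can be wrapped in a small function that stops at the first failing stage. This is a sketch under stated assumptions: `run_steps` and the `GAPSMITH` environment variable are hypothetical (the variable only overrides which binary is called, which is handy for dry runs), and each glob is assumed to match exactly one file in the output directory.

```shell
# Run the five stages in order; the && chain aborts at the first failure.
# Arguments: genome FASTA, output directory, data directory.
# GAPSMITH is a hypothetical override, not a gapsmith option.
run_steps() {
  genome=$1
  out=$2
  data=$3
  gs=${GAPSMITH:-gapsmith}
  "$gs" --data-dir "$data" find -p all -A diamond -o "$out" "$genome" &&
  "$gs" --data-dir "$data" find-transport -A diamond -o "$out" "$genome" &&
  "$gs" --data-dir "$data" draft -r "$out"/*-Reactions.tbl -t "$out"/*-Transporter.tbl -o "$out" &&
  "$gs" --data-dir "$data" medium -m "$out"/*-draft.gmod.cbor -p "$out"/*-Pathways.tbl &&
  "$gs" --data-dir "$data" fill "$out"/*-draft.gmod.cbor -n "$out"/*-medium.csv -r "$out"/*-Reactions.tbl -o "$out"
}
```

Calling `GAPSMITH=echo run_steps genome.faa output path/to/dat` prints the five argument lists without running anything, which is a cheap way to review them first.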

Output files

| File | Contents |
|------|----------|
| *-all-Reactions.tbl | Per-reaction homology hits + pathway context |
| *-all-Pathways.tbl | Pathway completeness predictions |
| *-Transporter.tbl | Detected transporters |
| *-draft.gmod.cbor | Draft model (native format) |
| *-draft.xml | Draft model (SBML L3V1 + FBC2 + groups) |
| *-medium.csv | Predicted growth medium |
| *-filled.gmod.cbor | Gap-filled model (native format) |
| *-filled.xml | Gap-filled model (SBML) |
| *-filled-added.tsv | Reactions added during gap-filling |
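
After a run, it can be useful to confirm that every file in the table above was written. The following sketch lists the patterns that have no match in an output directory. `missing_outputs` is a hypothetical helper, not a gapsmith subcommand, and it assumes all outputs land in a single directory.

```shell
# Print each expected output pattern that has no matching file in the
# given directory. Hypothetical helper; prints nothing if all are present.
missing_outputs() {
  dir=$1
  for suffix in -Reactions.tbl -Pathways.tbl -Transporter.tbl \
                -draft.gmod.cbor -draft.xml -medium.csv \
                -filled.gmod.cbor -filled.xml -filled-added.tsv; do
    found=no
    for f in "$dir"/*"$suffix"; do
      # An unmatched glob stays literal, so test that the file exists.
      if [ -e "$f" ]; then
        found=yes
      fi
    done
    if [ "$found" = no ]; then
      printf '%s\n' "*$suffix"
    fi
  done
}
```

For example, `missing_outputs output/` prints one pattern per missing file type.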

Subcommands

| Command | Description |
|---------|-------------|
| doall | Full pipeline: find → transport → draft → medium → fill |
| find | Pathway and reaction detection |
| find-transport | Transporter detection |
| draft | Build a draft metabolic model |
| medium | Rule-based growth medium inference |
| fill | Iterative gap-filling (pFBA + KO essentiality) |
| fba | FBA / pFBA on an existing model |
| adapt | Add/remove reactions or force growth on compounds |
| pan | Build a pan-draft model from multiple drafts |
| batch-align | Cluster N genomes + single alignment + per-genome TSVs |
| doall-batch | Run doall across many genomes in parallel (rayon + SLURM-array --shard) |
| community per-mag | Per-MAG FBA under a shared (union) medium — scales to 1000+ MAGs |
| community cfba | Compose N drafts into one community model; weighted-sum biomass |
| update-sequences | Sync reference sequence database from Zenodo |
| update-data | Fetch the large public reference tables (SEED, MNXref) |
| convert | Convert between CBOR and JSON model formats |
| export-sbml | Export a model as SBML |

Run any command with -h for full option documentation.

Documentation

Full documentation is published at https://bio-ontology-research-group.github.io/gapsmith/.

Local copies:

| Document | Contents |
|----------|----------|
| User guide | Install, quick-start, per-subcommand recipes, troubleshooting |
| CLI reference | Every flag of every subcommand |
| Multi-genome & metagenome workflows | gspa integration, doall-batch for 1k–1M genomes, community per-mag vs cfba |
| Architecture | Crate dependency graph, data flow, LP plumbing |
| Feature matrix | R source → Rust module mapping, status per feature |
| Porting notes | Intentional deviations from upstream gapseq |
| Performance | Shipped optimisations, benchmarks, semantic-parity results |
| Comparison | Performance benchmarks and feature comparison with upstream |

License

GPL-3.0-or-later — same as gapseq.

Citation

If you use gapsmith, please cite the original gapseq paper:

Zimmermann J, Kaleta C, Özbek Ö, et al. gapseq: informed prediction of bacterial metabolic pathways and reconstruction of accurate metabolic models. Genome Biology 22, 81 (2021). https://doi.org/10.1186/s13059-021-02295-1

About

Rust reimplementation of gapseq — informed prediction and analysis of bacterial metabolic pathways and genome-scale networks. ~3× faster on pathway detection, in-process LP solver, single static binary.
