TAtouScan

TAtouScan is a command-line tool designed to identify toxin-antitoxin (TA) systems in genomes and metagenomes.

Installation

TAtouScan is available on PyPI. It is recommended to install it in a virtual environment:

python -m venv tatouscan-env
source tatouscan-env/bin/activate
pip install TAtouScan

Alternatively, use conda for environment management:

conda create -n tatouscan python=3.12
conda activate tatouscan
pip install TAtouScan

Note

TAtouScan is not yet available via bioconda. The above method combines conda for environment management and pip for installation.

Download the TAtouScan Database

TAtouScan requires a database directory containing HMM profiles and reference statistics.

Download the database and extract it with:

wget https://zenodo.org/records/20059258/files/tatouscan_db.tar.gz
tar -xzf tatouscan_db.tar.gz

The database directory must contain the following four files:

tatouscan_db/
  ta.hmm                 # HMM profiles (HMMER3 format)
  hmm_info.tsv           # profile metadata (name, type, source)
  family_statistics.tsv  # per-family reference statistics for scoring
  known_pairs.tsv        # known toxin–antitoxin family co-occurrences
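
Before a run, it can be worth confirming that the extracted directory is complete. The following is a minimal sketch, not part of TAtouScan itself: the helper name `missing_db_files` is hypothetical, and only the four file names are taken from the listing above.

```python
from pathlib import Path

# File names taken from the database listing above.
REQUIRED_DB_FILES = (
    "ta.hmm",
    "hmm_info.tsv",
    "family_statistics.tsv",
    "known_pairs.tsv",
)


def missing_db_files(db_dir):
    """Return the required database files that are absent from db_dir."""
    db_path = Path(db_dir)
    return [name for name in REQUIRED_DB_FILES if not (db_path / name).is_file()]


# Example check against the default extraction directory.
missing = missing_db_files("tatouscan_db")
if missing:
    print("Missing database files:", ", ".join(missing))
else:
    print("Database directory looks complete.")
```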

Usage

After installing TAtouScan and downloading the database, run it with three inputs:

a GFF file with gene annotations
a FAA file with the corresponding protein sequences
the database directory downloaded above

tatouscan --gff <genes.gff> --faa <proteins.faa> --db tatouscan_db/

By default, results are written to a directory called tatouscan_results/. Use --outdir to specify a different location:

tatouscan --gff <genes.gff> --faa <proteins.faa> --db tatouscan_db/ --outdir my_results/
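
For batch runs over many genomes, the same command can be assembled from Python. This is a sketch under assumptions: the function names and the example file names are hypothetical, TAtouScan is driven only through the command-line flags documented in this README (`--gff`, `--faa`, `--db`, `--outdir`, `--detailed`), and no Python API of the package is assumed.

```python
import shlex
import subprocess


def build_tatouscan_command(gff, faa, db, outdir=None, detailed=False):
    """Assemble the tatouscan command line as a list of arguments."""
    cmd = ["tatouscan", "--gff", str(gff), "--faa", str(faa), "--db", str(db)]
    if outdir is not None:
        cmd += ["--outdir", str(outdir)]
    if detailed:
        cmd.append("--detailed")
    return cmd


def run_tatouscan(gff, faa, db, outdir=None, detailed=False):
    """Run tatouscan and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_tatouscan_command(gff, faa, db, outdir, detailed), check=True)


# Show the command that would be executed for one (hypothetical) sample.
print(shlex.join(build_tatouscan_command(
    "sample1.gff", "sample1.faa", "tatouscan_db/", outdir="sample1_results/")))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.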

Two TSV files are produced inside the output directory:

File	Description
`tatouscan_results.tsv`	One row per predicted toxin or antitoxin gene
`tatouscan_results_pairs.tsv`	One row per predicted TA pair (two-gene systems only)

HMM Database Composition

The HMM database used by TAtouScan is composed of profiles collected from multiple sources, including curated databases and literature. The file hmm_info.tsv provides metadata for each profile, indicating its origin and whether it corresponds to a toxin or an antitoxin.

Breakdown of the database:

682 profiles were obtained from the TASmania project:
Akarsu H, Bordes P, Mansour M, Bigot D-J, Genevaux P, Falquet L (2019). TASmania: A bacterial Toxin-Antitoxin Systems database. PLoS Comput Biol 15(4): e1006946. https://doi.org/10.1371/journal.pcbi.1006946

3,168 profiles were generated from sequences in the TADB 3.0 database. The sequences were first clustered, each cluster was aligned with multiple sequence alignment, and an HMM profile was built from each resulting alignment:
Guan J, Chen Y, Goh YX, Wang M, Tai C, Deng Z, Song J, Ou HY (2024). TADB 3.0: an updated database of bacterial toxin-antitoxin loci and associated mobile genetic elements. Nucleic Acids Research 52(D1): D784–D790. https://doi.org/10.1093/nar/gkad962

Additional HMM profiles were manually collected from other sources in the literature.

Output

TAtouScan writes two TSV files into the output directory.

By default, only the most informative columns are written. Add --detailed to include per-source HMM breakdowns and raw Z-score columns.

`tatouscan_results.tsv` — per-gene results

One row per predicted toxin or antitoxin gene.

Column	Description
`contig_name`	Contig where the gene is located
`gene_id`	Gene identifier (from the input GFF)
`start` / `end`	Genomic coordinates
`strand`	`+` or `-`
`length_aa`	Protein length in amino acids
`product`	Predicted gene product (if available)
`ta_system_id`	ID shared by both genes of a pair (`None` for single-gene predictions)
`is_single_gene`	`True` if no paired partner was found
`gene_type`	`Toxin` or `Antitoxin`
`hmm_name` / `hmm_score` / `hmm_evalue`	Best HMM hit across all database sources
`hmm_source`	Database the best hit comes from (`TADB3`, `TASmania`, or other)
`hmm_description`	Profile description
`pair_is_known`	`1` if this toxin–antitoxin family combination is known in TADB3, `0` if not, `None` if family could not be identified
`score`	Unified match score in `(0, 1]` (see Scoring)

Scoring columns are None for single-gene predictions.
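
The per-gene file can be post-processed with the Python standard library. The sketch below keeps only genes whose score reaches a chosen threshold. It assumes that missing values appear in the TSV as the literal text `None`, as the column descriptions above suggest; the function names and the 0.7 threshold are illustrative choices, not part of TAtouScan.

```python
import csv


def load_gene_rows(path):
    """Read the per-gene TSV and turn 'None' or empty cells into Python None."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        for key, value in row.items():
            if value in ("None", ""):
                row[key] = None
    return rows


def confident_paired_genes(rows, min_score=0.7):
    """Keep rows that have a score (paired genes) of at least min_score."""
    kept = []
    for row in rows:
        if row["score"] is None:
            continue  # single-gene predictions carry no score
        if float(row["score"]) >= min_score:
            kept.append(row)
    return kept
```

A typical call would be `confident_paired_genes(load_gene_rows("tatouscan_results/tatouscan_results.tsv"))`.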

`tatouscan_results_pairs.tsv` — per-pair results

One row per predicted toxin–antitoxin pair. For systems with more than one toxin or antitoxin, all valid combinations are written as separate rows.

Column	Description
`ta_system_id`	Shared system ID (matches the per-gene file)
`contig_name`	Contig where the pair is located
`toxin_gene_id`	Toxin gene identifier
`toxin_strand`	`+` or `-`
`toxin_product`	Predicted gene product
`toxin_length_aa`	Toxin protein length in amino acids
`toxin_hmm_name` / `_score` / `_evalue` / `_source` / `_description`	Best HMM hit for the toxin
`antitoxin_gene_id`	Antitoxin gene identifier
`antitoxin_strand`	`+` or `-`
`antitoxin_product`	Predicted gene product
`antitoxin_length_aa`	Antitoxin protein length in amino acids
`antitoxin_hmm_name` / `_score` / `_evalue` / `_source` / `_description`	Best HMM hit for the antitoxin
`intergenic_distance`	Distance in nucleotides between the two genes (negative = overlap)
`pair_is_known`	`1` / `0` / `None` (see above)
`score`	Unified match score in `(0, 1]`
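
To illustrate the sign convention of `intergenic_distance`, the sketch below computes a gap from two pairs of coordinates. It assumes 1-based inclusive coordinates as in GFF, so that directly adjacent genes have a distance of 0; the exact formula used internally by TAtouScan is not stated here and may differ.

```python
def intergenic_distance(start_a, end_a, start_b, end_b):
    """Gap in nucleotides between two genes; negative values mean overlap.

    Assumes 1-based inclusive coordinates with start <= end for each gene.
    The order in which the two genes are given does not matter.
    """
    upstream_end = min(end_a, end_b)
    downstream_start = max(start_a, start_b)
    return downstream_start - upstream_end - 1


print(intergenic_distance(100, 400, 411, 700))  # 10 nt between the genes
print(intergenic_distance(100, 400, 397, 700))  # -4: the genes share 4 nt
```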

Detailed output

With --detailed, the following additional columns are written to both files:

Per-source HMM hits: TASmania_hmm_name/score/evalue/description, TADB3_hmm_name/score/evalue/description, Other_hmm_name/score/evalue/description (prefixed with toxin_ / antitoxin_ in the pairs file)
Scoring details: the raw Z-scores toxin_size_z, at_size_z, and intergenic_distance_z, together with matched_family and n_reference_pairs

The pairs file also adds toxin_start/end and antitoxin_start/end in detailed mode.

Scoring

Every predicted TA pair is compared against reference statistics derived from known TADB3 type-II systems. The score measures how closely the predicted pair resembles a genuine TA system of its family.

What is compared

Three structural features are measured for each predicted pair and compared against the reference distribution for the matched family:

Feature	Definition
`toxin_size`	Toxin protein length (amino acids)
`at_size`	Antitoxin protein length (amino acids)
`intergenic_distance`	Distance in nucleotides between the two genes (negative = overlap)

The toxin family is determined from its best TADB3 HMM hit. If no TADB3 hit exists or the family has fewer than 20 reference pairs, global statistics computed across all families are used as a fallback.
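
The fallback rule can be sketched as follows. The dictionary layout is hypothetical (the actual structure of family_statistics.tsv is not described here), as are the family name and the numbers in the example; only the threshold of 20 reference pairs and the `n_reference_pairs` name come from this README.

```python
MIN_REFERENCE_PAIRS = 20  # threshold stated in the text above


def select_reference(family, family_stats, global_stats):
    """Pick family statistics when available and well supported, else global ones.

    family       -- family name from the best TADB3 hit, or None if there is no hit
    family_stats -- hypothetical dict: family name -> stats dict with 'n_reference_pairs'
    global_stats -- hypothetical stats dict computed across all families
    """
    stats = family_stats.get(family) if family is not None else None
    if stats is None or stats["n_reference_pairs"] < MIN_REFERENCE_PAIRS:
        return global_stats
    return stats


# Illustrative values only.
family_stats = {
    "famA": {"n_reference_pairs": 150, "toxin_size_median": 130},
    "famB": {"n_reference_pairs": 7, "toxin_size_median": 95},
}
global_stats = {"n_reference_pairs": 5000, "toxin_size_median": 110}

print(select_reference("famA", family_stats, global_stats)["toxin_size_median"])  # family value
print(select_reference("famB", family_stats, global_stats)["toxin_size_median"])  # global fallback
```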

Robust Z-scores

For each feature, a Z-score measures how far the predicted value deviates from the family reference:

$$z = \frac{x - \text{median}}{\text{MAD} / 0.6745}$$

Median and MAD (median absolute deviation) are used instead of mean and standard deviation because size distributions in TA families are often skewed. This makes the scores robust to outliers.
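
A minimal sketch of the formula above, computed from a list of reference values. Dividing the MAD by 0.6745 rescales it so that it is comparable to a standard deviation when the data are normally distributed. The function name and the reference lengths are illustrative, and the handling of a zero MAD is an assumption; how TAtouScan treats that case is not stated here.

```python
from statistics import median

MAD_SCALE = 0.6745  # constant from the formula above


def robust_z(x, reference_values):
    """Robust Z-score of x relative to the median and MAD of reference_values."""
    med = median(reference_values)
    mad = median([abs(v - med) for v in reference_values])
    if mad == 0:
        # Assumption: refuse to divide by zero rather than guess a fallback.
        raise ValueError("MAD is zero; robust Z-score is undefined")
    return (x - med) / (mad / MAD_SCALE)


# Illustrative reference lengths: median = 100, MAD = 5.
reference_lengths = [90, 95, 100, 105, 110]
print(robust_z(105, reference_lengths))  # one MAD above the median, about 0.67
```

Replacing 110 with an extreme value such as 1000 leaves both the median and the MAD of this example unchanged, which is the robustness property described above.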

Unified score

All Z-scores are combined into a single score in the range $(0, 1]$:

$$\text{score} = \exp\!\left(-\frac{1}{n}\sum_i |z_i|\right)$$

The mean is taken over all available terms: the three structural Z-scores plus a compatibility term ($z_{\text{compat}}$) based on whether this toxin–antitoxin family combination has been observed in TADB3:

pair_is_known = 1 → $z_{\text{compat}} = 0$ (no penalty)
pair_is_known = 0 → $z_{\text{compat}} = 2$ (unknown combination lowers the score)
pair_is_known = None → compatibility term excluded from the mean
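
The combination rule can be written out as a short function. This is a sketch of the formula and the three compatibility cases as described in this section, not TAtouScan's own code; the function name and the example Z-scores are illustrative.

```python
import math


def unified_score(structural_z, pair_is_known):
    """Combine Z-scores into a score in (0, 1] as exp(-mean(|z_i|)).

    structural_z  -- available structural Z-scores (toxin size, antitoxin size,
                     intergenic distance)
    pair_is_known -- 1, 0, or None, as in the pair_is_known column
    """
    terms = [abs(z) for z in structural_z]
    if pair_is_known == 1:
        terms.append(0.0)  # known combination: no penalty
    elif pair_is_known == 0:
        terms.append(2.0)  # unknown combination lowers the score
    elif pair_is_known is not None:
        raise ValueError("pair_is_known must be 1, 0, or None")
    if not terms:
        raise ValueError("no terms available to score")
    return math.exp(-sum(terms) / len(terms))


z = [0.5, -0.5, 0.4]  # illustrative structural Z-scores
for known in (1, 0, None):
    print(known, round(unified_score(z, known), 2))
```

With these illustrative Z-scores, the score is about 0.70 for a known combination, about 0.43 for an unknown one, and about 0.63 when the compatibility term is excluded.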

Score interpretation:

Score	Meaning
~1.0	Features match the family reference almost exactly, known combination
~0.7	Moderate structural match, known combination
~0.4	Moderate structural match, but family combination not seen in TADB3
< 0.2	Large structural deviations or unknown combination — treat with caution

A high score supports a genuine TA pair; a low score does not exclude it, but suggests the prediction should be reviewed.

License

This project is licensed under the MIT License.
