telomere-finder: a tool to identify putative telomeric and subtelomeric sequences from draft genome assemblies
The tool telomere_finder.py helps identify potential telomeric repeats in genome assemblies by analyzing k-mer occurrences at contig ends. It supports both gzip-compressed and uncompressed FASTA inputs and can output results in BED or GFF format.
# Clone the repository
git clone https://github.com/yourusername/telomere-finder
cd telomere-finder
# Install dependencies
pip install biopython
Basic usage with default settings (6-mers and 7-mers, BED output):
python telomere_finder.py input.fasta
Analyze specific k-mer sizes and output GFF:
python telomere_finder.py --kmers 6 7 8 --format gff input.fasta
Let's analyze a simple demo genome with known telomeric repeats. The demo file demo.fasta contains three chromosomes:
- chr1: Contains multiple TTTAGGG repeats at both ends
- chr2: Has TTAGGG repeats internally and CCCTAA at the end
- chr3_reverse: Contains the canonical telomere sequence in reverse-complement orientation (CCCTAA)
# Run analysis on demo genome
python telomere_finder.py --kmers 6 7 demo.fasta
Expected output:
Top telomeric repeats found at contig ends:
TTAGGG: 12 occurrences at contig ends
TTTAGGG: 8 occurrences at contig ends
CCCTAA: 12 occurrences at contig ends
- .bed or .gff file containing positions of all k-mer occurrences:
  - BED format (0-based): chromosome start end kmer score strand
  - GFF format (1-based): Includes additional metadata about end proximity
  - TODO: double-check basing. FOLLOWUP-20250101: In "telomere-finder-2.py" I corrected line 63.
- .stats.txt file containing:
  - K-mer occurrence counts
  - Number of times each k-mer appears near contig ends
  - Distribution of k-mers across the genome
The tool is particularly useful for:
- Identifying potential telomeric repeats in draft assemblies
- Validating genome assembly completeness
- Detecting non-canonical telomere sequences
- Analyzing telomere orientation and distribution
Note that the tool considers sequences within 1000bp of contig ends as potential telomeric regions by default.
- First, this variable in the constructor defines what counts as an "end":
def __init__(self, kmer_sizes, end_region_size=1000):
    self.end_region_size = end_region_size
- Then in the process_sequence method, a k-mer is marked as being at an "end" if it meets this condition:
'is_end': i < self.end_region_size or i > seq_len - self.end_region_size
Breaking this down:
- i < self.end_region_size: This checks if the k-mer starts within the first 1000bp of the sequence
- i > seq_len - self.end_region_size: This checks if the k-mer starts within the last 1000bp of the sequence
Any k-mer that occurs within 1000 bases of either the start or end of a contig is considered to be at an "end". This is somewhat arbitrary - telomeres can be longer or shorter than 1000bp.
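To make the end-region rule concrete, here is a minimal standalone sketch of the condition quoted above. The function names is_end and count_end_kmers are hypothetical and are not taken from telomere_finder.py; only the comparison itself comes from the snippet. Note that the first window covers start indices 0 through end_region_size - 1, while the second begins at seq_len - end_region_size + 1 because the comparison is strict, so the two windows differ by one position. Contigs shorter than twice end_region_size are treated as "end" along their whole length.

```python
def is_end(i, seq_len, end_region_size=1000):
    """Same comparison as the quoted 'is_end' expression, for a k-mer
    starting at 0-based index i in a sequence of length seq_len."""
    return i < end_region_size or i > seq_len - end_region_size


def count_end_kmers(seq, k, end_region_size=1000):
    """Count k-mers whose start index falls in an end region.

    Hypothetical helper for illustration; the real tool may count differently.
    """
    seq = seq.upper()
    counts = {}
    for i in range(len(seq) - k + 1):
        if is_end(i, len(seq), end_region_size):
            kmer = seq[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts
```

Only the k-mer's start index is tested here, as in the quoted condition, so a k-mer that begins just outside the first window but overlaps it is not counted as an end hit.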
The tool automatically handles gzipped FASTA files:
python telomere_finder.py input.fasta.gz
Modify the end_region_size parameter in the code to adjust what's considered an "end region" (default: 1000bp).
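For readers curious how gzipped and plain FASTA input can share one code path, the sketch below uses only the Python standard library. It is an assumption about one reasonable approach, not a description of how telomere_finder.py does it (the tool depends on Biopython and may rely on the file extension instead). The names open_maybe_gzip and read_fasta are hypothetical.

```python
import gzip


def open_maybe_gzip(path):
    """Open a file in text mode, using gzip if the gzip magic bytes are present."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path):
    """Yield (name, sequence) pairs from a plain or gzipped FASTA file."""
    name, chunks = None, []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Keep only the first whitespace-delimited token as the name.
                parts = line[1:].split()
                name, chunks = (parts[0] if parts else ""), []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Checking the magic bytes rather than the .gz suffix means a compressed file with an unexpected name is still read correctly.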
- BED format: 0-based, half-open intervals
- GFF format: 1-based, closed intervals
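The two coordinate conventions above differ only in the start value. The sketch below shows the conversion for a k-mer hit; the helper names are hypothetical and the code illustrates the general BED and GFF conventions, not the tool's own output routine.

```python
def bed_to_gff(start, end):
    """Convert a 0-based, half-open interval to 1-based, closed coordinates."""
    return start + 1, end


def gff_to_bed(start, end):
    """Convert a 1-based, closed interval to 0-based, half-open coordinates."""
    return start - 1, end


def kmer_interval(i, k, fmt="bed"):
    """Interval for a k-mer of length k starting at 0-based index i.

    Hypothetical helper for illustration only.
    """
    bed = (i, i + k)
    if fmt == "bed":
        return bed
    if fmt == "gff":
        return bed_to_gff(*bed)
    raise ValueError(f"unknown format: {fmt}")
```

For example, a 6-mer at the very start of a contig spans 0 to 6 in BED and 1 to 6 in GFF. The feature length is end - start in BED and end - start + 1 in GFF.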
Human and many vertebrates:
- TTAGGG (canonical)
- CCCTAA (reverse complement)
Some known variants:
- TTTAGGG
- TTAGGGG
- TTAGG
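Because a telomeric repeat reads as its reverse complement on the opposite strand (TTAGGG versus CCCTAA above), it can help to pool counts for a motif and its reverse complement when reviewing results. The sketch below is a hypothetical post-processing step, not part of telomere_finder.py. It picks the lexicographically smaller of the two strings as the shared key, which is an arbitrary choice.

```python
# Translation table for DNA complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_strand_counts(counts):
    """Pool counts of each k-mer with its reverse complement.

    The key is whichever of the two strings sorts first; this is an
    arbitrary convention chosen for the sketch.
    """
    merged = {}
    for kmer, n in counts.items():
        key = min(kmer, reverse_complement(kmer))
        merged[key] = merged.get(key, 0) + n
    return merged
```

Applied to the motifs listed above, reverse_complement maps TTAGGG to CCCTAA and TTTAGGG to CCCTAAA.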
- Start with multiple k-mer sizes to catch variants
- Look for reverse complement sequences
- Pay attention to k-mer clustering at contig ends
- Consider both frequency and position of repeats
- The CLI summary (as of December 10, 2024) reports the top k-mers it finds, and these might not match canonical telomeric repeats such as TTAGGG.