ONT_homopolymer_screen

Identify regions in nucleic acid sequences that may coincide with sequencing artifacts and lead to incorrect interpretations. This script was originally written with nanopore sequencing in mind, but other platforms have also historically had issues with homopolymers. Additionally, repetitive sequences may be more likely to change between replication cycles.

Related tool: telomere-finder.py

ONT Problematic Regions Detection Tutorial

This tutorial demonstrates how to use the ONT problematic regions detection script "ont-problems.py" to identify sequences that are potentially problematic for Oxford Nanopore Technologies (ONT) sequencing, including homopolymers, repeats, and user-defined k-mers. ONT wet and dry lab development has advanced at a rapid pace. This repo is intended to help inform analyses of legacy ONT datasets and to give users regions to prioritize during standard bioinformatics workflows.

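Note: this tool may be under continued development, so exact naming and usage might change (for example, the script appears as both ont-problems.py and ont_problems.py in this tutorial). Adapt the commands below to your preferences and local running environment.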

Installation

# Clone the repository
git clone https://github.com/yourusername/ont-problems
cd ont-problems

# Install required packages
pip install biopython pandas

TODO update with current repo name

Input Data

For this tutorial, we'll use two example files:

- A demo genome with telomeric repeats (demo-genome.fasta)
- The HIV-1 HXB2 reference genome

Demo Genome Structure

The demo genome contains three chromosomes:

>chr1
ACTGACTGACTGACTGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTAGGGTTAGGGACT
GACTGACTGACTGACTGACTGACTGACTGACTGACTGTTAGGGTTAGGGTTAGGG

>chr2
ACTGACTGACTGACTGTTAGGGTTAGGGTTAGGGACTGACTGACTGACTGTTAGGGTTAGGG
CCCTAACCCTAACCCTAACCCTAA

>chr3_reverse
CCCTAACCCTAACCCTAACCCTAACCCTAAGCTGACTGACTGACTGACTGCCCTAACCCTAA
CCCTAACCCTAACCCTAA

This file contains both forward ("TTAGGG") and reverse-complement ("CCCTAA") telomeric repeats, which makes it a convenient test case for k-mer detection.
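To make the forward/reverse relationship concrete, here is a minimal sketch in plain Python showing that "CCCTAA" is the reverse complement of "TTAGGG" and counting its occurrences in the second line of chr2. The helper names `reverse_complement` and `count_overlapping` are illustrative assumptions and are not taken from the script.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def count_overlapping(seq: str, kmer: str) -> int:
    """Count occurrences of kmer in seq, allowing overlapping matches."""
    count = 0
    start = seq.find(kmer)
    while start != -1:
        count += 1
        start = seq.find(kmer, start + 1)
    return count


# Second sequence line of chr2 from the demo genome
chr2_tail = "CCCTAACCCTAACCCTAACCCTAA"
forward = "TTAGGG"
reverse = reverse_complement(forward)

print(reverse)                                # CCCTAA
print(count_overlapping(chr2_tail, reverse))  # 4
```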

Basic Usage

The script can be run with minimal parameters:

python ont_problems.py --input demo-genome.fasta --output-prefix results/demo

This will:

- Identify homopolymers and repeats
- Find default k-mers (6-mers and 7-mers)
- Generate BED*, CSV, and metadata files

*TODO check BED/GTF/GFF; check basing as in telomere-finder
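As a rough illustration of the first step, homopolymers and short tandem repeats can be located with regular expressions. This is a sketch under stated assumptions, not the script's actual implementation: the function names, the minimum run length of 4, and the minimum of 3 repeat copies are all hypothetical choices. Coordinates are returned 0-based and half-open, as in BED.

```python
import re


def find_homopolymers(seq, min_len=4):
    """Return (start, end, base) for runs of a single base of at least min_len.

    min_len=4 is an assumed threshold, not necessarily the script's default.
    """
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


def find_tandem_repeats(seq, unit_len=2, min_copies=3):
    """Return (start, end, unit) for tandem repeats of a unit_len-base unit.

    Units made of a single repeated base (e.g. "AA") are skipped, since those
    are homopolymers. The scan advances one base at a time after a rejected
    match so that a repeat starting inside a homopolymer run is not missed.
    """
    seq = seq.upper()
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    pos = 0
    while pos < len(seq):
        m = pattern.match(seq, pos)
        if m and len(set(m.group(1))) > 1:
            hits.append((m.start(), m.end(), m.group(1)))
            pos = m.end()
        else:
            pos += 1
    return hits


print(find_homopolymers("AAAATTG"))                     # [(0, 4, 'A')]
print(find_tandem_repeats("ACACACGT", unit_len=2))      # [(0, 6, 'AC')]
print(find_tandem_repeats("GGCATCATCATGG", unit_len=3)) # [(2, 11, 'CAT')]
```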

Advanced Usage

Custom K-mer Lengths

python ont_problems.py --input demo-genome.fasta \
    --output-prefix results/demo \
    --kmer-lengths 6 7 8
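Conceptually, passing several k-mer lengths amounts to running a sliding-window count once per length. The sketch below shows that idea with `collections.Counter`; the function names are hypothetical, and the script may count or filter k-mers differently.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers_multi(seq, lengths=(6, 7, 8)):
    """Return {k: Counter of k-mers} for each requested length."""
    return {k: count_kmers(seq, k) for k in lengths}


seq = "TTAGGGTTAGGGTTAGGG"
counts = count_kmers_multi(seq)
print(counts[6]["TTAGGG"])       # 3
print(sum(counts[8].values()))   # 11 windows of length 8 in an 18-base sequence
```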

Different Output Formats

# GFF output (1-based coordinates)
python ont_problems.py --input demo-genome.fasta \
    --output-prefix results/demo \
    --output-format gff

# GTF output (1-based coordinates)
python ont_problems.py --input demo-genome.fasta \
    --output-prefix results/demo \
    --output-format gtf

Compressed Input

The script automatically handles gzipped FASTA files:

gzip demo-genome.fasta
python ont_problems.py --input demo-genome.fasta.gz \
    --output-prefix results/demo
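One common way to handle both plain and gzipped FASTA transparently is to check the first two bytes for the gzip magic number before choosing how to open the file. The following is a standard-library sketch of that approach; it is an assumption about how such handling can work, not a description of the script's code (which may rely on the file extension or on Biopython instead). `open_maybe_gzip` and `read_fasta` are hypothetical names.

```python
import gzip


def open_maybe_gzip(path):
    """Open path as text, decompressing if it starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def read_fasta(path):
    """Return {record_name: sequence} from a plain or gzipped FASTA file."""
    records = {}
    name = None
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {rec: "".join(parts) for rec, parts in records.items()}


# Usage (either file name works):
# sequences = read_fasta("demo-genome.fasta.gz")
```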

Example Analysis: HIV-1 HXB2

Let's analyze the HIV-1 HXB2 reference genome for potential ONT sequencing issues:

python ont_problems.py --input HXB2.fasta \
    --output-prefix results/hiv \
    --kmer-lengths 6 7

Expected output for HIV-1:

Top k-mers at contig ends:
K03455.1:
  TTAGCC: 12
  AAGCTT: 10
  TTTGCC: 8
  GCCTGT: 8
  GACTGG: 7

Problem region counts by type:
  homopolymer: 42
  dimer_repeat: 28
  trimer_repeat: 15
  kmer: 1834
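The "top k-mers at contig ends" summary can be thought of as a k-mer count restricted to a window at each end of a contig. The sketch below illustrates the idea; the window size of 100 bases and the function name `end_kmer_counts` are assumptions, since the tutorial does not state how the script defines a contig end.

```python
from collections import Counter


def end_kmer_counts(seq, k=6, window=100):
    """Count k-mers found within `window` bases of either end of seq.

    If the sequence is too short for two separate windows, the whole
    sequence is counted once so that k-mers are not double counted.
    """
    seq = seq.upper()
    if len(seq) <= 2 * window:
        regions = [seq]
    else:
        regions = [seq[:window], seq[-window:]]
    counts = Counter()
    for region in regions:
        counts.update(region[i:i + k] for i in range(len(region) - k + 1))
    return counts


# Toy contig: telomere-like repeats at both ends, unrelated sequence in between
contig = "TTAGGG" * 3 + "ACGT" * 50 + "CCCTAA" * 3
counts = end_kmer_counts(contig, k=6, window=18)
print(counts["TTAGGG"], counts["CCCTAA"], counts["ACGTAC"])  # 3 3 0
```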

Understanding the Output

1. Region Files (BED/GFF/GTF)

BED format (0-based, half-open: the end position is not included):

chr1    0    4    homopolymer_AAAA    4    +
chr1    10   16   kmer_TTAGGG    6    +

GFF format (1-based, inclusive):

##gff-version 3
chr1    ONT_problems    homopolymer    1    4    .    +    .    ID=homopolymer_0;Sequence=AAAA
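The two examples describe the same four-base homopolymer: BED `0 4` and GFF `1 4`. The conversion between the two conventions only shifts the start coordinate, as this short sketch shows. The function names are illustrative; the column layout follows the GFF3 example above, and the `Sequence` attribute is copied from it.

```python
def bed_to_gff_interval(start, end):
    """Convert a 0-based half-open interval to a 1-based inclusive one."""
    return start + 1, end


def gff_to_bed_interval(start, end):
    """Convert a 1-based inclusive interval to a 0-based half-open one."""
    return start - 1, end


def gff3_line(contig, start0, end0, feature_type, feature_id, sequence,
              strand="+", source="ONT_problems"):
    """Format one tab-separated GFF3 line from 0-based half-open coordinates."""
    start1, end1 = bed_to_gff_interval(start0, end0)
    attributes = f"ID={feature_id};Sequence={sequence}"
    fields = [contig, source, feature_type, str(start1), str(end1),
              ".", strand, ".", attributes]
    return "\t".join(fields)


print(bed_to_gff_interval(0, 4))  # (1, 4)
print(gff3_line("chr1", 0, 4, "homopolymer", "homopolymer_0", "AAAA"))
```

In both conventions the feature length is unchanged: `end - start` for BED and `end - start + 1` for GFF.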

2. Problems CSV

Contains detailed information about all identified regions:

- Contig name
- Start/end positions
- Sequence
- Type (homopolymer/repeat/kmer)
- Length
- Strand
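A CSV with these fields is easy to summarize with the standard library. The sketch below tallies regions and total bases per problem type. The column names `type` and `length` are hypothetical, since the tutorial lists the fields but not the exact header names, so check the header of your own output file and pass the matching names.

```python
import csv
from collections import Counter


def summarize_problems(csv_path, type_column="type", length_column="length"):
    """Return (region counts, total bases) per problem type from a problems CSV.

    type_column and length_column are assumed header names; adjust as needed.
    """
    region_counts = Counter()
    total_bases = Counter()
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            problem_type = row[type_column]
            region_counts[problem_type] += 1
            total_bases[problem_type] += int(row[length_column])
    return region_counts, total_bases


# Usage (hypothetical file name derived from --output-prefix results/demo):
# counts, bases = summarize_problems("results/demo_problems.csv")
# for problem_type, n in counts.most_common():
#     print(problem_type, n, bases[problem_type])
```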

3. Metadata File

Includes:

- Run timestamp
- Total regions found
- Counts by problem type
- Most common k-mers
- K-mers near contig ends

Best Practices

K-mer Selection:
- Use k=6 for telomeric repeats (TTAGGG)
- Use k=7 for other common repeats
- Larger k
