UTR_Checker

Assess user-provided HIV sequences for the presence of U3-R-U5.

UTR_Checker development notes

Problem: HIV sequences are often plagued with sequence defects and/or sequencing/assembly artifacts.

Aim: To develop a tool that screens HIV-1 sequences for LTRs, including the U3, R, and U5 regions, at the 5' and 3' ends.

Background: HIV DNA has an LTR-HIV-LTR structure, more specifically U3-R-U5-HIV-U3-R-U5. The 5' U3-R-U5 should be identical to the 3' U3-R-U5. HIV sequences whose two LTRs differ point to possible sequencing artifacts or synthetic constructs. For example, NL4-3 was made from two HIV-1 cDNAs restriction-ligated together during molecular cloning, as opposed to during an infective cycle. Note that the current historical HIV reference genome, GenBank:K03455.1 (HXB2), has sequencing artifacts in its U3 and U5:

CLUSTAL format alignment by MAFFT (v7.511)


HXB2_5'_LTR_U3_region     tggaagggctaattcactcccaacgaagacaagatatccttgatctgtggatctaccaca
HXB2_3'_LTR_U3_region     tggaagggctaattcactcccaaagaagacaagatatccttgatctgtggatctaccaca
                          *********************** ************************************

HXB2_5'_LTR_U3_region     cacaaggctacttccctgattagcagaactacacaccagggccagggatcagatatccac
HXB2_3'_LTR_U3_region     cacaaggctacttccctgattagcagaactacacaccagggccaggggtcagatatccac
                          ***********************************************.************

HXB2_5'_LTR_U3_region     tgacctttggatggtgctacaagctagtaccagttgagccagagaagttagaagaagcca
HXB2_3'_LTR_U3_region     tgacctttggatggtgctacaagctagtaccagttgagccagataagatagaagaggcca
                          ******************************************* *** *******.****

HXB2_5'_LTR_U3_region     acaaaggagagaacaccagcttgttacaccctgtgagcctgcatggaatggatgacccgg
HXB2_3'_LTR_U3_region     ataaaggagagaacaccagcttgttacaccctgtgagcctgcatgggatggatgacccgg
                          *.********************************************.*************

HXB2_5'_LTR_U3_region     agagagaagtgttagagtggaggtttgacagccgcctagcatttcatcacatggcccgag
HXB2_3'_LTR_U3_region     agagagaagtgttagagtggaggtttgacagccgcctagcatttcatcacgtggcccgag
                          **************************************************.*********

HXB2_5'_LTR_U3_region     agctgcatccggagtacttcaagaactgctgacatcgagcttgctacaagggactttccg
HXB2_3'_LTR_U3_region     agctgcatccggagtacttcaagaactgctgacatcgagcttgctacaagggactttccg
                          ************************************************************

HXB2_5'_LTR_U3_region     ctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagat
HXB2_3'_LTR_U3_region     ctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagat
                          ************************************************************

HXB2_5'_LTR_U3_region     cctgcatataagcagctgctttttgcctgtactgg
HXB2_3'_LTR_U3_region     cctgcatataagcagctgctttttgcctgtactgg
                          ***********************************

CLUSTAL format alignment by MAFFT (v7.511)


HXB2_3'_LTR_R_repeat      gtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacccact
HXB2_5'_LTR_R_repeat      gtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacccact
                          ************************************************************

HXB2_3'_LTR_R_repeat      gcttaagcctcaataaagcttgccttgagtgcttca
HXB2_5'_LTR_R_repeat      gcttaagcctcaataaagcttgccttgagtgcttca
                          ************************************

CLUSTAL format alignment by MAFFT (v7.511)


HXB2_3'_LTR_U5_region     agtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagaccctttta
HXB2_5'_LTR_U5_region     agtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagaccctttta
                          ************************************************************

HXB2_3'_LTR_U5_region     gtcagtgtggaaaatctctagca
HXB2_5'_LTR_U5_region     gtcagtgtggaaaatctctagca
                          ***********************
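The differences marked in the U3 alignment above can also be listed programmatically. The following is a minimal sketch, assuming two already-aligned strings of equal length; the function name `list_mismatches` is hypothetical and is not part of utr-checker.py. The two inputs are the first 60 columns of the HXB2 U3 alignment.

```python
from typing import List, Tuple


def list_mismatches(a: str, b: str) -> List[Tuple[int, str, str]]:
    """Return (0-based column, base in a, base in b) for each differing column.

    Hypothetical helper: expects pre-aligned sequences of equal length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# First 60 columns of the HXB2 5' and 3' U3 regions, copied from the alignment.
u3_5prime = "tggaagggctaattcactcccaacgaagacaagatatccttgatctgtggatctaccaca"
u3_3prime = "tggaagggctaattcactcccaaagaagacaagatatccttgatctgtggatctaccaca"

for col, base5, base3 in list_mismatches(u3_5prime, u3_3prime):
    # Report 1-based columns to match how alignments are usually read.
    print(f"column {col + 1}: 5' LTR has {base5}, 3' LTR has {base3}")
```

Gapped alignments would need the gap character handled explicitly; this sketch treats a gap as just another mismatch.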

These sequencing artifacts conflict with the reported identity of HXB2, which was believed to be a host-integrated HIV-1 provirus captured experimentally early in the prepandemic era of the HIV-1 epidemic. Based on the current model of retroviral replication, the LTRs are homogenized after reverse transcription (RT) and before integration, so the two LTRs of a provirus should match. More recent resequencing of a legacy HXB2 molecular clone, pHXB2_D (GenBank:MW079479.1), resolved this incongruity. Because that assembly was made with high-depth long-read sequencing, it is the most accurate representation of HXB2. (Full references in my dissertation, "HIV Informatics".)

As such, the corrected landmarks were used to develop UTR_Checker (utr-checker.py). The tool aims to determine how much of U3, R, and U5 is present at the 3' and 5' ends of a user-provided HIV-1 sequence. Based on this, it may then classify the input as HIV DNA, RNA, or partial/incomplete.

CLUSTAL format alignment by MAFFT (v7.511)


MW079479.1:1-634          tggaagggctaattcactcccaaagaagacaagatatccttgatctgtggatctaccaca
MW079479.1:9086-9719      tggaagggctaattcactcccaaagaagacaagatatccttgatctgtggatctaccaca
                          ************************************************************

MW079479.1:1-634          cacaaggctacttccctgattagcagaactacacaccagggccaggggtcagatatccac
MW079479.1:9086-9719      cacaaggctacttccctgattagcagaactacacaccagggccaggggtcagatatccac
                          ************************************************************

MW079479.1:1-634          tgacctttggatggtgctacaagctagtaccagttgagccagataaggtagaagaggcca
MW079479.1:9086-9719      tgacctttggatggtgctacaagctagtaccagttgagccagataaggtagaagaggcca
                          ************************************************************

MW079479.1:1-634          ataaaggagagaacaccagcttgttacaccctgtgagcctgcatgggatggatgacccgg
MW079479.1:9086-9719      ataaaggagagaacaccagcttgttacaccctgtgagcctgcatgggatggatgacccgg
                          ************************************************************

MW079479.1:1-634          agagagaagtgttagagtggaggtttgacagccgcctagcatttcatcacgtggcccgag
MW079479.1:9086-9719      agagagaagtgttagagtggaggtttgacagccgcctagcatttcatcacgtggcccgag
                          ************************************************************

MW079479.1:1-634          agctgcatccggagtacttcaagaactgctgatatcgagcttgctacaagggactttccg
MW079479.1:9086-9719      agctgcatccggagtacttcaagaactgctgatatcgagcttgctacaagggactttccg
                          ************************************************************

MW079479.1:1-634          ctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagat
MW079479.1:9086-9719      ctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagat
                          ************************************************************

MW079479.1:1-634          cctgcatataagcagctgctttttgcctgtactgg[gtctctctggttagaccagatctga
MW079479.1:9086-9719      cctgcatataagcagctgctttttgcctgtactgg[gtctctctggttagaccagatctga
                          ***********************************[*************************

MW079479.1:1-634          gcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgcct
MW079479.1:9086-9719      gcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgcct
                          ************************************************************

MW079479.1:1-634          tgagtgcttca]agtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctc
MW079479.1:9086-9719      tgagtgcttca]agtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctc
                          ***********]*************************************************

MW079479.1:1-634          agacccttttagtcagtgtggaaaatctctagca
MW079479.1:9086-9719      agacccttttagtcagtgtggaaaatctctagca
                          **********************************

Square brackets in the alignment above mark the boundaries of the R region ([R]).
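The coordinate ranges in the alignment labels (1-634 and 9086-9719) are 1-based and inclusive, so each spans 634 bases. A minimal sketch of extracting both LTRs by such coordinates and testing them for identity follows; `slice_1based` and `ltrs_identical` are hypothetical names, and the toy genome is made up for illustration.

```python
from typing import Tuple


def slice_1based(seq: str, start: int, end: int) -> str:
    """Return seq[start..end] using 1-based, inclusive coordinates."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError(f"range {start}-{end} is outside 1-{len(seq)}")
    return seq[start - 1:end]


def ltrs_identical(genome: str,
                   five_prime: Tuple[int, int],
                   three_prime: Tuple[int, int]) -> bool:
    """Compare the two LTR slices, ignoring letter case."""
    a = slice_1based(genome, *five_prime)
    b = slice_1based(genome, *three_prime)
    return a.lower() == b.lower()


# Toy genome (not real HIV sequence): 10-nt "LTR", 20-nt body, same "LTR".
toy_ltr = "tggaagggct"
toy_genome = toy_ltr + "a" * 20 + toy_ltr
print(ltrs_identical(toy_genome, (1, 10), (31, 40)))
```

For a full-length MW079479.1 sequence loaded into a string, the equivalent call would be `ltrs_identical(genome, (1, 634), (9086, 9719))`.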

The HXB2 LTR heterogeneity discovered after pHXB2_D resequencing was first described in version 1 of Gener, 2019¹.

The pHXB2_D plasmid assembly and its data release as MW079479.1 were described in the most recent work: Gener et al. 2021².

(Pairwise alignment done with MAFFT online server. Method FFT-NS-i (Standard). Command: mafft --reorder --auto input.)

References: Katoh et al. (2002) describes FFT-NS-1, FFT-NS-2 and FFT-NS-i.

Kuraku et al. (2013) outlines this web service.

MAFFT home: https://mafft.cbrc.jp/alignment/software/

UTR-Checker Tutorial

Overview

UTR-Checker is a Python script that analyzes HIV sequences for the presence and arrangement of the U3, R, and U5 regions. It uses minimap2 to identify initial candidates and then analyzes them with a detailed alignment, which keeps it efficient for both short and long sequences. Version 13 (utr-checker-13.py) works as expected on HXB2 DNA (K03455) and NL4-3 mRNA (MZ242719), both encoded as ACGT.

Dependencies

Required Python Packages

pip install biopython mappy

The script requires:

Python 3.6 or higher
Biopython (for sequence handling and alignment)
mappy (Python bindings for minimap2)
Built-in libraries: tempfile, os, logging, typing, argparse
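Before running the script, it can help to confirm that the interpreter and the two third-party packages are available. This is a small sketch using only the standard library; `missing_modules` is a hypothetical helper. Note that Biopython is imported as `Bio` and the minimap2 bindings as `mappy`.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 6):
    print("Python 3.6 or higher is required")

# Import names of the packages installed by `pip install biopython mappy`.
missing = missing_modules(["Bio", "mappy"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules found")
```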

Installation

Save the script as utr-checker.py
Ensure your input sequences are in FASTA format
Make the script executable (Unix/Linux):
```
chmod +x utr-checker.py
```

Usage

Basic Usage

python utr-checker.py input.fasta

The script accepts any file containing FASTA-formatted sequences, regardless of extension (.fasta, .fsa, .fa, .txt).

Advanced Options

python utr-checker.py --minimap-threshold 0.55 --final-threshold 0.65 --gap-open -2 --gap-extend -0.5 --debug input.fasta

Parameters:

--minimap-threshold: Initial screening threshold (default: 0.60)
--final-threshold: Final similarity threshold (default: 0.70)
--gap-open: Gap opening penalty (default: -2)
--gap-extend: Gap extension penalty (default: -0.5)
--debug: Enable debug output
--format: Input file format (default: fasta)
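The options listed above could be declared with `argparse` roughly as follows. This is a sketch that mirrors the documented flags and defaults only; the parser in the actual script may be organized differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="utr-checker.py",
        description="Screen HIV-1 sequences for U3, R and U5 regions.",
    )
    parser.add_argument("input", help="file with FASTA-formatted sequences")
    parser.add_argument("--minimap-threshold", type=float, default=0.60,
                        help="initial screening threshold")
    parser.add_argument("--final-threshold", type=float, default=0.70,
                        help="final similarity threshold")
    parser.add_argument("--gap-open", type=float, default=-2.0,
                        help="gap opening penalty")
    parser.add_argument("--gap-extend", type=float, default=-0.5,
                        help="gap extension penalty")
    parser.add_argument("--debug", action="store_true",
                        help="enable debug output")
    parser.add_argument("--format", default="fasta",
                        help="input file format")
    return parser


# Parse the same options used in the examples in this README.
args = build_parser().parse_args(
    ["--minimap-threshold", "0.55", "--final-threshold", "0.65", "input.fasta"]
)
print(args.minimap_threshold, args.final_threshold, args.input)
```

Because no option in this parser looks like a negative number, `argparse` accepts values such as `--gap-open -2` without extra quoting.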

Example Analysis

Using sample test files:

Example 1: HIV-1 mRNA (MZ242719.1)

user@computer:~$ python utr-checker.py MZ242719.fasta --minimap-threshold 0.55 --final-threshold 0.65

Analyzing sequence: MZ242719.1
Best match found on forward strand
Classification: Likely viral RNA
Overall confidence: 99.45%

Details:
- Multiple R regions detected (5' and 3' ends)
- U5 region present near 5' end
- U3 region present near 3' end
- U3 occurrences:
-   1. 98.35% similarity at position 8622-9077
- R occurrences:
-   1. 100.00% similarity at position 9077-9173
-   2. 97.92% similarity at position 2-95
- U5 occurrences:
-   1. 100.00% similarity at position 98-181

Example 2: HIV-1 DNA (HXB2)

user@computer:~$ python utr-checker.py HIV-1_HXB2.fasta --minimap-threshold 0.55 --final-threshold 0.65

Analyzing sequence: K03455.1
Best match found on forward strand
Classification: Incomplete/Unclear
Overall confidence: 99.78%

Details:
- Partial or unclear LTR pattern
- U3 occurrences:
-   1. 99.34% similarity at position 9085-9540
-   2. 97.03% similarity at position 0-455
- R occurrences:
-   1. 100.00% similarity at position 455-551
-   2. 100.00% similarity at position 9540-9636
- U5 occurrences:
-   1. 100.00% similarity at position 551-634
-   2. 100.00% similarity at position 9636-9719
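For downstream processing, the "occurrences" lines of a report can be turned into structured data. The sketch below assumes the report layout shown in the two examples above and nothing more; `parse_report` is a hypothetical helper, not a feature of the script.

```python
import re
from typing import Dict, List, Tuple

# Patterns assume the exact wording of the example reports in this README.
HEADER = re.compile(r"^-\s*(U3|R|U5) occurrences:")
HIT = re.compile(r"(\d+(?:\.\d+)?)% similarity at position (\d+)-(\d+)")


def parse_report(text: str) -> Dict[str, List[Tuple[float, int, int]]]:
    """Map each region name to a list of (similarity %, start, end) hits."""
    hits: Dict[str, List[Tuple[float, int, int]]] = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            current = header.group(1)
            hits.setdefault(current, [])
            continue
        hit = HIT.search(line)
        if hit and current is not None:
            hits[current].append(
                (float(hit.group(1)), int(hit.group(2)), int(hit.group(3)))
            )
    return hits


# Report lines copied from Example 1 (MZ242719.1).
report = """\
- U3 occurrences:
-   1. 98.35% similarity at position 8622-9077
- R occurrences:
-   1. 100.00% similarity at position 9077-9173
-   2. 97.92% similarity at position 2-95
- U5 occurrences:
-   1. 100.00% similarity at position 98-181
"""
for region, found in parse_report(report).items():
    print(region, found)
```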

Sequence Classification

The script classifies sequences into:

"Likely viral RNA"
- R regions at both ends
- U5 near 5' end
- U3 near 3' end
- Expected pattern: R-U5-genome-U3-R
"Likely genomic DNA"
- Complete U3-R-U5 pattern
- Found in correct order
- May be present at both ends
"Incomplete/Unclear"
- Regions present but in unexpected arrangement
- Missing expected regions
- Ambiguous pattern
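The three categories above can be expressed as simple rules over hit positions. The following sketch is one possible reading of those rules, not the logic used by utr-checker.py, and its verdicts may differ from the script's. The names `Hit`, `has_ltr`, and `classify`, the tolerance `tol`, and the end-distance `edge` are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    region: str  # "U3", "R" or "U5"
    start: int
    end: int


def has_ltr(hits: List[Hit], tol: int = 10) -> bool:
    """True if some U3, R and U5 hits are adjacent in that order."""
    u3s = [h for h in hits if h.region == "U3"]
    rs = [h for h in hits if h.region == "R"]
    u5s = [h for h in hits if h.region == "U5"]
    return any(
        abs(r.start - u3.end) <= tol and abs(u5.start - r.end) <= tol
        for u3 in u3s for r in rs for u5 in u5s
    )


def classify(hits: List[Hit], seq_len: int, tol: int = 10, edge: int = 50) -> str:
    """Apply the documented patterns; thresholds here are assumed values."""
    if has_ltr(hits, tol):
        return "Likely genomic DNA"
    r5 = [h for h in hits if h.region == "R" and h.start <= edge]
    r3 = [h for h in hits if h.region == "R" and seq_len - h.end <= edge]
    # U5 must directly follow the 5' R; U3 must directly precede the 3' R.
    u5_ok = any(h.region == "U5" and abs(h.start - r.end) <= tol
                for r in r5 for h in hits)
    u3_ok = any(h.region == "U3" and abs(r.start - h.end) <= tol
                for r in r3 for h in hits)
    if r5 and r3 and u5_ok and u3_ok:
        return "Likely viral RNA"
    return "Incomplete/Unclear"


# Hit positions from Example 1; the sequence length is an assumption
# (taken as the end of the last hit) made only for this illustration.
example1 = [Hit("U3", 8622, 9077), Hit("R", 9077, 9173),
            Hit("R", 2, 95), Hit("U5", 98, 181)]
print(classify(example1, seq_len=9173))
```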

Performance Considerations

Two-Step Analysis
- Initial fast screening using minimap2
- Detailed alignment for candidate regions
- Adjustable thresholds for both steps
Speed vs Accuracy
- Lower thresholds increase sensitivity but may add false positives
- Higher thresholds increase specificity but might miss divergent sequences
- Default values optimized for HIV-1 group M
Memory Usage
- Efficient with long sequences due to minimap2
- Memory scales with sequence length
- Temporary files used for minimap2 analysis
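The two-step idea (a cheap screen followed by a closer comparison of the surviving candidates) can be illustrated without any third-party packages. In this sketch a shared k-mer fraction stands in for the minimap2 screen, and an ungapped identity stands in for the detailed alignment; both are deliberate simplifications and not what the script does. The function names and the toy query are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def kmer_set(seq: str, k: int) -> Set[str]:
    """All distinct k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen(query: str, ref: str, k: int = 8, threshold: float = 0.60) -> List[int]:
    """Step 1: window starts whose shared k-mer fraction meets the threshold."""
    if len(ref) < k:
        raise ValueError("reference is shorter than k")
    ref_kmers = kmer_set(ref, k)
    starts = []
    for start in range(len(query) - len(ref) + 1):
        window = kmer_set(query[start:start + len(ref)], k)
        if len(ref_kmers & window) / len(ref_kmers) >= threshold:
            starts.append(start)
    return starts


def refine(query: str, ref: str, starts: List[int],
           threshold: float = 0.70) -> Optional[Tuple[int, float]]:
    """Step 2: best (start, ungapped identity) among candidates, if any pass."""
    best = None
    for start in starts:
        window = query[start:start + len(ref)]
        identity = sum(a == b for a, b in zip(window, ref)) / len(ref)
        if identity >= threshold and (best is None or identity > best[1]):
            best = (start, identity)
    return best


# Reference: first 40 bases of the HXB2 R repeat shown earlier in this README.
ref = "gtctctctggttagaccagatctgagcctgggagctctct"
# Toy query: the reference with one substitution, padded on both sides.
query = "ccccc" + ref[:20] + "a" + ref[21:] + "ttttt"

candidates = screen(query, ref)
print(refine(query, ref, candidates))
```

Lowering the first threshold lets more windows through to the second step, which is the sensitivity versus false-positive trade-off described above.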

Best Practices

Threshold Selection
- Start with default thresholds
- Lower minimap-threshold for divergent sequences
- Adjust final-threshold based on expected similarity
Result Interpretation
- Check both similarity scores and positions
- Verify region order matches expected pattern
- Consider biological context (RNA vs DNA)
Troubleshooting
- Use --debug flag for detailed output
- Check for sequence quality issues
- Verify FASTA format is correct

Limitations

Reference Sequences
- Based on HIV-1 HXB2 references
- May have reduced sensitivity for highly divergent strains
- Best suited for HIV-1 group M subtype B analysis
Structure Detection
- Analyzes full sequence for all elements (U3, R, U5)
- Detects both terminal and internal matches
- Uses position information for pattern classification (e.g., R-U5-genome-U3-R for RNA)
Format Requirements
- Input can be any file containing FASTA-formatted sequences
- Common extensions (.fasta, .fsa, .fa, .txt) all supported
- Can process multiple sequences in a single file (multifasta untested)
- Handles both DNA and RNA sequences, but the script currently assumes ACGT base encoding (untested with ACGU)
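Since the script is untested with ACGU input and with multi-FASTA files, one workaround is to preprocess the input. The sketch below reads FASTA records with the standard library and rewrites U as T; `read_fasta` and `to_dna_alphabet` are hypothetical helpers, and writing the result back to a file for the script is left to the user.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for each record in FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_dna_alphabet(seq: str) -> str:
    """Upper-case the sequence and replace U with T."""
    return seq.upper().replace("U", "T")


# Toy two-record input (made-up sequences) to show multi-FASTA handling.
toy = ">rna_record\nACGU\nacgu\n>dna_record example\nACGT\n"
for name, seq in read_fasta(toy.splitlines()):
    print(name, to_dna_alphabet(seq))
```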

1. Alejandro R. Gener. "Full-coverage sequencing of HIV-1 provirus from a reference plasmid" bioRxiv 611848; doi: https://doi.org/10.1101/611848.
2. Alejandro R. Gener, Wei Zou, Brian T. Foley, Deborah P. Hyink, Paul E. Klotman. "Reference plasmid pHXB2_D is an HIV-1 molecular clone that exhibits identical LTRs and a single integration site indicative of an HIV provirus" bioRxiv 611848; doi: https://doi.org/10.1101/611848.
