Breast Cancer Classification of DNA Sequences

📚 Overview

This project implements a hybrid machine learning approach for classifying breast cancer from DNA sequences using bidirectional embeddings generated by DNABERT. The study processes over 46 million high-quality DNA sequences to distinguish between cancerous and non-cancerous genomic material.

Key Features

  • Bidirectional Analysis: Utilizes both forward and reverse DNA strand representations
  • Hybrid Classification: Combines Random Forest (forward embeddings) and Deep Neural Networks (backward embeddings)
  • High-Quality Data: Implements Q30 Phred score filtering for 99.9% base call accuracy
  • Scalable Processing: Handles large genomic datasets using efficient batch processing

🎯 Research Objectives

  • Apply DNA sequence analysis to distinguish genetic material linked to breast cancer from normal breast tissue
  • Develop a bidirectional DNA sequence embedding approach for improved classification
  • Demonstrate the potential of genomic information for early, non-invasive cancer diagnosis

📖 Research Paper

📄 Main Publication: Breast_Cancer_Classification_of_DNA_Sequences3.pdf

Authors:

  • Aakash Walavalkar (Michigan Technological University, USA)
  • Anushka Kumar (NMIMS, Mumbai)
  • Laavanya Mishra (NMIMS, Mumbai)

Keywords: Phred Score, Base Pair, Sequence Embeddings, Breast Cancer, DNA Sequencing, Classification

🗂️ Project Structure

├── 📄 Breast_Cancer_Classification_of_DNA_Sequences3.pdf  # Main research paper
├── 📊 Data Processing & Analysis
│   ├── 6. Generating Clean Readings for All Batches.pdf   # Quality control procedures
│   ├── cleaning_sequences.ipynb                          # Sequence cleaning implementation
│   ├── data_read_forward.ipynb                          # Forward strand processing
│   ├── data_read_backward.ipynb                         # Backward strand processing
│   └── rough.ipynb                                      # Utility functions
├── 🧬 Embedding Generation
│   ├── embeddings.py                                    # DNABERT embedding generation
│   └── embeddings.ipynb                                # Embedding analysis notebook
├── 🤖 Model Training
│   ├── forw_train_df.ipynb                             # Forward embeddings training
│   └── backw_train_df.ipynb                            # Backward embeddings training
└── 📈 Results & Analysis
    ├── UMAP visualizations                             # Dimensionality reduction plots
    └── Classification reports                          # Model performance metrics

🔬 Methodology

1. Data Acquisition

  • Source: National Genomics Data Center (NGDC) and NCBI SRA
  • Datasets:
    • Cancerous: Primary breast cancer sequences (SRR5177930)
    • Non-Cancerous: Normal breast tissue samples (SRR6269879)
  • Format: FASTQ files converted to Parquet for efficient processing

2. Quality Control & Filtering

import numpy as np

def is_quality_good(quality_scores):
    return np.min(quality_scores) >= 30  # every base must be Q30 or better

Filtering Criteria:

  • Phred Score: ≥ Q30 (99.9% base call accuracy)
  • Sequence Length: ≥ 100 base pairs
  • Batch Size: 100,000 reads per batch for memory efficiency
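
The criteria above can be sketched on a raw FASTQ record as follows. This is a minimal illustration rather than the project's code: it assumes Phred+33 quality encoding, and the names `decode_phred33` and `passes_filters` are hypothetical.

```python
MIN_PHRED = 30      # Q30 threshold from the criteria above
MIN_LENGTH = 100    # minimum read length in base pairs


def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_filters(sequence, quality_string):
    """Return True if the read is long enough and every base is Q30 or better."""
    if len(sequence) < MIN_LENGTH:
        return False
    scores = decode_phred33(quality_string)
    return min(scores) >= MIN_PHRED


good = passes_filters("A" * 100, "I" * 100)       # 'I' encodes Q40
short = passes_filters("A" * 50, "I" * 50)        # fails the length check
low = passes_filters("A" * 100, "I" * 99 + "5")   # '5' encodes Q20, so one base fails
```

Because the filter uses the minimum score, a single low-quality base is enough to discard the whole read, which is stricter than filtering on the mean quality.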

3. Data Processing Pipeline

Phase 1: FASTQ to Parquet Conversion

  • Parse FASTQ files using Biopython's SeqIO
  • Extract sequence ID, nucleotide sequence, and quality scores
  • Batch processing to handle 60M+ sequences efficiently
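
Phase 1 can be illustrated without third-party dependencies. The sketch below is not the project's implementation, which uses Biopython's SeqIO and writes Parquet. It swaps in a plain-Python parser that assumes four-line FASTQ records with no blank lines, and `read_fastq` and `batched` are hypothetical helper names.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of file (or a blank line, which this sketch treats as the end)
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        # Drop the leading '@' and anything after the first whitespace.
        yield header[1:].split()[0], sequence, quality


def batched(records, batch_size=100_000):
    """Group an iterator of records into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Tiny in-memory example with three made-up reads and a batch size of 2.
fastq_text = (
    "@read1 extra\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nII5I\n"
    "@read3\nGGCC\n+\n????\n"
)
batches = list(batched(read_fastq(io.StringIO(fastq_text)), batch_size=2))
```

Both functions are generators, so only one batch is held in memory at a time. That is the point of batching when the input runs to tens of millions of reads.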

Phase 2: Quality Filtering

  • Apply Q30 filtering to ensure high-quality sequences
  • Remove sequences with any base having Phred score < 30
  • Filter sequences shorter than 100 base pairs

Phase 3: Embedding Generation

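# DNABERT-6 tokenization and embedding (simplified; `sequence` is one cleaned read)
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNA_bert_6", trust_remote_code=True)
kmers = " ".join(sequence[i:i + 6] for i in range(len(sequence) - 5))  # overlapping 6-mers
embeddings = model(**tokenizer(kmers, return_tensors="pt"))[0].mean(dim=1)  # 768-dimensional vector; mean pooling is an assumption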
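
DNABERT-6 operates on overlapping 6-mers rather than raw nucleotide strings, so each read is converted to a space-separated k-mer "sentence" before tokenization. A minimal sketch of that conversion, with `seq_to_kmers` as a hypothetical helper name:

```python
def seq_to_kmers(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers joined by spaces."""
    sequence = sequence.upper()
    if len(sequence) < k:
        return ""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


example = seq_to_kmers("ACGTACGTA")

# A read of length n yields n - k + 1 k-mers, so a 100 bp read gives 95 of them.
n_tokens = len(seq_to_kmers("ACGT" * 25).split())
```

With the 100 bp minimum read length, every cleaned read produces at least 95 k-mer tokens before the tokenizer adds its special tokens.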

4. Model Architecture

Forward Embeddings: Random Forest Classifier

  • Algorithm: RandomForestClassifier with Optuna hyperparameter optimization
  • Best Parameters:
    • n_estimators: 161
    • max_depth: 21
    • max_features: sqrt
  • Performance: AUC = 0.9753, Accuracy = 97.5%

Backward Embeddings: Deep Neural Network

  • Architecture: Feedforward Neural Network
    • Input: 768-dimensional DNABERT embeddings
    • Layers: [4096, 2048, 1024, 512, 256, 128]
    • Dropout: 0.3
    • Optimizer: Adam
  • Performance: AUC = 0.9493, Accuracy = 88.05%
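
To give a sense of the network's size, the layer widths listed above can be turned into a parameter count. The sketch assumes fully connected layers with biases and a single output unit for the binary label. The output layer is an assumption, since the listing above does not specify it.

```python
INPUT_DIM = 768
HIDDEN_LAYERS = [4096, 2048, 1024, 512, 256, 128]
OUTPUT_DIM = 1  # assumption: one logit for cancerous vs. non-cancerous


def count_dense_parameters(input_dim, hidden_layers, output_dim):
    """Count weights and biases in a stack of fully connected layers."""
    sizes = [input_dim, *hidden_layers, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


total_parameters = count_dense_parameters(INPUT_DIM, HIDDEN_LAYERS, OUTPUT_DIM)
print(f"{total_parameters:,}")  # 14,327,809
```

Under these assumptions the network has roughly 14.3 million trainable parameters, most of them in the first two layers. Dropout adds none.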

📊 Results Summary

| Model Type | Embedding Direction | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|---|
| Random Forest (Tuned) | Forward | 97.50% | 0.97-0.98 | 0.97-0.98 | 0.97-0.98 | 0.9753 |
| Neural Network | Backward | 88.05% | 0.87-0.90 | 0.85-0.91 | 0.87-0.89 | 0.9493 |
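
For reference, the accuracy, precision, recall and F1 columns in the results above follow the standard binary-classification definitions. The sketch below computes them from confusion-matrix counts. The counts are made up for illustration and are not the study's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=90)
```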

🛠️ Installation & Setup

Prerequisites

  • Python 3.9.21
  • CUDA-compatible GPU (recommended for DNABERT processing)
  • Azure VM or similar cloud compute environment

Required Libraries

pip install torch transformers
pip install biopython pandas pyarrow numpy scikit-learn
pip install optuna umap-learn
pip install duckdb  # For efficient Parquet querying

DNABERT Model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNA_bert_6", trust_remote_code=True)

🔧 Usage

1. Data Preprocessing

# Clean and filter sequences
jupyter notebook cleaning_sequences.ipynb

# Process forward and backward strands
jupyter notebook data_read_forward.ipynb
jupyter notebook data_read_backward.ipynb

2. Generate Embeddings

# Run DNABERT embedding generation
python embeddings.py --input_dir ./clean_sequences --output_dir ./embeddings
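
The command above implies a command-line interface with `--input_dir` and `--output_dir` options. How `embeddings.py` actually parses them is not shown here. A minimal sketch of one way to do it with `argparse`, where `build_parser` and the help strings are hypothetical, would be:

```python
import argparse


def build_parser():
    """Build a parser for the two options used in the command above."""
    parser = argparse.ArgumentParser(description="Generate DNABERT embeddings.")
    parser.add_argument("--input_dir", required=True,
                        help="Directory of cleaned sequence batches")
    parser.add_argument("--output_dir", required=True,
                        help="Directory to write embeddings to")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--input_dir", "./clean_sequences", "--output_dir", "./embeddings"]
)
```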

3. Train Models

# Train forward embeddings model
jupyter notebook forw_train_df.ipynb

# Train backward embeddings model  
jupyter notebook backw_train_df.ipynb

📁 Data Organization

Cleaned Datasets Structure

clean_forward_reads/     # Forward cancerous sequences (604 batches, 28M sequences)
clean_backward_reads/    # Backward cancerous sequences (604 batches, 9M sequences)  
clean_forward_noncan/    # Forward non-cancerous sequences (553 batches, 5.7M sequences)
clean_backward_noncan/   # Backward non-cancerous sequences (553 batches, 3.8M sequences)

Embedding Files Structure

embeddings/
├── forward_cancerous_embeddings.npy     # 768-dim vectors + metadata CSV
├── forward_noncancerous_embeddings.npy  # 768-dim vectors + metadata CSV
├── backward_cancerous_embeddings.npy    # 768-dim vectors + metadata CSV
└── backward_noncancerous_embeddings.npy # 768-dim vectors + metadata CSV

🔬 Technical Implementation Details

Infrastructure

  • Platform: Azure Virtual Machine (cloud-hosted)
  • Access: SSH with Visual Studio Code remote development
  • Storage: Efficient Parquet format for large genomic datasets
  • Monitoring: Weights & Biases for experiment tracking

Key Technologies

  • DNABERT: Pre-trained transformer for genomic sequences with 6-mer tokenization
  • Optuna: Hyperparameter optimization framework
  • UMAP: Dimensionality reduction for visualization
  • DuckDB: Lightweight querying for large Parquet files

Quality Metrics

  • Total Sequences Processed: 46,968,954 high-quality sequences
  • Quality Threshold: Phred Q30 (1 in 1,000 error rate)
  • Batch Processing: 100,000 sequences per batch for memory efficiency
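
The link between the Q30 threshold and the quoted error rate follows from the Phred definition, Q = -10 · log10(P), where P is the probability that a base call is wrong. A short worked example (function names are illustrative):

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


p_q30 = phred_to_error_probability(30)   # 0.001, i.e. 1 error in 1,000 calls
accuracy_q30 = 1 - p_q30                 # 0.999, i.e. 99.9% per-base accuracy

# If every base in a 100 bp read sat exactly at Q30 and errors were independent,
# the chance that the whole read is error-free would be 0.999 ** 100, about 0.905.
read_error_free = accuracy_q30 ** 100
```

Since the filter requires every base to be at or above Q30, 0.905 is a lower bound for a 100 bp read under the independence assumption. Reads whose bases score higher do better.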

📈 Key Findings

  1. Bidirectional Approach: Forward and backward embeddings show different separability patterns in UMAP visualization
  2. Model Selection: Random Forest worked best for forward embeddings (97.50% accuracy, AUC 0.9753); a neural network worked better for backward embeddings (88.05% accuracy, AUC 0.9493)
  3. Quality Impact: Q30 filtering substantially reduces dataset size but improves classification accuracy
  4. Performance: Both models achieve accuracy above 88%, supporting the feasibility of genomic-based cancer classification

Future Work

  • Multi-class cancer type classification
  • Integration with clinical data
  • Real-time diagnostic applications
  • Cross-population validation studies

🤝 Contributing

This is an academic research project. For collaboration opportunities or questions, please contact the authors listed above.

📚 References

The complete reference list is available in the research paper. Key references include:

  1. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
  2. National Genomics Data Center (NGDC) datasets
  3. Apache Parquet for efficient genomic data storage
  4. Optuna for automated hyperparameter optimization

📄 License

This project is for academic research purposes. Please cite the paper if you use this work in your research.


Last Updated: September 2025
Version: 1.0
Status: Research Complete