Skip to content

KavanaN12/Title_Ranking_ML_model

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

8 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Title Ranking ML Model

A machine learning pipeline for ranking and scoring academic paper titles based on their relevance and quality. This project uses LightGBM with SBERT (Sentence Transformers) embeddings and custom feature fusion to predict title-abstract matching scores.

๐Ÿ“‹ Table of Contents

๐ŸŽฏ Project Overview

This project implements a complete machine learning pipeline for academic title ranking that:

  • Preprocesses text data (cleaning, deduplication)
  • Generates Features using:
    • SBERT embeddings (semantic similarity)
    • Lexical features (token overlap, length ratio)
    • Fusion-based scoring
  • Trains a LightGBM regression model with K-Fold cross-validation
  • Evaluates model performance with multiple metrics
  • Provides both GUI and CLI interfaces for predictions

๐Ÿ“ฆ Prerequisites

System Requirements

  • Python: 3.8 or higher
  • OS: Windows/Mac/Linux
  • RAM: Minimum 8GB (16GB recommended for SBERT model)
  • Disk Space: ~5GB (for SBERT model and datasets)

Required Software

  • Git (for cloning the repository)
  • Python package manager (pip)

๐Ÿš€ Installation

1. Clone the Repository

git clone https://github.com/KavanaN12/Title_Ranking_ML_model.git
cd Title_Ranking_ML_model/title_ranking_project

2. Create a Virtual Environment (Recommended)

# Windows (PowerShell)
python -m venv venv
.\venv\Scripts\Activate.ps1

# Windows (Command Prompt)
python -m venv venv
venv\Scripts\activate.bat

# Mac/Linux
python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

This will install:

  • numpy, pandas - Data processing
  • scikit-learn - Machine learning utilities
  • lightgbm - LightGBM model
  • sentence-transformers - SBERT embeddings
  • nltk, scipy - NLP utilities
  • matplotlib - Visualization
  • streamlit - Web interface
  • And other dependencies

4. Download Required NLP Data

After installation, download NLTK data:

python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"

๐Ÿ“ Project Structure

title_ranking_project/
โ”œโ”€โ”€ README.md                      # This file
โ”œโ”€โ”€ requirements.txt               # Python dependencies
โ”œโ”€โ”€ run_pipeline_final.py          # Main training pipeline
โ”œโ”€โ”€ gui_app.py                     # Tkinter GUI for predictions
โ”œโ”€โ”€ bulk_test.py                   # Bulk evaluation on test dataset
โ”œโ”€โ”€ model_test_lgb.py              # Detailed model testing
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ preprocess.py              # Text preprocessing functions
โ”‚   โ”œโ”€โ”€ features_fusion.py         # Feature extraction and fusion
โ”‚   โ”œโ”€โ”€ models.py                  # Model definitions
โ”‚   โ”œโ”€โ”€ train_eval.py              # Training and evaluation utilities
โ”‚   โ””โ”€โ”€ utils.py                   # Helper functions
โ”œโ”€โ”€ outputs/                       # Generated artifacts (after training)
โ”‚   โ”œโ”€โ”€ models/
โ”‚   โ”‚   โ””โ”€โ”€ lgbm.joblib           # Trained LightGBM model
โ”‚   โ”œโ”€โ”€ feature_builder.joblib    # Feature builder object
โ”‚   โ”œโ”€โ”€ scaler.joblib             # StandardScaler for features
โ”‚   โ”œโ”€โ”€ target_stats.json         # Target variable statistics
โ”‚   โ”œโ”€โ”€ predictions_lgbm.csv      # Training predictions
โ”‚   โ”œโ”€โ”€ pipeline_meta.json        # Pipeline metadata
โ”‚   โ”œโ”€โ”€ bulk_test_results/        # Bulk evaluation results
โ”‚   โ””โ”€โ”€ model_test_plots/         # Test visualization plots
โ””โ”€โ”€ datasets/                      # Data directory (symlink or copy)
    โ”œโ”€โ”€ train_real_world_dataset_10000.csv  # Training dataset
    โ””โ”€โ”€ real_world_dataset_2000_cleaned.csv # Test/evaluation dataset

๐Ÿ’ป Usage

1. Training the Model

To train the model from scratch using the training dataset:

python run_pipeline_final.py

What happens:

  • Loads training data from datasets/train_real_world_dataset_10000.csv
  • Preprocesses and cleans text
  • Builds SBERT embeddings and fusion features
  • Trains LightGBM with 5-Fold cross-validation
  • Saves all artifacts to outputs/
  • Generates initial predictions on training data

Expected output:

  • outputs/models/lgbm.joblib - Trained model
  • outputs/feature_builder.joblib - Feature builder
  • outputs/scaler.joblib - Feature scaler
  • outputs/target_stats.json - Target statistics
  • outputs/predictions_lgbm.csv - Training predictions

Estimated time: 10-30 minutes (depending on hardware)

2. Testing with GUI

Launch the interactive GUI for single predictions:

python gui_app.py

Features:

  • Enter title and abstract manually
  • Get instant predictions with confidence scores
  • Category mapping (Excellent, Strong, Moderate, Weak, NoMatch)
  • Simple, user-friendly interface

Requirements:

  • Model must be trained first (run run_pipeline_final.py)

3. Bulk Evaluation

Run batch predictions and evaluation on the test dataset:

python bulk_test.py

What happens:

  • Loads test dataset from datasets/real_world_dataset_2000_cleaned.csv
  • Generates predictions for all records
  • Computes evaluation metrics:
    • Mean Absolute Error (MAE)
    • Root Mean Squared Error (RMSE)
    • Rยฒ Score
    • Spearman/Pearson Correlation
  • Creates confusion matrix visualization
  • Generates category distribution plots

Output:

  • outputs/bulk_test_results/metrics.json - Performance metrics
  • outputs/bulk_test_results/predictions_bulk.csv - Bulk predictions
  • outputs/bulk_test_results/plots/ - Visualization plots

Estimated time: 2-5 minutes

4. Detailed Model Testing

Generate comprehensive test report with detailed analysis:

python model_test_lgb.py

What happens:

  • Loads trained model and artifacts
  • Computes detailed performance metrics
  • Generates individual feature importance plots
  • Creates prediction distribution plots
  • Produces residual analysis

Output:

  • outputs/model_test_plots/ - Detailed test plots
  • Console output with performance summary

๐Ÿ“Š Datasets

Training Dataset

Location: datasets/train_real_world_dataset_10000.csv

  • Size: 10,000 records
  • Source: CrossRef / Real-world academic papers
  • Format: CSV with columns:
    • title - Paper title
    • abstract - Paper abstract
    • expected - Target relevance score (0-1)

Test/Evaluation Dataset

Location: datasets/real_world_dataset_2000_cleaned.csv

  • Size: 2,000 records
  • Source: Real-world academic papers (non-overlapping with training)
  • Format: Same as training dataset
  • Usage: Bulk evaluation and model validation

Required Dataset Columns

Both datasets must have:

  • title (string) - Paper title
  • abstract (string) - Paper abstract
  • expected (float) - Target score (range 0-1)

๐Ÿ“ค Outputs

After training and evaluation, the following artifacts are generated:

Models & Artifacts

File Description
models/lgbm.joblib Trained LightGBM model
feature_builder.joblib FeatureFusionBuilder object
scaler.joblib StandardScaler for normalization
target_stats.json Mean/std of target variable
pipeline_meta.json Pipeline metadata and config

Predictions & Metrics

File Description
predictions_lgbm.csv Training set predictions
bulk_test_results/predictions_bulk.csv Test set predictions
bulk_test_results/metrics.json Performance metrics

Visualizations

File Description
bulk_test_results/plots/ Test set plots (distribution, confusion matrix, etc.)
model_test_plots/ Detailed model analysis plots

โš™๏ธ Configuration

Main Configuration Variables

Located in run_pipeline_final.py:

DATASET_FOLDER = "D:/aimlTextPr/datasets"  # Dataset location
CROSSREF_TRAIN_PATH = "datasets/train_real_world_dataset_10000.csv"
EVAL_TEST_PATH = "datasets/real_world_dataset_2000_cleaned.csv"
OUT_DIR = "outputs"
SBERT_MODEL = "sentence-transformers/paraphrase-MiniLM-L6-v2"
SEED = 42
N_SPLITS = 5  # K-Fold splits

To Customize

  1. Change training dataset: Modify CROSSREF_TRAIN_PATH
  2. Change test dataset: Modify EVAL_TEST_PATH
  3. Change SBERT model: Modify SBERT_MODEL to another HuggingFace model
  4. Adjust K-Fold splits: Change N_SPLITS value
  5. Change random seed: Modify SEED for reproducibility

๐Ÿ”ง Troubleshooting

Issue: "ModuleNotFoundError: No module named 'sentence_transformers'"

Solution:

pip install sentence-transformers --upgrade

Issue: "CUDA out of memory" or slow processing

Solution:

  • Reduce batch size in feature builder
  • Use smaller SBERT model:
    SBERT_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
  • Ensure sufficient RAM available

Issue: "Dataset file not found"

Solution:

  • Verify dataset paths in script match your system
  • Update absolute paths in configuration
  • Ensure datasets directory exists with required CSV files

Issue: GUI window doesn't open

Solution:

  • Ensure all dependencies are installed: pip install -r requirements.txt
  • On Linux, may need: sudo apt-get install python3-tk
  • Try running from command line to see error messages

Issue: "Model not found" when running GUI or testing

Solution:

  • Train the model first: python run_pipeline_final.py
  • Wait for training to complete and artifacts to be saved
  • Check outputs/ folder for model files

Issue: Low model performance

Consider:

  • Verify dataset quality and format
  • Check feature engineering settings in features_fusion.py
  • Increase training data size
  • Adjust model hyperparameters in run_pipeline_final.py
  • Review data preprocessing in preprocess.py

๐Ÿ“š Key Components

Feature Fusion Builder (src/features_fusion.py)

Generates comprehensive features:

  • SBERT Embeddings: Semantic similarity between title and abstract
  • Lexical Features: Token overlap, length ratios, BM25 scores
  • Fusion Score: Combined metric from all feature sources

Preprocessing (src/preprocess.py)

Text cleaning:

  • Lowercase conversion
  • Special character removal
  • Whitespace normalization
  • Deduplication

Model Training (src/train_eval.py)

  • K-Fold cross-validation
  • StandardScaler normalization
  • LightGBM regression with early stopping
  • Multiple evaluation metrics

๐Ÿ“ License

[Specify your license here - e.g., MIT, Apache 2.0, etc.]

๐Ÿ‘ค Author

Kavana N

GitHub: @KavanaN12

๐Ÿค Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

๐Ÿ“ง Support

For issues, questions, or suggestions, please:

  • Open an issue on GitHub
  • Check existing issues for solutions
  • Review the troubleshooting section above

๐Ÿ”— Related Resources


Last Updated: December 2025

Status: Active Development

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

โšก