# Title Ranking ML Model

A machine learning pipeline for ranking and scoring academic paper titles based on their relevance and quality. This project uses LightGBM with SBERT (Sentence Transformers) embeddings and custom feature fusion to predict title-abstract matching scores.
## Table of Contents

- Project Overview
- Prerequisites
- Installation
- Project Structure
- Usage
- Datasets
- Outputs
- Configuration
- Troubleshooting
## Project Overview

This project implements a complete machine learning pipeline for academic title ranking that:

- Preprocesses text data (cleaning, deduplication)
- Generates features using:
  - SBERT embeddings (semantic similarity)
  - Lexical features (token overlap, length ratio)
  - Fusion-based scoring
- Trains a LightGBM regression model with K-Fold cross-validation
- Evaluates model performance with multiple metrics
- Provides both GUI and CLI interfaces for predictions
## Prerequisites

- Python: 3.8 or higher
- OS: Windows/Mac/Linux
- RAM: minimum 8 GB (16 GB recommended for the SBERT model)
- Disk space: ~5 GB (for the SBERT model and datasets)
- Git (for cloning the repository)
- pip (Python package manager)
## Installation

Clone the repository:

```bash
git clone https://github.com/KavanaN12/Title_Ranking_ML_model.git
cd Title_Ranking_ML_model/title_ranking_project
```

Create and activate a virtual environment:

```bash
# Windows (PowerShell)
python -m venv venv
.\venv\Scripts\Activate.ps1

# Windows (Command Prompt)
python -m venv venv
venv\Scripts\activate.bat

# Mac/Linux
python3 -m venv venv
source venv/bin/activate
```

Install the dependencies:

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

This will install:

- `numpy`, `pandas` - Data processing
- `scikit-learn` - Machine learning utilities
- `lightgbm` - LightGBM model
- `sentence-transformers` - SBERT embeddings
- `nltk`, `scipy` - NLP utilities
- `matplotlib` - Visualization
- `streamlit` - Web interface
- And other dependencies
After installation, download NLTK data:
```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```

## Project Structure

```
title_ranking_project/
├── README.md                  # This file
├── requirements.txt           # Python dependencies
├── run_pipeline_final.py      # Main training pipeline
├── gui_app.py                 # Tkinter GUI for predictions
├── bulk_test.py               # Bulk evaluation on test dataset
├── model_test_lgb.py          # Detailed model testing
├── src/
│   ├── __init__.py
│   ├── preprocess.py          # Text preprocessing functions
│   ├── features_fusion.py     # Feature extraction and fusion
│   ├── models.py              # Model definitions
│   ├── train_eval.py          # Training and evaluation utilities
│   └── utils.py               # Helper functions
├── outputs/                   # Generated artifacts (after training)
│   ├── models/
│   │   └── lgbm.joblib        # Trained LightGBM model
│   ├── feature_builder.joblib # Feature builder object
│   ├── scaler.joblib          # StandardScaler for features
│   ├── target_stats.json      # Target variable statistics
│   ├── predictions_lgbm.csv   # Training predictions
│   ├── pipeline_meta.json     # Pipeline metadata
│   ├── bulk_test_results/     # Bulk evaluation results
│   └── model_test_plots/      # Test visualization plots
└── datasets/                  # Data directory (symlink or copy)
    ├── train_real_world_dataset_10000.csv  # Training dataset
    └── real_world_dataset_2000_cleaned.csv # Test/evaluation dataset
```
## Usage

### Train the Model

To train the model from scratch using the training dataset:

```bash
python run_pipeline_final.py
```

What happens:

- Loads training data from `datasets/train_real_world_dataset_10000.csv`
- Preprocesses and cleans text
- Builds SBERT embeddings and fusion features
- Trains LightGBM with 5-fold cross-validation
- Saves all artifacts to `outputs/`
- Generates initial predictions on training data

Expected output:

- `outputs/models/lgbm.joblib` - Trained model
- `outputs/feature_builder.joblib` - Feature builder
- `outputs/scaler.joblib` - Feature scaler
- `outputs/target_stats.json` - Target statistics
- `outputs/predictions_lgbm.csv` - Training predictions

Estimated time: 10-30 minutes (depending on hardware)
### Run the GUI

Launch the interactive GUI for single predictions:

```bash
python gui_app.py
```

Features:

- Enter title and abstract manually
- Get instant predictions with confidence scores
- Category mapping (Excellent, Strong, Moderate, Weak, NoMatch)
- Simple, user-friendly interface

Requirements:

- Model must be trained first (run `run_pipeline_final.py`)
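The exact score-to-category thresholds live in the GUI code; as an illustration only, here is a minimal sketch with assumed cut-offs (the real values in `gui_app.py` may differ):

```python
def score_to_category(score: float) -> str:
    """Map a 0-1 relevance score to a label.

    The thresholds below are illustrative assumptions, not
    necessarily the values used by gui_app.py.
    """
    if score >= 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Strong"
    if score >= 0.4:
        return "Moderate"
    if score >= 0.2:
        return "Weak"
    return "NoMatch"

print(score_to_category(0.85))  # -> Excellent
```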
### Bulk Evaluation

Run batch predictions and evaluation on the test dataset:

```bash
python bulk_test.py
```

What happens:

- Loads the test dataset from `datasets/real_world_dataset_2000_cleaned.csv`
- Generates predictions for all records
- Computes evaluation metrics:
  - Mean Absolute Error (MAE)
  - Root Mean Squared Error (RMSE)
  - R² Score
  - Spearman/Pearson correlation
- Creates a confusion matrix visualization
- Generates category distribution plots

Output:

- `outputs/bulk_test_results/metrics.json` - Performance metrics
- `outputs/bulk_test_results/predictions_bulk.csv` - Bulk predictions
- `outputs/bulk_test_results/plots/` - Visualization plots

Estimated time: 2-5 minutes
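The same metrics can be reproduced from any predictions file with scikit-learn and SciPy. A sketch on toy data (the arrays stand in for the expected and predicted score columns; actual column names in `predictions_bulk.csv` may differ):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from scipy.stats import spearmanr, pearsonr

# Toy stand-ins for the expected and predicted score columns
y_true = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
y_pred = np.array([0.8, 0.2, 0.4, 0.6, 0.35])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt(MSE)
r2 = r2_score(y_true, y_pred)
rho, _ = spearmanr(y_true, y_pred)   # rank correlation
r, _ = pearsonr(y_true, y_pred)      # linear correlation

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f} "
      f"Spearman={rho:.3f} Pearson={r:.3f}")
```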
### Detailed Model Testing

Generate a comprehensive test report with detailed analysis:

```bash
python model_test_lgb.py
```

What happens:

- Loads the trained model and artifacts
- Computes detailed performance metrics
- Generates individual feature importance plots
- Creates prediction distribution plots
- Produces residual analysis

Output:

- `outputs/model_test_plots/` - Detailed test plots
- Console output with performance summary
## Datasets

### Training Dataset

Location: `datasets/train_real_world_dataset_10000.csv`

- Size: 10,000 records
- Source: CrossRef / real-world academic papers
- Format: CSV with columns:
  - `title` - Paper title
  - `abstract` - Paper abstract
  - `expected` - Target relevance score (0-1)

### Test Dataset

Location: `datasets/real_world_dataset_2000_cleaned.csv`

- Size: 2,000 records
- Source: Real-world academic papers (non-overlapping with training)
- Format: Same as the training dataset
- Usage: Bulk evaluation and model validation

### Required Schema

Both datasets must have:

- `title` (string) - Paper title
- `abstract` (string) - Paper abstract
- `expected` (float) - Target score (range 0-1)
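A quick way to check that a CSV matches this schema before training, using only the standard library (the helper name here is ours, not part of the project):

```python
import csv
import io

REQUIRED = ("title", "abstract", "expected")

def validate_rows(csv_text: str) -> int:
    """Check required columns and that 'expected' is a float in [0, 1].

    Returns the number of valid data rows; raises ValueError otherwise.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    n = 0
    for i, row in enumerate(reader, start=2):  # line 1 is the header
        score = float(row["expected"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"line {i}: expected={score} outside [0, 1]")
        n += 1
    return n

sample = "title,abstract,expected\nA study,Some abstract,0.75\n"
print(validate_rows(sample))  # -> 1
```

In practice you would pass the contents of `datasets/train_real_world_dataset_10000.csv` instead of the inline sample.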
## Outputs

After training and evaluation, the following artifacts are generated:

### Model Artifacts

| File | Description |
|---|---|
| `models/lgbm.joblib` | Trained LightGBM model |
| `feature_builder.joblib` | FeatureFusionBuilder object |
| `scaler.joblib` | StandardScaler for normalization |
| `target_stats.json` | Mean/std of target variable |
| `pipeline_meta.json` | Pipeline metadata and config |

### Predictions and Metrics

| File | Description |
|---|---|
| `predictions_lgbm.csv` | Training set predictions |
| `bulk_test_results/predictions_bulk.csv` | Test set predictions |
| `bulk_test_results/metrics.json` | Performance metrics |

### Plots

| File | Description |
|---|---|
| `bulk_test_results/plots/` | Test set plots (distribution, confusion matrix, etc.) |
| `model_test_plots/` | Detailed model analysis plots |
## Configuration

Key settings are located in `run_pipeline_final.py`:

```python
DATASET_FOLDER = "D:/aimlTextPr/datasets"  # Dataset location
CROSSREF_TRAIN_PATH = "datasets/train_real_world_dataset_10000.csv"
EVAL_TEST_PATH = "datasets/real_world_dataset_2000_cleaned.csv"
OUT_DIR = "outputs"
SBERT_MODEL = "sentence-transformers/paraphrase-MiniLM-L6-v2"
SEED = 42
N_SPLITS = 5  # K-Fold splits
```

Common changes:

- Change the training dataset: modify `CROSSREF_TRAIN_PATH`
- Change the test dataset: modify `EVAL_TEST_PATH`
- Change the SBERT model: set `SBERT_MODEL` to another Hugging Face model
- Adjust K-Fold splits: change `N_SPLITS`
- Change the random seed: modify `SEED` for reproducibility
## Troubleshooting

**`sentence-transformers` import or model-loading errors**

Solution:

```bash
pip install sentence-transformers --upgrade
```

**Out-of-memory errors during embedding**

Solution:

- Reduce the batch size in the feature builder
- Use a smaller SBERT model: `SBERT_MODEL = "sentence-transformers/all-MiniLM-L6-v2"`
- Ensure sufficient RAM is available

**Dataset files not found**

Solution:

- Verify that the dataset paths in the script match your system
- Update absolute paths in the configuration
- Ensure the `datasets/` directory exists with the required CSV files

**GUI fails to launch**

Solution:

- Ensure all dependencies are installed: `pip install -r requirements.txt`
- On Linux, you may need: `sudo apt-get install python3-tk`
- Run from the command line to see error messages

**Model artifacts missing**

Solution:

- Train the model first: `python run_pipeline_final.py`
- Wait for training to complete and artifacts to be saved
- Check the `outputs/` folder for model files

**Poor model performance**

Consider:

- Verifying dataset quality and format
- Checking feature engineering settings in `features_fusion.py`
- Increasing the training data size
- Adjusting model hyperparameters in `run_pipeline_final.py`
- Reviewing data preprocessing in `preprocess.py`
## How It Works

### Feature Extraction

The pipeline generates comprehensive features:

- SBERT embeddings: semantic similarity between title and abstract
- Lexical features: token overlap, length ratios, BM25 scores
- Fusion score: combined metric from all feature sources
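The lexical side of the feature set is simple to illustrate. A toy sketch of token overlap and length ratio (the project's actual implementation lives in `src/features_fusion.py` and may differ):

```python
def lexical_features(title: str, abstract: str) -> dict:
    """Toy versions of two lexical features: token overlap and length ratio."""
    t_tokens = set(title.lower().split())
    a_tokens = set(abstract.lower().split())
    # Fraction of title tokens that also appear in the abstract
    overlap = len(t_tokens & a_tokens) / max(len(t_tokens), 1)
    # Title length relative to abstract length, in tokens
    length_ratio = len(title.split()) / max(len(abstract.split()), 1)
    return {"token_overlap": overlap, "length_ratio": length_ratio}

feats = lexical_features(
    "deep learning for title ranking",
    "we study deep learning methods for ranking academic titles",
)
print(feats)  # token_overlap -> 0.8 (4 of 5 title tokens appear in the abstract)
```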
### Preprocessing

Text cleaning includes:

- Lowercase conversion
- Special-character removal
- Whitespace normalization
- Deduplication
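A minimal sketch of what such a cleaning step looks like (the project's own version is in `src/preprocess.py` and may differ in detail):

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip special characters, and normalize whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

# Deduplication: identical rows after cleaning collapse to one entry
rows = ["  Deep   Learning!! ", "deep learning", "Deep Learning"]
cleaned = list(dict.fromkeys(clean_text(r) for r in rows))
print(cleaned)  # -> ['deep learning']
```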
### Training

- K-Fold cross-validation
- StandardScaler normalization
- LightGBM regression with early stopping
- Multiple evaluation metrics
## License

[Specify your license here - e.g., MIT, Apache 2.0, etc.]
## Author

Kavana N

GitHub: @KavanaN12
## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request
## Support

For issues, questions, or suggestions, please:

- Open an issue on GitHub
- Check existing issues for solutions
- Review the Troubleshooting section above
Last Updated: December 2025
Status: Active Development