This project is a machine learning pipeline that detects spam SMS messages using natural language processing (NLP) techniques and a Naive Bayes classifier. The model is trained on the UCI SMS Spam Collection Dataset.
- Text Preprocessing with NLTK: tokenization, stopword removal, stemming
- Feature Extraction using CountVectorizer with unigrams and bigrams
- Classification Model: Multinomial Naive Bayes
- Hyperparameter Tuning with GridSearchCV
- Spam Prediction with probability scores
- Model Persistence using joblib
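
The features above can be wired together roughly as follows. This is a minimal sketch, not the code in `main.py`: the function names `preprocess`, `build_search`, and `save_best_model`, the `alpha` grid, and the file name `spam_model.joblib` are assumptions made for illustration. To stay runnable without downloading NLTK corpora, it uses a regular expression in place of an NLTK tokenizer and scikit-learn's built-in English stopword list in place of NLTK's; only the Porter stemmer comes from NLTK here.

```python
import re

import joblib
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

_stemmer = PorterStemmer()
_token_re = re.compile(r"[a-z0-9]+")


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and stem a single message."""
    tokens = _token_re.findall(text.lower())
    kept = [_stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]
    return " ".join(kept)


def build_search(cv=5):
    """Return a grid search over a CountVectorizer + MultinomialNB pipeline."""
    pipeline = Pipeline([
        # ngram_range=(1, 2) produces both unigrams and bigrams.
        ("vectorizer", CountVectorizer(preprocessor=preprocess, ngram_range=(1, 2))),
        ("classifier", MultinomialNB()),
    ])
    # Assumed grid: only the Naive Bayes smoothing strength is tuned here.
    param_grid = {"classifier__alpha": [0.1, 0.5, 1.0]}
    return GridSearchCV(pipeline, param_grid, cv=cv)


def save_best_model(search, path="spam_model.joblib"):
    """Persist the best pipeline found by a fitted grid search (hypothetical path)."""
    joblib.dump(search.best_estimator_, path)
```

Calling `build_search().fit(texts, labels)` on lists of message strings and labels runs the search. Because the vectorizer holds a reference to `preprocess`, that function has to be importable wherever the saved model is later loaded.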
The dataset contains 5,574 labeled SMS messages, split into spam and ham (not spam). It is publicly available from the UCI Machine Learning Repository.
- Downloaded and extracted using requests and zipfile
- Stored in: sms_spam_collection/SMSSpamCollection
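
A download step along these lines could look like the sketch below. It is not the contents of `dataset.py`: the archive URL is an assumption (the README does not state it), and the helper names `download_zip`, `extract_zip`, and `load_messages` are hypothetical. The loader assumes each line of `SMSSpamCollection` holds a label and the message text separated by a tab.

```python
import io
import zipfile
from pathlib import Path

# Assumed archive location on the UCI repository; adjust if it has moved.
DATASET_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "00228/smsspamcollection.zip"
)
TARGET_DIR = Path("sms_spam_collection")


def download_zip(url=DATASET_URL, timeout=30):
    """Fetch the zip archive and return its raw bytes."""
    import requests  # imported here so the helpers below also work offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


def extract_zip(data, target_dir=TARGET_DIR):
    """Unpack zip bytes into target_dir and return the directory path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        archive.extractall(target_dir)
    return target_dir


def load_messages(path):
    """Read tab-separated 'label<TAB>text' lines into (label, text) tuples."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            label, _, text = line.partition("\t")
            rows.append((label, text))
    return rows
```

With network access, `extract_zip(download_zip())` followed by `load_messages(TARGET_DIR / "SMSSpamCollection")` would yield the labeled messages as a list of tuples.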
git clone https://github.com/YOUR_USERNAME/spam-classification.git
cd spam-classification
python -m venv myenv
# Windows
myenv\Scripts\activate
# macOS/Linux
source myenv/bin/activate
pip install -r requirements.txt
📥 Download & extract the dataset
python dataset.py
🤖 Train and evaluate the model
python main.py
The trained model evaluates custom SMS messages and returns:
- Spam/Not-Spam label
- Probability scores for each class
Example:
Message: Congratulations! You've won a $1000 Walmart gift card.
Prediction: Spam
Spam Probability: 0.98
Not-Spam Probability: 0.02
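
Output in this shape can be produced from any fitted classifier that exposes `classes_` and `predict_proba`, as scikit-learn pipelines ending in `MultinomialNB` do. The sketch below is illustrative only: `predict_message` and `format_prediction` are hypothetical helpers, and it assumes the class labels are the strings `"spam"` and `"ham"` as in the raw dataset. The project's own code may encode labels differently.

```python
def predict_message(model, message):
    """Return (label, {class: probability}) for one SMS message."""
    probabilities = model.predict_proba([message])[0]
    scores = {str(cls): float(p) for cls, p in zip(model.classes_, probabilities)}
    # The predicted label is taken as the class with the highest probability.
    label = max(scores, key=scores.get)
    return label, scores


def format_prediction(message, label, scores, spam_class="spam", ham_class="ham"):
    """Render a prediction in the README's example layout."""
    verdict = "Spam" if label == spam_class else "Not-Spam"
    return "\n".join([
        f"Message: {message}",
        f"Prediction: {verdict}",
        f"Spam Probability: {scores[spam_class]:.2f}",
        f"Not-Spam Probability: {scores[ham_class]:.2f}",
    ])
```

A model saved with joblib can be restored with `joblib.load` and passed in as `model`. Printing `format_prediction(message, *predict_message(model, message))` then gives the four-line summary.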