A machine learning classifier to identify spam emails using Natural Language Processing (NLP) techniques and text feature extraction.
Email spam is a persistent problem — unwanted promotional emails, phishing attempts, and malicious content flood inboxes daily. This project builds an Email Spam Detection System using NLP and machine learning to automatically classify emails as spam or legitimate (ham).
Developed as part of my Data Science Internship at Oasis Infobyte.
- ✅ Built an NLP-based text classifier for spam detection
- ✅ Used TF-IDF vectorization to convert text into numerical features
- ✅ Trained and compared multiple classification models
- ✅ Achieved high accuracy on a real-world spam email dataset
- ✅ Clean Python script ready for production use
| Property | Details |
|---|---|
| Source | spam.csv — real-world email dataset |
| Task | Binary Classification (Spam vs Ham) |
| Features | Email text content |
| Target | spam / ham (legitimate) |
| Email Text | Label |
|---|---|
| "Congratulations! You've won $1000. Click here to claim." | SPAM |
| "Hey, are we still on for the meeting at 3pm?" | HAM |
| "URGENT: Your account will be suspended unless you verify now" | SPAM |
| "Thanks for sending the report, looks great!" | HAM |
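A minimal sketch of how rows like these can be held in a pandas DataFrame and turned into a binary target. The column names `text`, `label`, and `is_spam` are assumptions for illustration; the actual column names in `spam.csv` may differ.

```python
import pandas as pd

# The four sample rows from the table above.
# Column names "text" and "label" are assumed, not taken from spam.csv.
df = pd.DataFrame(
    {
        "text": [
            "Congratulations! You've won $1000. Click here to claim.",
            "Hey, are we still on for the meeting at 3pm?",
            "URGENT: Your account will be suspended unless you verify now",
            "Thanks for sending the report, looks great!",
        ],
        "label": ["SPAM", "HAM", "SPAM", "HAM"],
    }
)

# Encode the target for binary classification: 1 = spam, 0 = ham.
df["is_spam"] = (df["label"].str.lower() == "spam").astype(int)

print(df[["label", "is_spam"]])
```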
- Removed special characters, numbers, and punctuation
- Converted all text to lowercase
- Removed stop words (common words like "the", "is", "at")
- Applied stemming/lemmatization to normalize words
- Used TF-IDF (Term Frequency-Inverse Document Frequency) vectorization
- Converted email text into numerical feature vectors
- Captured word importance across the entire dataset
- Trained multiple classifiers (Naive Bayes, Logistic Regression, SVM)
- Selected the best model based on accuracy and precision
- Used cross-validation to check that the results generalize rather than overfit
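The steps above can be sketched end to end as follows. This is an illustrative sketch, not the code in `SpamDetection.py`: the toy emails are made up, `clean_text` is a hypothetical helper, scikit-learn's built-in English stop-word list stands in for NLTK/SpaCy, stemming is left out to keep the example dependency-free, and `LinearSVC` is assumed as the SVM variant.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def clean_text(text: str) -> str:
    """Lowercase, keep letters only, and drop English stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)


# Toy corpus, made up for illustration (the real data lives in spam.csv).
emails = [
    "Congratulations! You've won $1000. Click here to claim.",
    "URGENT: Your account will be suspended unless you verify now",
    "Win a FREE prize today, click the link!!!",
    "Free cash offer, claim your reward now",
    "Hey, are we still on for the meeting at 3pm?",
    "Thanks for sending the report, looks great!",
    "Can you review the attached report before the meeting?",
    "See you at lunch tomorrow",
]
labels = ["spam"] * 4 + ["ham"] * 4

cleaned = [clean_text(e) for e in emails]

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

scores = {}
for name, clf in candidates.items():
    # TF-IDF sits inside the pipeline, so each CV fold fits the
    # vocabulary on its own training split only.
    model = make_pipeline(TfidfVectorizer(), clf)
    fold_scores = cross_val_score(model, cleaned, labels, cv=2, scoring="accuracy")
    scores[name] = fold_scores.mean()

best_name = max(scores, key=scores.get)
print(scores)
print("best:", best_name)
```

With a corpus this small the scores mean little; on the full dataset a larger `cv` (for example 5) and an extra `scoring="precision"` pass would be more informative.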
| Tool | Purpose |
|---|---|
| Python | Core programming language |
| Pandas | Data manipulation |
| Scikit-learn | ML models & TF-IDF vectorization |
| NLTK / SpaCy | Text preprocessing & NLP |
| NumPy | Numerical operations |
| Metric | Score |
|---|---|
| Accuracy | (Add your score) |
| Precision | (Add your score) |
| Recall | (Add your score) |
| F1 Score | (Add your score) |
💡 Run `SpamDetection.py` to see the full evaluation metrics.
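For reference, the four metrics in the table can be computed with scikit-learn as in the sketch below. The label arrays are toy values chosen for illustration, not results from this project.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels for illustration only (1 = spam, 0 = ham); not project results.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of the two
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here TP = 2, FP = 1, FN = 2, and TN = 3, so accuracy is 5/8, precision is 2/3, recall is 1/2, and F1 is 4/7. For a spam filter, precision matters because a false positive means a legitimate email gets flagged.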
git clone https://github.com/Khiladi-786/Email-Spam-Detection.git
cd Email-Spam-Detection
pip install pandas scikit-learn nltk numpy
python SpamDetection.py
Email-Spam-Detection/
│
├── SpamDetection.py # Main spam detection script
├── spam.csv # Email dataset
└── README.md # Project documentation
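Common spam indicators (after the preprocessing described above, the model sees only the word-level ones, since capitalization and punctuation are stripped):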
- Words like "free", "win", "urgent", "click here", "congratulations"
- Excessive use of ALL CAPS and exclamation marks!!!
- Suspicious links and URLs
- Poor grammar and spelling errors
- Requests for personal information or account verification
How the model works:
- Email text is preprocessed (cleaned and normalized)
- TF-IDF converts text into numerical features
- Classifier predicts spam/ham based on word patterns
- Emails predicted as spam with high confidence are flagged as suspicious
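The flow above might look like this in code. It is a sketch under assumptions: the training emails are made up, Naive Bayes is picked arbitrarily from the classifiers listed earlier, and `SPAM_THRESHOLD`, `spam_probability`, and `flag_email` are hypothetical names that may not exist in `SpamDetection.py`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, made up for illustration (not taken from spam.csv).
train_texts = [
    "win a free prize now click here",
    "urgent verify your account now",
    "free cash offer click to claim",
    "congratulations you have won a free gift",
    "are we still on for the meeting at 3pm",
    "thanks for sending the report",
    "see you at lunch tomorrow",
    "please review the attached report before the meeting",
]
train_labels = ["spam"] * 4 + ["ham"] * 4

# TfidfVectorizer lowercases and tokenizes the text;
# stop_words="english" drops common words before weighting.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

SPAM_THRESHOLD = 0.8  # hypothetical cut-off; tune it on validation data


def spam_probability(text: str) -> float:
    """Return the model's estimated probability that `text` is spam."""
    spam_col = list(model.classes_).index("spam")
    return float(model.predict_proba([text])[0][spam_col])


def flag_email(text: str) -> tuple[float, bool]:
    """Return (spam probability, whether it crosses the threshold)."""
    p = spam_probability(text)
    return p, p >= SPAM_THRESHOLD


for email in ["Click now to claim your free prize",
              "Thanks, see you at the meeting tomorrow"]:
    prob, flagged = flag_email(email)
    print(f"{prob:.2f} flagged={flagged} :: {email}")
```

A higher threshold trades recall for precision: fewer legitimate emails get flagged, but more spam slips through.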
Nikhil More
B.Tech CSE (AI/ML), University of Mumbai (2023–2027)
Data Science Intern @ Oasis Infobyte | C-DAC Ambassador | Google Student Ambassador
This project is licensed under the MIT License.
⭐ If you found this project useful, please give it a star!