📧 Email Spam Detection


A machine learning classifier to identify spam emails using Natural Language Processing (NLP) techniques and text feature extraction.


📌 Project Overview

Email spam is a persistent problem: unwanted promotional emails, phishing attempts, and malicious content flood inboxes daily. This project builds an Email Spam Detection System that uses NLP and machine learning to automatically classify emails as spam or legitimate (ham).

Developed as part of my Data Science Internship at Oasis Infobyte.


🎯 Key Highlights

  • ✅ Built an NLP-based text classifier for spam detection
  • ✅ Used TF-IDF vectorization to convert text into numerical features
  • ✅ Trained and compared multiple classification models
  • ✅ Achieved high accuracy on a real-world spam email dataset
  • ✅ Clean Python script ready for production use

📊 Dataset

| Property | Details |
|----------|---------|
| Source | spam.csv (real-world email dataset) |
| Task | Binary Classification (Spam vs Ham) |
| Features | Email text content |
| Target | spam / ham (legitimate) |

🔍 Sample Data

| Email Text | Label |
|------------|-------|
| "Congratulations! You've won $1000. Click here to claim." | SPAM |
| "Hey, are we still on for the meeting at 3pm?" | HAM |
| "URGENT: Your account will be suspended unless you verify now" | SPAM |
| "Thanks for sending the report, looks great!" | HAM |

🧠 Methodology

1. Text Preprocessing

  • Removed special characters, numbers, and punctuation
  • Converted all text to lowercase
  • Removed stop words (common words like "the", "is", "at")
  • Applied stemming/lemmatization to normalize words
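The preprocessing steps above can be sketched roughly as follows. This is a minimal standard-library illustration, not the code in SpamDetection.py: the function name `preprocess`, the tiny `STOP_WORDS` set, and the optional `normalize` hook are all assumptions made for the example. A real run would more likely use the fuller stop-word lists and stemmers that NLTK or spaCy provide.

```python
import re
from typing import Callable, Optional

# Tiny illustrative stop-word list (an assumption); NLTK and spaCy ship fuller ones.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "at", "be", "for", "is",
    "of", "on", "the", "to", "we", "will", "you", "your",
})


def preprocess(text: str,
               normalize: Optional[Callable[[str], str]] = None) -> list[str]:
    """Lowercase, strip non-letters, drop stop words, optionally normalize tokens."""
    text = text.lower()
    # Replace digits, punctuation, and special characters with spaces.
    # Note: this also drops non-ASCII letters, which may not suit every dataset.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    if normalize is not None:
        # Hook for stemming or lemmatization, supplied by the caller.
        tokens = [normalize(tok) for tok in tokens]
    return tokens


tokens = preprocess("URGENT: Your account will be suspended unless you verify now")
# tokens == ["urgent", "account", "suspended", "unless", "verify", "now"]
```

If NLTK is installed, passing `nltk.stem.PorterStemmer().stem` as `normalize` would be one way to plug in the stemming step.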

2. Feature Extraction

  • Used TF-IDF (Term Frequency-Inverse Document Frequency) vectorization
  • Converted email text into numerical feature vectors
  • Captured word importance across the entire dataset
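To make the weighting concrete, here is a small standard-library sketch of what TF-IDF computes. It follows the smoothed formula that scikit-learn's `TfidfVectorizer` uses with its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The function name `tfidf` and the sample documents are illustrative assumptions; the project itself relies on scikit-learn for this step.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: rarer terms get larger values.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}

    vectors = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # L2-normalize so long and short emails are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical, already-tokenized emails.
docs = [["free", "prize", "free"], ["meeting", "report"], ["free", "report"]]
vectors = tfidf(docs)
```

In the second document, "meeting" appears in only one email while "report" appears in two, so "meeting" receives the larger weight.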

3. Model Training

  • Trained multiple classifiers (Naive Bayes, Logistic Regression, SVM)
  • Selected best model based on accuracy and precision
  • Validated with cross-validation to prevent overfitting
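A comparison like the one described above might look like the sketch below, using scikit-learn pipelines and cross-validation. The toy `texts` and `labels`, the `compare_models` helper, the choice of `SVC(kernel="linear")`, and `cv=2` are assumptions for illustration. The actual script trains on spam.csv, whose column names are not documented here, and may select models differently. Only accuracy is scored in this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus for illustration only; the project itself trains on spam.csv.
texts = [
    "win a free prize now",
    "click here to claim your free money",
    "urgent verify your account now",
    "congratulations you won click to claim",
    "are we still on for the meeting",
    "thanks for sending the report",
    "see you at lunch tomorrow",
    "please review the attached report",
]
labels = ["spam"] * 4 + ["ham"] * 4

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(kernel="linear"),
}


def compare_models(texts, labels, cv=2):
    """Return the mean cross-validated accuracy for each candidate classifier."""
    scores = {}
    for name, classifier in candidates.items():
        # Fitting TF-IDF inside the pipeline keeps each validation fold
        # out of the vocabulary and IDF statistics learned during training.
        pipeline = make_pipeline(TfidfVectorizer(), classifier)
        fold_scores = cross_val_score(pipeline, texts, labels,
                                      cv=cv, scoring="accuracy")
        scores[name] = float(fold_scores.mean())
    return scores


scores = compare_models(texts, labels)
best = max(scores, key=scores.get)
```

Putting the vectorizer inside the pipeline matters: fitting TF-IDF on the full dataset before splitting would leak information from the validation folds into training.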

🛠️ Tech Stack

| Tool | Purpose |
|------|---------|
| Python | Core programming language |
| Pandas | Data manipulation |
| Scikit-learn | ML models & TF-IDF vectorization |
| NLTK / spaCy | Text preprocessing & NLP |
| NumPy | Numerical operations |

🏆 Model Results

| Metric | Score |
|--------|-------|
| Accuracy | (Add your score) |
| Precision | (Add your score) |
| Recall | (Add your score) |
| F1 Score | (Add your score) |

💡 Run SpamDetection.py to see the full evaluation metrics.
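For reference, the four metrics in the table can be derived from true and predicted labels as in the sketch below. The helper name `classification_metrics` and the toy labels are assumptions for illustration. The numbers it produces are not results from this project.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    tn = sum(t != positive and p != positive for t, p in pairs)  # ham passed

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, not project output: 2 spam caught, 1 missed, 1 ham wrongly flagged.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]
metrics = classification_metrics(y_true, y_pred)
```

Precision matters most when a legitimate email landing in the spam folder is costly; recall matters most when missed spam is the bigger concern.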


🚀 How to Run

1. Clone the repository

git clone https://github.com/Khiladi-786/Email-Spam-Detection.git
cd Email-Spam-Detection

2. Install dependencies

pip install pandas scikit-learn nltk numpy

3. Run the classifier

python SpamDetection.py

📁 Project Structure

Email-Spam-Detection/
│
├── SpamDetection.py      # Main spam detection script
├── spam.csv              # Email dataset
└── README.md             # Project documentation

💡 Key Insights

Common spam indicators detected by the model:

  • Words like "free", "win", "urgent", "click here", "congratulations"
  • Excessive use of ALL CAPS and exclamation marks!!!
  • Suspicious links and URLs
  • Poor grammar and spelling errors
  • Requests for personal information or account verification
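Some of these cues (capitalization, exclamation marks, URLs) are removed by lowercasing and punctuation stripping, so a bag-of-words model only sees them if they are measured on the raw text first. The sketch below shows one hypothetical way to do that. The `surface_features` function, its feature names, and the URL pattern are assumptions for illustration and are not stated to be part of SpamDetection.py.

```python
import re

# Rough URL matcher for illustration; real-world URL detection is more involved.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def surface_features(raw_text: str) -> dict:
    """Measure simple surface cues on the raw, unprocessed email text."""
    letters = [ch for ch in raw_text if ch.isalpha()]
    upper = sum(ch.isupper() for ch in letters)
    return {
        # Share of letters that are uppercase (0.0 if there are no letters).
        "caps_ratio": upper / len(letters) if letters else 0.0,
        "exclamation_count": raw_text.count("!"),
        "url_count": len(URL_PATTERN.findall(raw_text)),
    }


features = surface_features("WIN NOW!!! http://example.com")
# features == {"caps_ratio": 0.3, "exclamation_count": 3, "url_count": 1}
```

Features like these could be appended to the TF-IDF vector as extra columns if one wanted the classifier to use them.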

How the model works:

  1. Email text is preprocessed (cleaned and normalized)
  2. TF-IDF converts text into numerical features
  3. Classifier predicts spam/ham based on word patterns
  4. High-confidence predictions flag suspicious emails
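The four steps above can be sketched end to end as follows. This is an illustrative pipeline on a toy corpus, not the project's trained model: the training sentences, the helper names `spam_probability` and `is_flagged`, and the `SPAM_THRESHOLD` value of 0.8 are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data for illustration only; the project trains on spam.csv.
train_texts = [
    "win a free prize now",
    "click here to claim your free money",
    "urgent verify your account now",
    "congratulations you won click to claim",
    "are we still on for the meeting",
    "thanks for sending the report",
    "see you at lunch tomorrow",
    "please review the attached report",
]
train_labels = ["spam"] * 4 + ["ham"] * 4

# Steps 1-3: TfidfVectorizer lowercases and tokenizes, then Naive Bayes classifies.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

SPAM_THRESHOLD = 0.8  # hypothetical cut-off for "high confidence"


def spam_probability(email_text: str) -> float:
    """Return the model's estimated probability that the email is spam."""
    spam_index = list(model.classes_).index("spam")
    return float(model.predict_proba([email_text])[0][spam_index])


def is_flagged(email_text: str, threshold: float = SPAM_THRESHOLD) -> bool:
    """Step 4: flag only emails whose spam probability reaches the threshold."""
    return spam_probability(email_text) >= threshold


p_spammy = spam_probability("free prize click now")
p_hammy = spam_probability("thanks for the meeting report")
```

Raising the threshold trades recall for precision: fewer legitimate emails are flagged, but more spam slips through.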

👨‍💻 About the Author

Nikhil More
B.Tech CSE (AI/ML), University of Mumbai (2023–2027)

Data Science Intern @ Oasis Infobyte | C-DAC Ambassador | Google Student Ambassador


📄 License

This project is licensed under the MIT License.


If you found this project useful, please give it a star!
