# SMS Spam Classification
This project implements and compares different Natural Language Processing (NLP) techniques for SMS spam classification using various text vectorization methods and machine learning models.
## Project Overview
The project explores three different approaches to SMS spam classification:
- Bag of Words (BOW)
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Word2Vec with Average Word Embeddings
## Dataset
The project uses the SMS Spam Collection dataset, which contains SMS messages labeled as either spam or ham (legitimate messages). The dataset is loaded from the `SMSSpamCollection` file with two columns:
- `label`: indicates whether the message is spam or ham
- `message`: the actual text content of the SMS
## Implementation Details
### Common Preprocessing Steps
- Text cleaning using regular expressions
- Conversion to lowercase
- Removal of stopwords
- Text normalization (stemming/lemmatization)
### 1. Bag of Words (BOW) Approach
- Uses `CountVectorizer` from scikit-learn
- Features:
  - Maximum of 2,500 features
  - Unigrams and bigrams (`ngram_range=(1, 2)`)
- Model: Multinomial Naive Bayes classifier
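A minimal sketch of this pipeline, using the same vectorizer settings as the project. The toy messages and labels below are illustrative placeholders, not samples from the SMS Spam Collection:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the preprocessed SMS corpus (1 = spam, 0 = ham)
messages = ["win a free prize now", "are we meeting for lunch",
            "free entry claim your prize", "see you at home tonight"]
labels = [1, 0, 1, 0]

# Capped vocabulary with unigrams and bigrams, as described above
vectorizer = CountVectorizer(max_features=2500, ngram_range=(1, 2))
X = vectorizer.fit_transform(messages)

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["claim your free prize"])))  # expect spam (1)
```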
### 2. TF-IDF Approach
- Uses `TfidfVectorizer` from scikit-learn
- Features:
  - Maximum of 2,500 features
  - Unigrams and bigrams (`ngram_range=(1, 2)`)
- Model: Multinomial Naive Bayes classifier
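The TF-IDF variant differs from the BOW pipeline only in the vectorizer, which reweights counts by inverse document frequency. A sketch with the same hypothetical toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the preprocessed SMS corpus (1 = spam, 0 = ham)
messages = ["win a free prize now", "are we meeting for lunch",
            "free entry claim your prize", "see you at home tonight"]
labels = [1, 0, 1, 0]

# TfidfVectorizer swapped in for CountVectorizer; same settings otherwise
vectorizer = TfidfVectorizer(max_features=2500, ngram_range=(1, 2))
X = vectorizer.fit_transform(messages)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["claim your free prize"])))
```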
### 3. Word2Vec with Average Word Embeddings
- Uses the gensim `Word2Vec` model
- Features:
  - Custom Word2Vec model trained on the SMS dataset
  - Average of word vectors as the sentence representation
  - Vectors standardized to 100 dimensions
- Model: Random Forest classifier
## Project Structure
```
.
├── 1_Spam_Classification_using_BOW.ipynb                   # BOW implementation
├── 2_Spam_Classification_using_TF-IDF.ipynb                # TF-IDF implementation
├── 3_Spam_Classification_using_Word2Vec_AvgWord2Vec.ipynb  # Word2Vec implementation
└── README.md
```
## Requirements
The project requires the following Python libraries:
- pandas
- numpy
- scikit-learn
- nltk
- gensim
- tqdm
## Model Performance
Each approach was evaluated using standard classification metrics including accuracy, precision, recall, and F1-score. The models were trained on 80% of the data and tested on the remaining 20%.
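The evaluation setup can be sketched as below. The random count matrix and labels are synthetic placeholders standing in for the vectorized messages; only the 80/20 split and the metric calls mirror the project's setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic stand-ins for vectorized messages (X) and spam/ham labels (y)
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 20))
y = (X[:, 0] > 0).astype(int)

# 80% train / 20% test, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = MultinomialNB().fit(X_train, y_train)
pred = clf.predict(X_test)

# The four metrics used to compare the approaches
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("f1       :", f1_score(y_test, pred))
```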
## Usage
1. Install the required packages:

   ```bash
   pip install pandas numpy scikit-learn nltk gensim tqdm
   ```

2. Download the required NLTK data:

   ```python
   import nltk
   nltk.download('stopwords')
   nltk.download('wordnet')
   nltk.download('punkt_tab')
   ```

3. Run the Jupyter notebooks in sequence to compare the approaches:
   - Start with the BOW implementation
   - Move on to the TF-IDF implementation
   - Finish with the Word2Vec implementation
