SMS Spam Classification

Posted Nov 1, 2025

This project implements and compares three text vectorization methods for SMS spam classification, each paired with a machine learning classifier.

Project Overview

The project explores three different approaches to SMS spam classification:

Bag of Words (BOW)
Term Frequency-Inverse Document Frequency (TF-IDF)
Word2Vec with Average Word Embeddings

Dataset

The project uses the SMS Spam Collection dataset, which contains SMS messages labeled as either spam or ham (legitimate messages). The dataset is loaded from the file SMSSpamCollection and has two columns:

label: Indicates whether the message is spam or ham
message: The actual text content of the SMS
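
A minimal loading sketch, assuming the file follows the tab-separated, header-less layout of the public SMS Spam Collection release. The helper name load_sms and the quoting setting are illustrative choices, not taken from the notebooks.

```python
import csv
import io

import pandas as pd


def load_sms(source):
    """Read label/message pairs from a tab-separated file or file-like object."""
    return pd.read_csv(
        source,
        sep="\t",
        names=["label", "message"],  # assumption: the file has no header row
        quoting=csv.QUOTE_NONE,      # treat quote characters in messages as plain text
    )


# Tiny in-memory stand-in for the real file, for illustration only.
sample = io.StringIO("ham\tSee you at 5?\nspam\tWIN a \"free\" prize now\n")
df = load_sms(sample)
print(df["label"].value_counts())
```

For the real data, the call would be load_sms("SMSSpamCollection"), with the path adjusted to wherever the file is stored.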

Implementation Details

Common Preprocessing Steps

Text cleaning using regular expressions
Conversion to lowercase
Removal of stopwords
Text normalization (stemming/lemmatization)
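
The four steps above can be sketched as one small function. This is an illustration under assumptions: the function name clean_message, the letters-only regular expression, and the pluggable stopwords and normalize parameters are hypothetical, and the notebooks may differ in detail.

```python
import re


def clean_message(text, stopwords=frozenset(), normalize=lambda word: word):
    """Clean one SMS: strip non-letters, lowercase, drop stopwords, normalize."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # regex cleaning
    tokens = letters_only.lower().split()           # lowercase and tokenize
    kept = [normalize(t) for t in tokens if t not in stopwords]
    return " ".join(kept)


# Hand-picked stopwords, for illustration only.
demo_stopwords = {"you", "have", "a"}
print(clean_message("WINNER!! You have won a £1000 prize", demo_stopwords))
```

With NLTK, the stopword set could come from set(nltk.corpus.stopwords.words("english")), and normalize could be PorterStemmer().stem for stemming or WordNetLemmatizer().lemmatize for lemmatization.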

1. Bag of Words (BOW) Approach

Uses CountVectorizer from scikit-learn
Features:
- Maximum 2,500 features
- Includes unigrams and bigrams (ngram_range=(1,2))
Model: Multinomial Naive Bayes classifier

2. TF-IDF Approach

Uses TfidfVectorizer from scikit-learn
Features:
- Maximum 2,500 features
- Includes unigrams and bigrams (ngram_range=(1,2))
Model: Multinomial Naive Bayes classifier
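
The TF-IDF variant differs from BOW only in the vectorizer: raw counts are replaced by weights that discount terms appearing in many messages. A minimal sketch on the same kind of toy data, with illustrative variable names that are not taken from the notebook:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus of already-cleaned messages, for illustration only.
messages = [
    "free prize call now",
    "win cash prize now",
    "see you at lunch",
    "call me after lunch",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer(max_features=2500, ngram_range=(1, 2))
X = vectorizer.fit_transform(messages)  # sparse matrix of TF-IDF weights

# With scikit-learn's default settings, each non-empty row has unit L2 norm.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)

tfidf_model = MultinomialNB().fit(X, labels)
print(tfidf_model.predict(vectorizer.transform(["free cash prize"])))
```

Multinomial Naive Bayes is formally defined for counts, but scikit-learn accepts the fractional TF-IDF weights as well.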

3. Word2Vec with Average Word Embeddings

Uses gensim Word2Vec model
Features:
- Custom trained Word2Vec model on the SMS dataset
- Average word vectors for sentence representation
- Every message is represented by a 100-dimensional vector
Model: Random Forest Classifier
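
The averaging step can be sketched independently of the embedding model. The function below assumes a lookup object that supports `in` and indexing, which holds for a plain dict and for the `wv` attribute of a trained gensim Word2Vec model. The function name and the zero-vector fallback for messages with no known words are assumptions, not necessarily what the notebook does.

```python
import numpy as np


def average_word_vector(tokens, vectors, dim=100):
    """Mean of the vectors of all known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Assumption: fall back to a zero vector so every message has length dim.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


# Stand-in lookup table with 3-dimensional vectors, for illustration only.
toy_vectors = {
    "free": np.array([1.0, 0.0, 2.0]),
    "prize": np.array([3.0, 2.0, 0.0]),
}
print(average_word_vector(["free", "prize", "unknownword"], toy_vectors, dim=3))
```

With gensim, the lookup could come from a model such as Word2Vec(tokenized_messages, vector_size=100, min_count=1), passing model.wv as the vectors argument (the min_count value here is an assumption). The stacked message vectors would then be the input to RandomForestClassifier.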

Project Structure

.
├── 1_Spam_Classification_using_BOW.ipynb        # BOW implementation
├── 2_Spam_Classification_using_TF-IDF.ipynb     # TF-IDF implementation
├── 3_Spam_Classification_using_Word2Vec_AvgWord2Vec.ipynb  # Word2Vec implementation
└── README.md

Requirements

The project requires the following Python libraries:

pandas
numpy
scikit-learn
nltk
gensim
tqdm

Model Performance

Each approach was evaluated using standard classification metrics including accuracy, precision, recall, and F1-score. The models were trained on 80% of the data and tested on the remaining 20%.
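
A sketch of the split and the metrics, assuming spam is treated as the positive class. The helper names, the fixed random seed, and the stratified split are assumptions for illustration; the labels below are made up and are not results from the notebooks.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split


def split_80_20(messages, labels, seed=42):
    """Hold out 20% of the data for testing."""
    return train_test_split(
        messages,
        labels,
        test_size=0.2,
        random_state=seed,
        stratify=labels,  # assumption: keep the spam/ham ratio equal in both parts
    )


def spam_metrics(y_true, y_pred):
    """Accuracy plus precision, recall and F1 with spam as the positive class."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, pos_label="spam", average="binary"
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up true and predicted labels, for illustration only.
y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]
print(spam_metrics(y_true, y_pred))
```

Because spam is typically the minority class in SMS data, precision and recall on the spam class tend to say more than accuracy alone.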

Usage

Install the required packages:

pip install pandas numpy scikit-learn nltk gensim tqdm

Download required NLTK data:

  
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

Run the Jupyter notebooks in sequence to compare different approaches:
- Start with BOW implementation
- Move to TF-IDF implementation
- Finally, try the Word2Vec implementation

This post is licensed under CC BY 4.0 by the author.