Kindle Review Sentiment Analysis

Posted Nov 1, 2025

This project performs sentiment analysis on Amazon Kindle Store reviews using Natural Language Processing (NLP) techniques. It compares three text vectorization approaches, Bag of Words (BOW), TF-IDF, and Word2Vec, each paired with a Gaussian Naive Bayes classifier that labels a review as positive or negative.

About the Dataset

Context

This dataset is a curated subset of Amazon product reviews from the Kindle Store category. It captures customer opinions, product ratings, and review behavior over nearly two decades (May 1996 to July 2014).

Content

Part of Amazon’s 5-core collection (each product and reviewer has at least 5 reviews)
Total entries: 982,619
Time span: May 1996 to July 2014

Columns

Column	Description
`asin`	Unique product ID (e.g., B000FA64PK)
`helpful`	Helpfulness rating of the review (e.g., 2/3)
`overall`	Overall product rating (numeric)
`reviewText`	Full text of the review
`reviewTime`	Original review date
`reviewerID`	Unique reviewer ID
`reviewerName`	Name of the reviewer
`summary`	Short summary/title of the review
`unixReviewTime`	Review timestamp (Unix format)

Implementation Details

Data Preprocessing

Text Cleaning:
- Converting text to lowercase
- Removing special characters
- Removing URLs and HTML tags
- Removing stopwords
- Applying lemmatization
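
The cleaning steps above can be sketched as a single function. This is a minimal sketch, not the project's actual code: the `clean_review` name, the regular expressions, and the tiny `DEMO_STOPWORDS` set are assumptions chosen so the example runs without any downloads. The tag regex is a simplification of what BeautifulSoup would do.

```python
import re

# Tiny illustrative stopword set (assumption); NLTK's English list is much larger.
DEMO_STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "to", "of", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
SPECIAL_RE = re.compile(r"[^a-z0-9\s]")


def clean_review(text, stopwords=DEMO_STOPWORDS, lemmatize=lambda word: word):
    """Lowercase, strip URLs, HTML tags and special characters,
    drop stopwords, then lemmatize the remaining tokens."""
    text = text.lower()
    # URLs and tags go first: their patterns rely on characters such as
    # '/', ':', '<' and '>' that the special-character step would delete.
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    tokens = [lemmatize(word) for word in text.split() if word not in stopwords]
    return " ".join(tokens)


cleaned = clean_review("This is a <b>GREAT</b> book! See http://example.com/x")
```

To use the NLTK resources installed for this project, one could pass `stopwords=set(nltk.corpus.stopwords.words("english"))` and `lemmatize=nltk.stem.WordNetLemmatizer().lemmatize` instead of the defaults.
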
Feature Engineering:
- Binary sentiment classification (ratings < 3 are negative, ≥ 3 are positive)
- Text vectorization using three different approaches:
  - Bag of Words (BOW)
  - TF-IDF (Term Frequency-Inverse Document Frequency)
  - Word2Vec with averaged word embeddings
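
"Averaged word embeddings" means each review is represented by the mean of the vectors of its words. The sketch below illustrates that idea with a plain dictionary of made-up 2-dimensional vectors. The `average_embedding` helper and the zero-vector fallback for reviews with no known words are assumptions, not necessarily what the project does. The same function should also work with a trained gensim model's `model.wv`, since it only needs `in` and indexing.

```python
import numpy as np


def average_embedding(tokens, vectors, dim):
    """Return the mean vector of the tokens found in `vectors`.

    Falls back to a zero vector when no token is known, so every
    review still gets a fixed-length representation.
    """
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy 2-dimensional "embeddings" (values made up for illustration).
toy_vectors = {
    "great": np.array([1.0, 0.0]),
    "book": np.array([0.0, 1.0]),
}

# "unseen" has no vector and is skipped, so this averages two vectors.
doc_vector = average_embedding(["great", "book", "unseen"], toy_vectors, dim=2)
```

Averaging discards word order, which keeps the features compact (one vector per review) at the cost of losing phrases such as negations.
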

Model Architecture

The project implements and compares three approaches:

BOW + Gaussian Naive Bayes
TF-IDF + Gaussian Naive Bayes
Word2Vec + Gaussian Naive Bayes
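
As a rough illustration of the first two pipelines, the sketch below pairs scikit-learn's `CountVectorizer` and `TfidfVectorizer` with `GaussianNB`. The helper names and the four toy reviews are assumptions for this example and are not taken from the project. One detail worth noting is that `GaussianNB` does not accept sparse input, so the vectorizer output is converted to a dense array first.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import GaussianNB


def fit_nb(vectorizer, train_texts, train_labels):
    """Fit the vectorizer and a Gaussian Naive Bayes model on cleaned texts."""
    # GaussianNB needs dense features, hence .toarray().
    features = vectorizer.fit_transform(train_texts).toarray()
    return GaussianNB().fit(features, train_labels)


def predict_nb(vectorizer, model, texts):
    """Vectorize texts with the already fitted vocabulary and predict labels."""
    return model.predict(vectorizer.transform(texts).toarray())


# Toy, already cleaned reviews (made up): 1 = positive, 0 = negative.
texts = [
    "great book loved it",
    "wonderful story great characters",
    "terrible boring book",
    "awful plot boring characters",
]
labels = [1, 1, 0, 0]

bow = CountVectorizer()
bow_model = fit_nb(bow, texts, labels)
bow_preds = predict_nb(bow, bow_model, texts)

tfidf = TfidfVectorizer()
tfidf_model = fit_nb(tfidf, texts, labels)
tfidf_preds = predict_nb(tfidf, tfidf_model, texts)
```

On a real corpus the dense matrix can become very large. Capping the vocabulary with the vectorizer's `max_features` parameter is one way to keep memory use manageable.
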

Performance Results

Model accuracies on the test set (20% of data):

BOW Model: Accuracy of 59%
TF-IDF Model: Accuracy of 59%
Word2Vec Model: Accuracy of 75%

Requirements

Python 3.x
pandas
numpy
scikit-learn
nltk
gensim
BeautifulSoup4
tqdm

Setup and Installation

  
# Install required packages
pip install pandas numpy scikit-learn nltk gensim beautifulsoup4 tqdm

# Download required NLTK data
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('punkt_tab')"

Usage

Load the data:

  
import pandas as pd
data = pd.read_csv('all_kindle_review.csv')
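
After loading, the frame can be reduced to the two columns the classifier needs and given a binary label following the rule described earlier (ratings below 3 are negative, 3 and above are positive). The following is a minimal sketch. The `prepare_frame` helper and the `sentiment` column name are assumptions, and a small in-memory frame stands in for the CSV so the example is self-contained.

```python
import pandas as pd


def prepare_frame(df):
    """Keep the text and rating columns, drop unusable rows, add a binary label."""
    out = df[["reviewText", "overall"]].dropna()
    out = out.drop_duplicates().reset_index(drop=True)
    # 1 = positive (rating >= 3), 0 = negative (rating < 3)
    out["sentiment"] = (out["overall"] >= 3).astype(int)
    return out


# Tiny stand-in for the CSV (rows made up): one duplicate, one missing text.
sample = pd.DataFrame({
    "reviewText": ["Loved it", "Loved it", None, "Dull and slow"],
    "overall": [5, 5, 4, 2],
})
prepared = prepare_frame(sample)
```

With the real file, the same call would be `prepare_frame(data)` on the frame returned by `pd.read_csv` above.
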

Perform text preprocessing:
- Lowercase conversion
- Special character removal
- Stopword removal
- Lemmatization
Choose a vectorization method (BOW, TF-IDF, or Word2Vec)
Train the model and evaluate results

Best Practices

Data Preprocessing & Cleaning:
- Handle missing values and duplicates
- Normalize text
- Apply lemmatization
Train-Test Split:
- Use 80-20 split
- Consider stratified sampling so the training and test sets keep the same class proportions
Feature Extraction:
- Compare different vectorization methods
- Consider dimensionality reduction if needed
Model Training & Evaluation:
- Use evaluation metrics suited to the class balance (e.g., precision, recall, and F1 alongside accuracy)
- Consider cross-validation
- Compare model performances
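
The split and evaluation practices above can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions: the feature matrix is synthetic random data standing in for vectorized reviews, and the 80/20 class imbalance, random seeds, and the `f1_macro` scoring choice are arbitrary picks for the example.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for vectorized reviews: 200 samples, 5 features,
# 160 positives and 40 negatives (all values are made up).
rng = np.random.default_rng(0)
y = np.array([1] * 160 + [0] * 40)
X = rng.normal(size=(200, 5)) + y[:, None]  # shift positives so classes differ

# 80-20 split; stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5-fold stratified cross-validation on the training part only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(GaussianNB(), X_train, y_train, cv=cv, scoring="f1_macro")

# Final fit and a per-class precision/recall/F1 report on the held-out part.
model = GaussianNB().fit(X_train, y_train)
report = classification_report(y_test, model.predict(X_test), zero_division=0)
print(report)
```

Reporting per-class metrics matters here because, when most reviews are positive, a model can reach a high accuracy while still misclassifying many negative reviews.
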

Acknowledgements

Dataset source: Amazon Product Data compiled by Julian McAuley and team at UC San Diego (UCSD)
All rights and licenses belong to the original authors

This post is licensed under CC BY 4.0 by the author.