10.2 Data
Load the packages as follows; pacman::p_load() installs any that are missing. If qdap fails to install in the classroom, ignore it for the time being.
pacman::p_load(dplyr,
               ggplot2,
               tm,          # For text mining
               topicmodels, # For LDA
               qdap,        # For some text cleaning
               caret        # For random forest
               )
We will use a Kaggle dataset, provided by Datafiniti, consisting of 34,000 reviews of Amazon products such as the Kindle and Fire TV Stick:
https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products
Download the 1429_1.csv file from Kaggle. You can also download this file from my GitHub repository: https://github.com/ashgreat/datasets/blob/master/1429_1.csv.zip. As the file is 49 MB, I recommend downloading the zip file to your computer first and then reading it from there.
reviews <- read.csv("1429_1.csv",
stringsAsFactors = FALSE)
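If you downloaded the zip archive rather than the plain CSV, the file can also be read without extracting it by hand. The following is a minimal sketch, not part of the original workflow: read_zipped_csv is a hypothetical helper, and the archive name and location are assumptions you should adjust to match your download.

```r
# Hypothetical helper (not from the original text): read the first CSV found
# inside a zip archive without extracting it to disk.
read_zipped_csv <- function(zip_path) {
  # List the archive contents instead of guessing the inner file name
  entries <- unzip(zip_path, list = TRUE)$Name
  # Keep CSV entries, skipping the metadata folder that macOS adds to archives
  csv_entries <- entries[grepl("\\.csv$", entries) & !grepl("^__MACOSX/", entries)]
  stopifnot(length(csv_entries) >= 1)
  # unz() opens a read connection to a single file inside the archive
  read.csv(unz(zip_path, csv_entries[1]), stringsAsFactors = FALSE)
}

# Assumed file name and location: adjust to wherever you saved the download
zip_path <- "1429_1.csv.zip"
if (file.exists(zip_path)) {
  reviews <- read_zipped_csv(zip_path)
}
```

Either route should give the same reviews data frame; the helper only saves the manual unzip step.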
Take a look at the column names:
names(reviews)
## [1] "id" "name" "asins"
## [4] "brand" "categories" "keys"
## [7] "manufacturer" "reviews.date" "reviews.dateAdded"
## [10] "reviews.dateSeen" "reviews.didPurchase" "reviews.doRecommend"
## [13] "reviews.id" "reviews.numHelpful" "reviews.rating"
## [16] "reviews.sourceURLs" "reviews.text" "reviews.title"
## [19] "reviews.userCity" "reviews.userProvince" "reviews.username"
We are interested in reviews.text. Next, we will delete all rows where the review has fewer than 20 words. You can change this threshold depending on your application.
Don’t run this chunk if you could not install qdap.
reviews <- reviews %>%
filter(qdap::word_count(.$reviews.text, byrow = TRUE) >= 20)
We are left with 18,123 rows.
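If qdap could not be installed, a rough base R substitute can be used in place of the chunk above. This is only a sketch: count_words is a hypothetical helper that counts whitespace-separated tokens, which is not how qdap::word_count tokenizes text, so the number of remaining rows may differ somewhat from 18,123.

```r
# Hypothetical helper (an assumption, not qdap's method): count
# whitespace-separated tokens in each element of a character vector.
count_words <- function(x) {
  x <- trimws(as.character(x))
  n <- lengths(strsplit(x, "\\s+"))
  # Treat missing and empty reviews as having zero words
  n[is.na(x) | x == ""] <- 0L
  n
}

# Quick check on a toy vector
count_words(c("Great tablet", "", NA, "  Works well for reading  "))

# Applied to the reviews data frame loaded earlier:
# keep only reviews with at least 20 words
if (exists("reviews")) {
  reviews <- reviews[count_words(reviews$reviews.text) >= 20, ]
}
```

Base subsetting is used here so that the sketch runs without any extra packages; the same condition also works inside dplyr's filter().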