10.2 Data

Load/install the packages as follows.

In the classroom, if qdap fails to install, ignore it for the time being.

pacman::p_load(dplyr, 
               ggplot2, 
               tm, # For text mining
               topicmodels, # For LDA
               qdap, # For some text cleaning
               caret # For random forest
               )
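Because qdap may not install on every machine, it can help to check which packages are actually available before moving on. The following is a minimal sketch in base R; the names `pkgs`, `installed`, and `has_qdap` are introduced here for illustration and are not part of the original code.

```r
# Packages used in this section
pkgs <- c("dplyr", "ggplot2", "tm", "topicmodels", "qdap", "caret")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Named logical vector: TRUE means the package can be loaded
print(installed)

# Illustrative flag that later chunks could consult before calling qdap
has_qdap <- unname(installed["qdap"])
```

If `has_qdap` is `FALSE`, skip any chunk that calls a `qdap::` function.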

We will use a Kaggle dataset, provided by Datafiniti, consisting of 34,000 consumer reviews of Amazon products such as the Kindle and the Fire TV Stick:

https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products

Download the 1429_1.csv file from Kaggle. You can also download this file from my GitHub repository: https://github.com/ashgreat/datasets/blob/master/1429_1.csv.zip. As the file size is 49 MB, I recommend downloading the zip file to your computer first, unzipping it, and then reading the CSV from the local copy.

reviews <- read.csv("1429_1.csv",
                    stringsAsFactors = FALSE)

Take a look at the column names:

names(reviews)
##  [1] "id"                   "name"                 "asins"               
##  [4] "brand"                "categories"           "keys"                
##  [7] "manufacturer"         "reviews.date"         "reviews.dateAdded"   
## [10] "reviews.dateSeen"     "reviews.didPurchase"  "reviews.doRecommend" 
## [13] "reviews.id"           "reviews.numHelpful"   "reviews.rating"      
## [16] "reviews.sourceURLs"   "reviews.text"         "reviews.title"       
## [19] "reviews.userCity"     "reviews.userProvince" "reviews.username"

We are interested in reviews.text. Next, we will delete all rows in which the review has fewer than 20 words. You can change this threshold to something else depending on your application.

Don’t run this chunk if you could not install qdap.

reviews <- reviews %>% 
  filter(qdap::word_count(.$reviews.text, byrow = TRUE) >= 20)

We are left with 18,123 rows.
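If qdap is not available, a rough substitute for the word-count filter can be written in base R. This is only a sketch: `count_words` is a hypothetical helper, not a qdap function, and it simply counts whitespace-separated tokens. Because qdap::word_count may treat punctuation and other characters differently, the number of rows kept this way may not match the figure above exactly.

```r
# Hypothetical helper (an assumption, not part of qdap):
# counts whitespace-separated tokens in each element of x.
count_words <- function(x) {
  x <- trimws(as.character(x))
  n <- lengths(strsplit(x, "\\s+"))
  # Empty strings and missing values count as zero words
  n[is.na(x) | x == ""] <- 0L
  n
}

# Small made-up data frame to illustrate the filter
demo <- data.frame(
  reviews.text = c("Great tablet", 
                   "I bought this for my kids and they use it every single day",
                   NA),
  stringsAsFactors = FALSE
)

# Keep rows with at least 5 words (the section above uses 20)
demo_kept <- demo[count_words(demo$reviews.text) >= 5, , drop = FALSE]
```

On the real data, the same idea would read `reviews <- reviews[count_words(reviews$reviews.text) >= 20, ]`.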