10.3 Pre-processing

We will extract only the review text.

review_text <- reviews %>% pull(reviews.text)
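To make the extraction step concrete, here is a minimal sketch of what pull() returns. The toy_reviews data frame and its contents are hypothetical stand-ins for the real reviews data; only the reviews.text column name comes from the code above.

```r
library(dplyr)

# Hypothetical two-row stand-in for the `reviews` data frame
toy_reviews <- tibble(
  reviews.rating = c(5, 2),
  reviews.text   = c("Great tablet for the price.", "Battery died in a week.")
)

# pull() extracts a single column as a plain vector rather than a data frame
toy_text <- toy_reviews %>% pull(reviews.text)

is.character(toy_text)  # TRUE
length(toy_text)        # 2
```

A plain vector is what VectorSource() expects later, when we build the corpus.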

10.3.1 Get stop words

I have compiled a list of 1,182 stop words from various sources. You can download it from my GitHub repository: https://github.com/ashgreat/datasets/blob/master/my-stopwords.rds?raw=true

my_stopwords <- readRDS(gzcon(url("http://bit.ly/2X1Wv2x")))

Take a look at some of the stop words:

head(my_stopwords, 10)
##  [1] "a"       "about"   "above"   "after"   "again"   "against" "ain"    
##  [8] "all"     "am"      "an"
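To see what removing stop words amounts to, the following base R sketch filters the tokens of one sentence against a small list. Both toy_stopwords and the sentence are made up for illustration and are not taken from my_stopwords.

```r
# Hypothetical miniature stop word list (not the 1,182-word list above)
toy_stopwords <- c("a", "about", "the", "is", "of")

# Split a sentence into tokens on single spaces
tokens <- strsplit("the price of the tablet is about right", " ")[[1]]

# Keep only the tokens that do not appear in the stop word list
kept <- tokens[!tokens %in% toy_stopwords]
kept  # "price" "tablet" "right"
```

The words that survive carry most of the meaning, which is why we drop the rest before any modeling.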

10.3.2 tm package

tm is a powerful package for text processing. To use it, we first create a corpus from all the documents. Next, we use the tm_map function to convert the text to lowercase and to remove stop words (plus the word "amazon"), numbers, extra white space, and punctuation. Finally, we stem the words.

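rev_corpus <- tm::Corpus(VectorSource(review_text)) %>%          # one document per review
  tm_map(content_transformer(tolower)) %>%                       # convert to lowercase
  tm_map(removeWords, c(my_stopwords, "amazon")) %>%             # drop stop words and "amazon"
  tm_map(removeNumbers) %>%                                      # drop digits
  tm_map(stripWhitespace) %>%                                    # collapse repeated white space
  tm_map(removePunctuation, preserve_intra_word_dashes = TRUE) %>%  # keep dashes inside words
  tm_map(stemDocument)                                           # reduce words to their stems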
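To check what each step does, it can help to run the same pipeline on a couple of short documents and print the result. The sketch below is only an illustration: toy_text and toy_stop are hypothetical, and it assumes the tm, magrittr, and SnowballC packages are installed (stemDocument relies on SnowballC). Depending on your version of tm, tm_map may print a warning about dropped documents here.

```r
library(tm)
library(magrittr)  # provides %>%

# Hypothetical documents and stop words, for illustration only
toy_text <- c("The Amazon tablet costs 50 dollars, and it is well-made!",
              "Charging was slow; returning it.")
toy_stop <- c("the", "and", "it", "is", "was")

toy_corpus <- Corpus(VectorSource(toy_text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, c(toy_stop, "amazon")) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation, preserve_intra_word_dashes = TRUE) %>%
  tm_map(stemDocument)

# content() returns the processed text of each document
cleaned <- content(toy_corpus)
print(cleaned)
```

Comparing cleaned with toy_text shows the lowercasing, the removed digits and punctuation, and the stemmed word endings side by side.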