10.3 Pre-processing
We will extract only the review text.
review_text <- reviews %>% pull(reviews.text)
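The corpus built later in this section is created from a plain character vector, which is why pull() is used here instead of select(). The following is a minimal sketch of the difference on a hypothetical two-row data frame; toy_reviews and its contents are made up for illustration, and only the column name reviews.text comes from the text above.

```r
library(dplyr)

# Hypothetical stand-in for the reviews data (not the real data set)
toy_reviews <- data.frame(
  reviews.rating = c(5, 3),
  reviews.text = c("Great tablet", "Average battery"),
  stringsAsFactors = FALSE
)

# pull() returns the column itself as a character vector
toy_text <- toy_reviews %>% pull(reviews.text)
is.character(toy_text)

# select() keeps the data frame structure around the column
toy_df <- toy_reviews %>% select(reviews.text)
is.data.frame(toy_df)
```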
10.3.1 Get stop words
I have compiled a list of 1,182 stop words from various sources. You can download it from my GitHub repository: https://github.com/ashgreat/datasets/blob/master/my-stopwords.rds?raw=true
my_stopwords <- readRDS(gzcon(url("http://bit.ly/2X1Wv2x")))
Take a look at the first few stop words:
head(my_stopwords, 10)
## [1] "a" "about" "above" "after" "again" "against" "ain"
## [8] "all" "am" "an"
10.3.2 tm package
tm is a powerful package for text processing. To use it, we first create a corpus of all the documents. Next, we use the tm_map function to convert the text to lower case and to remove stop words, numbers, extra white space, and punctuation. Finally, we stem the words.
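rev_corpus <- tm::Corpus(VectorSource(review_text)) %>%             # one document per review
  tm_map(content_transformer(tolower)) %>%                          # lower-case so stop words match
  tm_map(removeWords, c(my_stopwords, "amazon")) %>%                # drop stop words and "amazon"
  tm_map(removeNumbers) %>%                                         # drop digits
  tm_map(stripWhitespace) %>%                                       # collapse repeated white space
  tm_map(removePunctuation, preserve_intra_word_dashes = TRUE) %>%  # keep dashes inside words
  tm_map(stemDocument)                                              # reduce words to their stems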
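To see what these steps do, it can help to run the same pipeline on a couple of short sentences and print the result. The sketch below makes some assumptions: toy_text and toy_stopwords are hypothetical and much smaller than the real data and stop word list, the SnowballC package is installed (stemDocument relies on it), and content() is used to get the cleaned text back as a character vector.

```r
library(tm)        # Corpus(), tm_map() and the transformations
library(magrittr)  # provides the %>% pipe

# Hypothetical toy reviews and stop words, for illustration only
toy_text <- c("I LOVE this tablet, bought 2 of them!",
              "The battery-life is great and charging is fast.")
toy_stopwords <- c("i", "this", "of", "them", "the", "is", "and")

toy_corpus <- Corpus(VectorSource(toy_text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, toy_stopwords) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation, preserve_intra_word_dashes = TRUE) %>%
  tm_map(stemDocument)

# Pull the cleaned text out of the corpus and print it
cleaned <- content(toy_corpus)
print(cleaned)
```

Converting to lower case comes first because removeWords matches words exactly, so a capitalized "The" would not be removed by a lower-case stop word list.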