10.4 Document term matrix
Create a document term matrix (DTM) that shows the frequency of each word in each document. The DTM will have 18,123 rows, one per document, and one column per unique word. The argument bounds in the code below drops terms that appear in fewer documents than the lower bound or in more documents than the upper bound. We use this document-frequency filter here in place of TF-IDF weighting, which is an alternative.
Thus, if a word appears in fewer than 100 documents or in more than 1,000 documents, we will drop it.
dtm <- rev_corpus %>%
  DocumentTermMatrix(control = list(bounds = list(global = c(100, 1000))))
With this, we have just 432 terms left in the DTM. This may be too small a vocabulary for the analysis. If you think so, you can widen the bounds, for example by lowering the minimum number of documents.
dim(dtm)
## [1] 18123 432
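To make the effect of the bounds concrete, here is a minimal base R sketch of the same kind of document-frequency filter on a toy count matrix. The matrix counts, the thresholds lower and upper, and the other names are hypothetical and chosen only for illustration. The sketch assumes the bounds are inclusive, and it is not the code that DocumentTermMatrix() runs internally.

```r
# Toy term counts: 5 documents (rows) x 4 terms (columns). Hypothetical data.
counts <- matrix(
  c(1, 0, 2, 0,
    1, 0, 0, 0,
    3, 1, 0, 0,
    1, 0, 0, 1,
    2, 0, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("the", "rare", "good", "odd"))
)

# Document frequency: number of documents in which each term occurs at least once.
doc_freq <- colSums(counts > 0)

# Keep terms whose document frequency lies within [lower, upper] (assumed inclusive).
lower <- 2
upper <- 4
keep_terms <- doc_freq >= lower & doc_freq <= upper

# drop = FALSE keeps the result a matrix even if only one column survives.
filtered <- counts[, keep_terms, drop = FALSE]
colnames(filtered)
```

In this toy example the term "the" is too common and "rare" and "odd" are too rare, so only "good" survives. Note that some documents now contain no remaining terms at all, which is the situation the next subsection deals with.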
10.4.1 Remove empty documents
After this filtering, a few documents contain none of the remaining terms, so all the frequencies in their rows are 0. We will get rid of these documents. First, we create an index recording which rows to keep. This will be a logical vector.
index_kp <- rowSums(as.matrix(dtm)) > 0
Check how many rows we will keep.
sum(index_kp)
## [1] 17992
So, we are dropping 18,123 - 17,992 = 131 documents. Next, we will subset dtm and review_text so that they each have 17,992 rows/elements.
dtm <- dtm[index_kp, ]
review_text <- review_text[index_kp]
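The same logical index must be applied to both objects, otherwise the rows of the DTM will no longer line up with the review texts. The following base R sketch illustrates the idea on hypothetical toy data (counts, texts, and keep are illustrative names, not objects from this chapter) and adds a sanity check that may also be worth running on the real dtm and review_text.

```r
# Hypothetical toy counts after term filtering: doc2 has lost all its terms.
counts <- matrix(
  c(2, 1,
    0, 0,
    0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("good", "bad"))
)
texts <- c("good good bad", "the a of", "bad bad bad")

# Logical vector with one entry per document: TRUE if the row is non-empty.
keep <- rowSums(counts) > 0

# Subset the matrix and the text vector with the same index.
counts_kept <- counts[keep, , drop = FALSE]
texts_kept <- texts[keep]

# Sanity check: the matrix and the text vector must stay aligned.
stopifnot(nrow(counts_kept) == length(texts_kept))

# Number of documents dropped.
sum(!keep)
```

Computing the index once and reusing it avoids the subtle bug of recomputing it after dtm has already been subset, at which point it would no longer match the original length of review_text.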