10.8 Topic importance

LDA does not tell us how important each topic is, because the model has no notion of importance. However, if some topics are associated with glowing reviews and others with bad reviews, perhaps we can gauge topic importance by how well the topic probabilities predict the review rating.

I will show you how to do this using a Random Forest. Fitting a Random Forest model is likely to take a long time, so we will skip that step in the classroom.
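To make the idea concrete before touching the review data, here is a minimal sketch on synthetic data. It assumes the `randomForest` package is installed (the package choice is my assumption here), and the `toy` data frame, its three `topic` columns, and the rule that generates `rating` are all made up for illustration. Only `topic1` drives the rating, so it should come out as the most important predictor.

```r
# Toy sketch on synthetic data, NOT the review data.
# Assumption: the randomForest package is installed.
library(randomForest)

set.seed(42)
n <- 300

# Three hypothetical "topic probability" columns.
toy <- data.frame(
  topic1 = runif(n),
  topic2 = runif(n),
  topic3 = runif(n)
)

# Made-up rule: the rating depends on topic1 only.
toy$rating <- factor(ifelse(toy$topic1 > 0.5, 5, 1), ordered = TRUE)

# A factor response makes randomForest() fit a classification forest.
fit <- randomForest(rating ~ ., data = toy, ntree = 200, importance = TRUE)

# type = 1: mean decrease in accuracy when a predictor is permuted.
imp <- importance(fit, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

The permutation-based measure asks how much prediction accuracy drops when a predictor's values are shuffled. A topic whose shuffling hurts accuracy the most is the one the ratings depend on most, and that is the sense of "importance" used in this section.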

Before we proceed, we need to create a new data set with the review ratings and topic probabilities. Note that some of the ratings are missing, so I drop those rows. Furthermore, I convert the rating into an ordered factor. This is because the ratings take only 5 distinct values and they are not uniformly distributed, so it is easier to treat this as a classification problem.

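# One row per review, one column per topic probability
theta_rating <- as.data.frame(theta) %>% 
  # Name the 20 topic columns topic1, ..., topic20
  rename_all(~ paste0("topic", 1:20)) %>% 
  # Attach the ratings of the reviews selected by index_kp as an ordered factor
  mutate(rating = factor(reviews$reviews.rating[index_kp],
                         ordered = TRUE)) %>% 
  # Drop the reviews with a missing rating
  filter(!is.na(rating))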
class(theta_rating$rating)
## [1] "ordered" "factor"
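If the ordered-factor conversion is unfamiliar, the following base R sketch shows what it does on a small made-up vector. `raw_rating` is hypothetical and is not the review data. Missing values stay `NA` rather than becoming a level, and the levels carry an order that ordinary factors lack.

```r
# Hypothetical ratings for illustration only, NOT the review data.
raw_rating <- c(5, 4, NA, 1, 5, 3, NA, 5)

rating <- factor(raw_rating, ordered = TRUE)

# Levels are the distinct observed values in increasing order; NA is not a level.
levels(rating)

# Count the missing ratings that a filter on !is.na() would drop.
sum(is.na(rating))

# Ordered factors support comparisons: is a 4 lower than a 5?
rating[2] < rating[1]

# Frequency of each rating among the non-missing values.
table(rating[!is.na(rating)])
```

Running the same `table()` call on `theta_rating$rating` is a quick way to see how unevenly the actual ratings are distributed.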