10.7 Topic distribution

We can get some idea of the incidence of each topic in the corpus by looking at the average of the posterior topic distributions across documents. This is admittedly a crude summary, so we will also plot the full probability distributions.

For this, we first need the posterior distribution of \(\theta\). The function posterior() returns the posteriors of both \(\theta\) and \(\beta\): \(\theta\) holds the per-document topic distributions, while \(\beta\) holds the per-topic distributions over the words or terms.

lda_post <- posterior(lda_model)  # posterior estimates from the fitted LDA model
theta <- lda_post$topics          # per-document topic distributions
beta <- lda_post$terms            # per-topic term distributions
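
As a quick sanity check (a sketch, not part of the original output), we can confirm the shape of these objects: \(\theta\) has one row per document and one column per topic, \(\beta\) has one row per topic and one column per term, and each row of \(\theta\) sums to one.

dim(theta)               # documents x topics
dim(beta)                # topics x terms
summary(rowSums(theta))  # each row of theta is a probability distribution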

Let’s take a look at the average probability of each topic across all the documents.

colMeans(theta)
##          1          2          3          4          5          6 
## 0.05307823 0.04700218 0.05099744 0.04934087 0.04734909 0.05803044 
##          7          8          9         10         11         12 
## 0.04891069 0.04578102 0.04829971 0.04550166 0.04627746 0.05223969 
##         13         14         15         16         17         18 
## 0.05878948 0.04799089 0.05265680 0.04739848 0.05349202 0.04862603 
##         19         20 
## 0.04988862 0.04834919
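
The averages above are printed in topic order; sorting them makes the near-uniformity easier to scan. This is just a convenience step in base R.

round(sort(colMeans(theta), decreasing = TRUE), 3)  # topics ranked by average share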

We see that the average topic probabilities are roughly uniform across the 20 topics: a perfectly even allocation would give each topic an average share of 1/20 = 0.05, and every estimate hovers close to that value. Of course, this average hides the document-level variation, which is easier to see in plots of the probability distributions. The facet plot below shows a histogram of each topic’s probabilities across documents.

theta %>% 
  as.data.frame() %>% 
  rename_all(~ paste0("topic", 1:20)) %>%  # label the 20 topic columns
  reshape2::melt() %>%                     # long format: one row per document-topic pair
  ggplot(aes(value)) +
  geom_histogram() +
  scale_x_continuous(limits = c(0, 0.8)) +
  scale_y_continuous(limits = c(0, 13000)) +
  facet_wrap(~ variable, scales = "free") +
  labs(x = "Topic Probability", y = "Frequency") +
  theme_minimal()

We don’t see much variation in the shapes of these probability distributions. However, the plot doesn’t tell us which topics dominate which documents. If some topics relate closely to particular kinds of reviews, we may be able to learn how those topics relate to the reviewer ratings. We turn to that next.
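
Before doing so, a quick way to see which topic is most prominent in each document is to pick, for every row of \(\theta\), the topic with the largest posterior probability (the topics() helper in topicmodels gives the same answer). The snippet below is only a sketch; dominant_topic is an illustrative name.

dominant_topic <- apply(theta, 1, which.max)  # most probable topic per document
table(dominant_topic)                         # number of documents each topic dominates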