10.1 Latent Dirichlet Allocation
Imagine that you are in your dentist’s waiting room. You pick up a magazine and casually start browsing it. While eyeballing text on random pages, you come across the following passage:
Williamson and Taylor then began to reap the rewards of their patience and accelerated their way to an immensely productive partnership of 160 from 28.5 overs before Taylor chipped to Jason Holder at mid-on off the bowling of Chris Gayle to depart for 69. Williamson remained steady as ever and went on to record his second consecutive hundred, following on from his knock against South Africa on Wednesday. The Kiwi skipper eventually fell for 148, amassing over half of his side’s final total of 291. Cottrell was West Indies’ leading man, finishing with figures of 4/56.
With Evin Lewis suffering an injury to his hamstring, Shai Hope joined Chris Gayle at the crease to begin the chase, but the right-hander perished early to Trent Boult, and Nicholas Pooran followed him back to the sheds not long after. Gayle took a liking to Matt Henry’s bowling and his partnership with Shimron Hetmyer – featuring some monstrous sixes – saw West Indies take control of the match. The game then swung back in the Black Caps’ favour as Lockie Ferguson interrupted with the removal of Hetmyer with an incredible slower ball that initiated a collapse of five wickets for 22 runs.54
Unless you are from the UK, the Indian Subcontinent, Australia, New Zealand, South Africa, or the Caribbean, there is little chance you understood anything meaningful in this text! However, some of the words look familiar: ball, total, match, game, and runs. These words give you a hint about the topic underlying the text. This looks like a sports article discussing a game between South Africa and West Indies. Indeed, after a quick Google search, you find out that this is a game of Cricket.
Note that in order to determine the topic underlying the text, you used some of the words in the text. An article describing a game of Cricket is likely to use words associated with Cricket. However, some of these words may also appear in an article about Baseball, so there is some uncertainty in your mind about the topic of the text. You decide to assign 60-40 probabilities to Cricket and Baseball.
LDA works on a similar principle. It assumes a data-generating process under which a document is generated from a mix of latent topics and the words that pertain to those topics. As such, LDA treats an article as a “bag of words” and ignores the ordering of those words. Thus, for LDA, both of these sentences are the same:
Williamson remained steady as ever and went on to record his second consecutive hundred, following on from his knock against South Africa on Wednesday.
and
steady remained Williamson as ever and record on to went his hundred second consecutive Wednesday against Africa South on, knock following on from his.
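The bag-of-words idea can be made concrete with a few lines of Python: counting the words in each sentence (here with the standard library's `Counter`) yields identical representations for both orderings.

```python
from collections import Counter

sentence = ("Williamson remained steady as ever and went on to record "
            "his second consecutive hundred")
shuffled = ("steady remained Williamson as ever and record on to went "
            "his hundred second consecutive")

# A bag of words is simply a count of each word, so any
# reordering of the same words produces an identical bag.
bag_a = Counter(sentence.lower().split())
bag_b = Counter(shuffled.lower().split())

print(bag_a == bag_b)  # True
```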
LDA assumes that each document is a mixture of topics, with the topic of each word drawn from a multinomial distribution. However, the probabilities of this multinomial distribution are not fixed across documents: LDA assumes that they follow a Dirichlet distribution. Thus, for each document, the topics are random draws from a multinomial distribution whose probabilities are, in turn, a random draw from a Dirichlet distribution. This choice of distribution is due to mathematical convenience, as the Dirichlet distribution is a conjugate prior to the multinomial distribution. As a result, the posterior distribution of the topic probabilities is Dirichlet too, which significantly simplifies the inference problem.55
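A minimal sketch of this two-stage generative process, using NumPy; the number of topics, the Dirichlet hyperparameter `alpha`, and the document length are all illustrative values, not ones prescribed by LDA.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics = 3
alpha = np.full(n_topics, 0.5)  # illustrative Dirichlet hyperparameter

# Stage 1: the document's topic proportions are one draw
# from the Dirichlet prior (they sum to 1).
theta = rng.dirichlet(alpha)

# Stage 2: the topic of each word in the document is a draw from
# the multinomial (categorical) distribution with probabilities theta.
doc_length = 20
topic_assignments = rng.choice(n_topics, size=doc_length, p=theta)

print(theta)              # topic proportions for this document
print(topic_assignments)  # per-word topic draws
```

Because different documents get different draws of `theta`, each document ends up with its own mix of topics, even though all documents share the same Dirichlet prior.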
Our task in topic modeling using LDA can be broken down into the following steps:
1. Create a corpus of multiple text documents.
2. Preprocess the text to remove numbers, stop words, punctuation, etc. Additionally, use stemming.
3. Decide the number of topics and fit LDA on the corpus. The number of topics is a hyperparameter to tune.
4. Get the most common words defining each topic. Give them meaningful labels.
For a partial mathematical treatment please refer to the original paper cited above.↩