Report on Modeling Discrete Dynamic Topics
Zijing Zhang (Zing)
5/21/2019 Report on Modeling Discrete Dynamic Topics 1
Issue
How to model topic trends? *
• Medium
• Academic paper
• Blog post
• Tweet
• Data Stream
• Continuous
• Discrete
• Change-detection
• Low delay
Dataset Input [1]
• Signal Media
• a publicly available dataset
• 985,867 articles
• Average length 405 words
• Collected from September 1 to September 30, 2015
• Preprocessing: removed stop words, URLs, tokens not starting with a letter, punctuation marks, and words occurring fewer than 5 times
• NFCWorld Twitter channel
• emerging technologies
• the Twitter API
• 3,374 tweets
• Preprocessing: removed stop words, URLs, hashtag signs, tokens not starting with a letter, punctuation marks, and words occurring fewer than 3 times
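The preprocessing steps listed for both datasets can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in [1]: the stop-word list, the punctuation set, and the function names `tokenize` and `preprocess` are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the source does not name one).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "for"}
URL_RE = re.compile(r"https?://\S+")

def tokenize(text):
    """Lowercase, drop URLs and hashtag signs, keep tokens that start with a letter."""
    text = URL_RE.sub(" ", text.lower()).replace("#", " ")
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()[]{}")  # strip surrounding punctuation
        if token and token[0].isalpha() and token not in STOP_WORDS:
            tokens.append(token)
    return tokens

def preprocess(docs, min_count):
    """Tokenize every document, then drop words occurring fewer than min_count times."""
    tokenized = [tokenize(d) for d in docs]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]
```

Under these assumptions, `min_count` would be 5 for the Signal Media articles and 3 for the NFCWorld tweets.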
Vocabulary *
• Generative Model
A model of the joint probability distribution that describes how the data were generated.
• Dirichlet Distribution
A probability distribution over vectors of category probabilities that sum to 1.
• Kalman filter
A linear quadratic estimator that fuses a series of noisy observations to estimate the hidden state of a variable.
• Logit
The logarithm of the odds p/(1-p).
• Logistic Normal Distribution
A probability distribution over values in (0, 1) whose logit is normally distributed.
• Bayesian Information Criterion
A model selection criterion that trades goodness of fit (likelihood) against the number of parameters; lower is better.
• Latent Variable
A variable that is not observed directly but is inferred from observed variables.
• Hidden Markov Model
A model that infers the hidden state sequence of one random variable from the emission sequence of another, observable random variable.
• Negative Log Likelihood
The surprisal of the observed data under a model; lower values indicate a better fit.
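A few of the terms above (logit, logistic normal distribution, negative log likelihood) can be made concrete with a short sketch. It covers only the scalar case, and the function names are chosen for this example rather than taken from any of the references.

```python
import math
import random

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Inverse of logit: maps a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sample_logistic_normal(mu, sigma, rng):
    """Draw x ~ Normal(mu, sigma) and return sigmoid(x); its logit is normal by construction."""
    return sigmoid(rng.gauss(mu, sigma))

def negative_log_likelihood(probs):
    """Sum of surprisals -log(p) over the probabilities a model assigned to the observed events."""
    return -sum(math.log(p) for p in probs)
```

A model that assigns higher probabilities to what was actually observed gets a lower negative log likelihood, which is why the quantity can be read as the model's surprise.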
* : own interpretation
Latent Dirichlet Allocation [4]
A generative model that explains observations through unobserved (latent) groups, the topics. *
Each document is a distribution over topics. * Each topic is a distribution over words. *
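The generative story behind LDA can be sketched in a few lines: draw a topic mixture for the document from a Dirichlet, then for each word pick a topic from that mixture and a word from that topic. This is a simplified illustration with fixed, hand-written topics; the function names and the symmetric prior `alpha` are assumptions for the example, and no inference is performed.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """topics: list of {word: probability} dicts, one per topic."""
    theta = sample_dirichlet([alpha] * len(topics), rng)  # document's topic mixture
    words = []
    for _ in range(n_words):
        k = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])  # pick a word from that topic
    return theta, words
```

Fitting LDA to real data runs this story in reverse: the words are observed, and the topic mixtures and topic-word distributions are inferred.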
Dynamic Topic Model [1]
A generative model that analyzes how topics evolve over consecutive time slices of a document stream. *
• Divides data into different time slices by using the document timestamps.
• Models the topics of each time slice with LDA, starting from the first slice.
• Uses Kalman filter to compute the evolution of each topic over time.
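Two of the steps above can be illustrated in isolation: bucketing documents into time slices by timestamp, and a Kalman filter tracking a slowly drifting quantity from noisy observations. The filter below is a scalar random-walk simplification chosen for readability, not the filtering procedure used inside DTM; all names and default values are assumptions for the example.

```python
from collections import defaultdict
from datetime import date

def slice_by_day(docs, start, days_per_slice):
    """docs: list of (date, text). Returns {slice_index: [text, ...]}."""
    slices = defaultdict(list)
    for stamp, text in docs:
        slices[(stamp - start).days // days_per_slice].append(text)
    return dict(slices)

def kalman_1d(observations, process_var, obs_var, mean0=0.0, var0=1.0):
    """Scalar Kalman filter for a random-walk state observed with noise."""
    mean, var = mean0, var0
    estimates = []
    for z in observations:
        var += process_var            # predict: state may drift, uncertainty grows
        gain = var / (var + obs_var)  # Kalman gain: how much to trust the observation
        mean += gain * (z - mean)     # update the estimate toward the observation
        var *= (1.0 - gain)           # uncertainty shrinks after the update
        estimates.append(mean)
    return estimates
```

In this simplified picture, each observation would be some per-slice statistic of a topic (for example a word's weight), and the filtered estimates give a smoothed trajectory across slices.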
Pros & Cons of Prior Work
• LDA [2]
• Pros:
• Effective assumptions for topic modeling
• Cons:
• The number of topics must be fixed in advance
• Does not capture correlations among topics
• DTM [1]
• Pros:
• Connects the same topics over time
• Cons:
• Lag in detecting new topics. [1]
• Assumes topics are continuous over time. [1]
• Effectiveness of chaining with Dirichlet prior *
Discrete Dynamic Topic Model
Uses BIC to discover the best number of topic chains. [1]
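Selecting a model size with BIC amounts to fitting one model per candidate size and keeping the one with the lowest score. The sketch below shows only the scoring and selection step, using the standard formula BIC = k ln(n) - 2 ln(L); how dDTM fits each candidate is not covered here, and the function names are assumptions for the example.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def best_by_bic(candidates, n_obs):
    """candidates: {label: (log_likelihood, n_params)}. Returns the label with the lowest BIC."""
    return min(candidates, key=lambda c: bic(candidates[c][0], candidates[c][1], n_obs))
```

The penalty term k ln(n) grows with the number of parameters, so a larger model is preferred only when its likelihood gain outweighs the added complexity.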
Performance of dDTM
Result of Signal Media Dataset
Result of NFCWorld Dataset
Future [1]
• Apply dDTM to analyzing trending topics and compare it against other methodologies developed in this domain
• Compare different methods for computing HMM state probabilities, such as Gaussian mixture models
• Study the allocation of time slice lengths
• Conduct a more in-depth analysis of dDTM, including an analysis of the topic chains' quality with respect to human judgment
References
[1] https://dl.acm.org/citation.cfm?doid=3019612.3019673
[2] https://www.youtube.com/watch?v=DWJYZq_fQ2A
[3] https://www.sciencedirect.com/topics/pharmacology-toxicology-and-pharmaceutical-science/bayesian-information-criterion
[4] https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
• [num] : intellectual property owner, as listed in the references
• * : own interpretation