A SURVEY ON TOPIC MODELLING

(1)

International Journal of Recent Advances in Engineering & Technology (IJRAET)

ISSN (Online): 2347 - 2812, Volume-1, Issue - 2, 2013 57

A SURVEY ON TOPIC MODELLING

1M.Divya, ²K.Thendral, ³Dr.S.Chitrakala

1Post Graduate, ²Post Graduate, ³Associate Professor

1,2,3Anna University, Chennai.

1[email protected], ²[email protected], ³[email protected]

Abstract: Topic models have been widely used to identify topics in text corpora. Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of texts. They provide a simple way to analyse large volumes of unlabelled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. One key challenge in topic modelling is to develop a fast solution for indexing high dimensional data which is crucial to build large scale application data.

This survey highlights on the current probabilistic topic models. It also point out scope and challenges in topic modelling.

I. INTRODUCTION

The traditional method of learning about the particular topic is to read a book or a survey paper related to the topic. To buy a book is a time consuming and is inconvenient .This traditional method is not been applicable in most of the cases because of the topics and the technologies emerge in the fast growing world.

To know the depth knowledge about a particular topic among the collection of documents that has become increasingly important and popular. No one will spend time by writing a book on some particular topic. To have knowledge of such emerging topic one can resort to research papers. But the research papers are not understandable by non-researchers and few research papers cover all aspects of the topic, there came into existence of topic modelling.Topic modelling is the way of extracting common topics that occurs among the collection of documents. A document may have depth about a particular topic or multiple topics with different proportions. By using mathematical framework topic model examines each and every documents and discovers based on statistical words in each documents.

Fig 1: The classification of topic models

II. RELATED WORK

Topic modelling have been started with Topic detection and Tracking(TDT) project [2],[3], and extensively studied in the literature. TDT is used to find and track topic from a sequence of news sequences and the technique followed is clustering based. Later came into existence of Probabilistic Latent semantic analysis (PLSA) [4], Latent Dirichlet Allocation (LDA) and their derivatives.

Hoffman et.Al [3] has proposed PLSI is a statistical model for analyzing two modes and finding the co- occurrence of data. This model aims creating each document in corpus, where each word in a document is sampled from a mixture of multinomial distribution that can be interpreted as topics, and proportions corresponding to mixture weights are sampled from a separate multinomial distribution for each document.

Problem with PLSI has mixture parameter of each document that has too many parameters.

Bag of Word Sequences of

word

Supervise d

Unsupervise d

Supervis ed

Unsupervise d Topic

Model

(2)

International Journal of Recent Advances in Engineering & Technology (IJRAET)

Fig2: PLSI plate notation

Bhei et.Al [4] used each document in the LDA to be viewed as the mixture of topics. The problem of PLSI can be overcome by LDA by increasing the number of estimates by using Dirichlet prior distribution. The complexity of LDA is more when compared to PLSI that helps to find exact inferences from generative model. To efficiently cope with this problem, several approximate inference algorithms are derived such as Variational Inference, and various Markov Chain Monte Carlo algorithms, such as Gibbs Sampling. LDA is more expensive than PLSI and LDA does not have some of the features such as more complex model like finding relation between the topics.

Fig3: LDA plate notation

Figure 4: A comparison of time and space efficiency between Gibbs sampling and sparse LDA algorithm

Bhei et.Al [6] introduced hierarchical LDA which is the LDA’s extension model is Hierarchical LDA and is introduced by Bhei in 2003. LDA model a flat topic structure instead HLDA models tree of topics.

Hierarchical LDA uses non-parametric Bayesian approach to model hierarchies. Tree of topic is constructed hierarchically of nodes by an algorithm. In topic tree model each and every node is represented by random number and has got corresponding word-topic distribution assigned to it. The tree can be traversed from the root till its leaves while sampling topics along the path.

Bhei et.Al [7] has proposed Dynamically Topic model (DTM) which enables to model topic over evolution of time.DTM is an enhancement to LDA. The advantage of DTM when compared to other probabilistic topic modelling algorithm is that it tracks topic over a period of time. Problem with DTM is that has fixed number of topics and a discrete notion of time.

Steyvers et.Als [9] used Correlated Topic Model (CTM) as a probabilistic topic model that enhances LDA with modelling of correlation between topics.

CorrLDA canmodel complex structure of underlying topics in textual, and provide a graph representation of topic relationships, as opposed to LDA model that imposes a strong mutual independence assumption on topics, thus making it more expressive. CorrLDA by Blei [10] provides more expressivity than LDA approach and generally can make a better fit to some textual corpora due to more assumptions made.

Wei Li and Andrew Mccallum [10] introduced PAM.

PAM is similar to that of CorrLDA in correlation between topics. As opposed to CorrLDA where topic correlations are modelled using covariance matrix representing pair wise correlations between topics, PAM redefines the concept of topic as a distribution not only over words, but as a distribution over word and other topics also.

Author Topic Model (ATM) is a generative probabilistic topic model introduced by M. R. Zvi [11]

et al in ACM Trans. Inf. Syst., in 2010. ATM is dependent on metadata associated with each document in corpus. Instead of modeling only document-topic and topic-word distributions. Author-topic and topic- word distributions can be learned using Markov Chain Monte Carlo algorithm based on generative model.

III. ANALYSIS OF TOPIC MODELLING

Analysis and comparison of various topic models techniques, methodology and how each model is classified and its features are provided in the following table. The table shows that the topic model are classified based on two scenarios supervised and unsupervised and the methodology used for topic models were bag of words and sequence of words approach. Unsupervised bag of words approach for topic models are used in information retrieval, document clustering

(3)

International Journal of Recent Advances in Engineering & Technology (IJRAET)

and summarization due to their simplicity. Supervised model for topic model are used in supervised manner, using preassigned labels for training set while seeking to increase result specificality for domain of interest using some sort of in-domain knowledge .

S.NO Year Author Proposed Method

Methodology Classification Ability

1. 1999 Hoffman PLSI Bag of Words Unsupervised Number of topics found 2. 2003 Bhei LDA Bag of Words Unsupervised Arbitrary word level features 3. 2003 Bhei hLDA Bag of words Unsupervised Topic hierarchy, number of

topics not found 4. 2006 Bhei DTM Bag of words Unsupervised Time evolution of topic 5. 2006 Bhei CorrLDA Bag of words Unsupervised Topic correlations as matrix 6. 2006 We Li PAM Bag of words Unsupervised Topic correlations as DAG 7. 2007 Bhei sLDA Bag of words Supervised Arbitrary word level features 8. 2006 Wallach biLDA Sequence of words Unsupervised Bigram

8. 2010 Zhu sCTRF Sequence of words Unsupervised Arbitrary word level features

Table.1 Analysis of Topic model

IV. PARAMETERS USED FOR EXPERIMENTAL EVALUATION:

The standard parameters which are used for experimental evaluation are precision, recall and accuracy.

Precision is defined as number of retrieved relevant documents divided by total number of retrieved documents and the recall is the number of retrieved relevant document divided by total number of relevant documents in the database. Accuracy can be calculated as relevant document retrieved in top T returns divided by T. The formulas for calculation of these evaluation parameters can be given as following:

𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝑠 𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝑠 𝑅𝑒𝑐𝑎𝑙𝑙 =𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡

𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝑠 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 =𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝑑𝑜𝑐𝑢𝑒𝑚𝑛𝑡𝑠 𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 𝑖𝑛 𝑡𝑜𝑝 𝑇

𝑇

Perplexity is a commonly used intrinsic evaluation metric in the topic model literature (Blei et al. 2003;

Griffiths and Steyvers 2004). Perplexity originates from language modelling, and is calculated by estimating the probability of words in held out/test data based on training data.

WhereM is the number of documents in test; wm are the words in document m; and

Nm is the number of words in document m.

(a) Perplexity

(b)Avg. precision of top 60

(4)

International Journal of Recent Advances in Engineering & Technology (IJRAET)

(c) Avg.recall of top 100

Figure 5 . Graphs showing the perplexity, average recall, the average precision as in [9]

V. INFERENCES MADE:

 Variational Bayesian method provides an analytical approximation to the posterior probability of the unobserved variables.

 Gibbs sampling is the accurate method for randomly initializing topic variables for the new documents .

 The efficiency of Gibbs sampling is increased by fast evaluation of the sampling distribution over topics for a given token.

 Time and memory efficient LDA together with sparse LDA improves the sampling performance.

 OnlineLda fits topic models to massive data and downloads random Wikipedia articles and fits a topic model to them.

 Ctr(Collaborative modelling for recommendation) recommend items to users based on item content and other users' ratings

VI. CHALLENGES:

The implementation of topic modelling raises several research challenges. Some of these are:

 Labelling a multinomial topic model

 Handling large number of documents

 Creating large number of topics

 Scalability

 Mismatch of the representation of a topic model and a label.

 Accurately interpreting meaning of each topic.

 Retrievals accuracy

 Inferring hidden topic structure.

 Continuous dynamic topic model (cDTM) resolves discretization.

VII. APPLICATIONS:

Topic models have seen applications in diverse fields and areas:

Domain Topic Model

Multimedia LDA,PLSI

Biometrics PAM

Image Retrieval ,Bio- Medical

LDA

Machine Learning, Robotics

hLDA

Automated Recommended System

CorrLDA

 In natural language processing tasks such as word sense induction (Brody and Lapata 2009)

 Document summarisation (Haghighi and Vanderwende 2009)

 Text segmentation (Sun et al.2008)

 In linguistic analysis (Daume III 2009)

 In informational retrieval (Wei and Croft 2006)

 In biomedical research (Konietzny et al. 2011).

VIII. CONCLUSION:

This paper reviewed the topic modelling which highlights the probabilistic topic models. These models make explicit assumptions about the causal process responsible for generating a document, and enable the use of sophisticated statistical methods to identify the latent structure that underlies a set of words.

Consequently, it is easy to explore different representations of words and documents, and to develop richer models capable of capturing more of the content of language. The proposed framework is not only useful in practice but also valuable to machine learning because human beings learn knowledge and even topics change over time and past knowledge is not discarded but used to solve new problems.

(5)

International Journal of Recent Advances in Engineering & Technology (IJRAET)

REFERENCES:

[1] A. David, J. Li, L. Zhou, and F. Muhammad,

“Knowledge discovery through directed probabilistic topic models: a survey,” Frontiers of Computer Science in China, vol. 4, no. 2, pp.

280–301, Jun. 2010.

[2] David M. Blei. Introduction to Probabilistic Topic Models. Communications of the ACM, 2011 pp.

[3] Mark Steyvers, Tom Griffiths. Probabilistic Topic Models. In Landauer

[4] Zhu, Jun, and Eric P Xing. “Conditional Topic

Random Fields.” Forbes. Ed.

JohannesFürnkranz& Thorsten Joachims.

[5] Steven Abney and Marc Light. 1999. “Hiding a semantic hierarchy in a markov model”. In Proceedings of the Workshop on Unsupervised Learning in Natural Language Processing, pages 1–8.

[6] T. L. Griffiths, M. Steyvers, D. M. Blei, and J.

B. Tenenbaum, “Integrating topics and syntax,”

In Advances in Neural Information Processing Systems 17, vol. 17, 2005, pp. 537–544.

[7] A. Gruber, M. Rosen-Zvis, and Y. Weiss,

“Hidden topic markov models,” in Artificial Intelligence and Statistics, 2007.

[8] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. Suppl 1, pp. 5228–5235, Apr. 2004.

[9] Wang Wei, PayamBarnaghi, “Probabilistic Topic Models for LearningTerminological Ontologies” Member, IEEE, and AndrzejBargiela, Member, IEEE.

[10] X. Wang, A. McCallum, and X. Wei, “Topical n-grams: Phrase and topic discovery, with an application to information retrieval,” in ICDM 2007: Proceeding of the seventh IEEE International Conference On Data Mining, Ed., 2007, pp. 697–702.

[11] X. Wang and A. McCallum, “Topics over time:

a non-Markov continuous-time model of topical trends,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, ser.

KDD ’06. New York, NY, USA: ACM, 2006, pp. 424–433.

[12] D. M. Blei and J. D. Lafferty, “Dynamic topic models,” in Proceedings of the 23rd international conference on Machine learning, ser. ICML ’06. New York, NY, USA: ACM, 2006, pp. 113–120.

[13] D. Blei, T. Gri, M. Jordan, and J. Tenenbaum,

“Hierarchical topic models and the nested Chinese restaurant process,” 2003.

[14] D. M. Blei and J. D. Lafferty. (2006) Correlated topic models.

[15] T. Hofmann. (1999) Probabilistic latent semantic indexing.

[16] S. Deerwester, S. T. Dumais, G. W. Furnas, T.

K. Landauer, and R.Harshman,“Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol.

41, pp. 391–407, 1990.

[17] D. M. Blei and J. D. Mcauliffe, “Supervised topic models,”, in Proceedings of th Neural Information Processing Systems – NIPS, 2007.

[18] D. Mimno and A. McCallum, “Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression,” in Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI ’08), 2008.

[19] M. R. Zvi, C. Chemudugunta, T. Griffiths, P.

Smyth, and M. Steyvers, “Learning author-topic models from text corpora,” ACM Trans. Inf.

Syst., vol. 28, no. 1, pp. 1–38, Jan. 2010.

[20] D. M. Blei and P. J. Moreno, “Topic segmentation with an aspect hidden markov model,” in Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR ’01. New York, NY, USA:

ACM, 2001, pp. 343–348.

[21] J.P. Yamron, I. Carp, L. Gillick, S. Lowe, and P.

van Mulbregt, “A Hidden Markov Model Approach to Text Segmentation and Event Tracking”, Proceedings ICASSP-98, Seattle, May 1998.

[22] Y. Wang, H. Bai, M. Stanton, W. Y. Chen, and E. Y. Chang, “PLDA: Parallel latent dirichlet allocation for Large-Scale applications,” in Proceedings of the 5th International Conference on Algorithmic Aspects in Information and Management, ser. AAIM ’09, vol. 5564. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 301–

314.

[23] W. Li and A. McCallum, “Pachinko allocation:

DAG-structured mixture models of topic correlations,” in Proceedings of the 23rd international conference on Machine learning, ser. ICML ’06. New York, NY, USA: ACM, 2006, pp. 577–58