Journal: International Journal of Advanced Computer Science and Applications (IJACSA)
Print ISSN: 2158-107X
Online ISSN: 2156-5570
Publication year: 2019
Volume: 10
Issue: 12
DOI: 10.14569/IJACSA.2019.0101253
Publisher: Science and Information Society (SAI)
Abstract: The earliest modifications of Latent Dirichlet Allocation (LDA) in terms of word or document attributes relaxed its exchangeability assumption via the bag-of-words (BoW) matrix. Several authors have proposed modifications of the original LDA that focus on models in which the current topic depends on the words from the previous topic. Most of the earlier work ignored the document length distribution, on the assumption that its effect would fizzle out at the modelling stage. In this paper, the Poisson document length distribution of the LDA model is therefore replaced with the Generalized Poisson (GP) distribution, which is able to capture more complex structures. The main strength of GP is that it captures both overdispersed (variance larger than the mean) and underdispersed (variance smaller than the mean) count data. The Poisson distribution used by LDA relies strongly on the assumption that the mean and variance of document lengths are equal. This assumption is often unrealistic for real-life text data, where the variance of document length may be greater than or less than its mean. Approximate estimates of the GPLDA model parameters were obtained by applying the Newton-Raphson technique to the log-likelihood. A comparative performance analysis of GPLDA and LDA using accuracy and F1 showed improved results for GPLDA.
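The contrast between Poisson and Generalized Poisson document lengths can be illustrated with a small sketch. The abstract does not state which GP parametrization the paper uses, so the code below assumes the common Consul-Jain form, P(X = x) = θ(θ + λx)^(x-1) e^(-(θ + λx)) / x!, with mean θ/(1 - λ) and variance θ/(1 - λ)^3. The function names (`gp_log_pmf`, `gp_log_likelihood`, `fit_gp_newton`) and the sample `doc_lengths` are hypothetical. The sketch fits only a standalone length distribution by Newton-Raphson on the likelihood equation for λ; it is not the authors' GPLDA estimation procedure.

```python
import math


def gp_log_pmf(x, theta, lam):
    """Log-pmf of the Generalized Poisson law (Consul-Jain form, an assumption):
    P(X = x) = theta * (theta + lam*x)**(x-1) * exp(-(theta + lam*x)) / x!
    """
    base = theta + lam * x
    if theta <= 0 or base <= 0:
        return float("-inf")  # outside the support for these parameters
    return math.log(theta) + (x - 1) * math.log(base) - base - math.lgamma(x + 1)


def gp_log_likelihood(lengths, theta, lam):
    """Log-likelihood of i.i.d. document lengths under the GP model."""
    return sum(gp_log_pmf(x, theta, lam) for x in lengths)


def fit_gp_newton(lengths, tol=1e-10, max_iter=100):
    """Maximum-likelihood fit of (theta, lam) by Newton-Raphson.

    Setting the log-likelihood gradient to zero gives theta = mean * (1 - lam),
    which leaves a single equation in lam:
        sum_i x_i (x_i - 1) / (mean + (x_i - mean) * lam) - n * mean = 0
    No step-size safeguard is included; this is a sketch, not robust code.
    """
    n = len(lengths)
    mean = sum(lengths) / n
    var = sum((x - mean) ** 2 for x in lengths) / (n - 1)
    lam = 1.0 - math.sqrt(mean / var)  # method-of-moments starting value
    for _ in range(max_iter):
        denom = [mean + (x - mean) * lam for x in lengths]
        score = sum(x * (x - 1) / d for x, d in zip(lengths, denom)) - n * mean
        slope = -sum(
            x * (x - 1) * (x - mean) / d ** 2 for x, d in zip(lengths, denom)
        )
        step = score / slope
        lam -= step
        if abs(step) < tol:
            break
    return mean * (1.0 - lam), lam


# Hypothetical document lengths (token counts), clearly overdispersed.
doc_lengths = [3, 5, 8, 12, 4, 20, 7, 6, 15, 2, 9, 30, 5, 11, 4]
sample_mean = sum(doc_lengths) / len(doc_lengths)

theta_hat, lam_hat = fit_gp_newton(doc_lengths)
print("theta:", theta_hat, "lambda:", lam_hat)
print("GP mean:", theta_hat / (1 - lam_hat))
print("GP variance:", theta_hat / (1 - lam_hat) ** 3)
# Setting lam = 0 recovers the ordinary Poisson model with rate = sample mean.
print("Poisson log-lik:", gp_log_likelihood(doc_lengths, sample_mean, 0.0))
print("GP log-lik:", gp_log_likelihood(doc_lengths, theta_hat, lam_hat))
```

For data like these, where the sample variance is several times the mean, the fitted λ is positive, so the GP variance exceeds its mean, whereas the Poisson fit (λ = 0) forces the two to be equal. Underdispersed data would instead push λ below zero, where this form of the GP requires additional truncation that the sketch does not handle.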