Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Year of publication: 2020
Volume: 98
Issue: 22
Pages: 3583-3596
Publisher: Journal of Theoretical and Applied
Abstract: In this paper, we introduce a novel unsupervised method for keyword extraction based on non-smooth nonnegative matrix factorization. We generate a document-term matrix from a given corpus and factorize it into the product of two matrices: documents-by-topics and topics-by-terms. In our method, we choose a low factorization rank (k = 3, 4, 5) and use only the topics-by-terms matrix to extract the top N keywords for each of the k topics. We then merge the resulting N*k keywords into a single keyword list, excluding duplicates, and assign keywords to documents. We validate our method on a large text corpus: the textbook "Introduction to Information Retrieval" (by Manning, Raghavan and Schütze), available online. The results of our method are compared with three popular unsupervised keyword extraction algorithms: TextRank, RAKE and YAKE. The experiments confirm that the proposed method shows promising performance in terms of precision, recall and F-measure across varying numbers of candidate keywords.
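
The keyword selection and merging steps described in the abstract can be illustrated with a minimal sketch. It assumes a topics-by-terms matrix is already available from some factorization; the factorization itself (non-smooth NMF) is not shown. The function names, the toy vocabulary, and the matrices `H` and `V` are hypothetical and not taken from the paper. The abstract does not say how keywords are assigned to documents, so the rule used here (a keyword is assigned to a document when the term occurs in it) is an assumption.

```python
def top_terms_per_topic(topic_term, vocab, n):
    """Return the n highest-weighted terms for each topic (row)."""
    tops = []
    for row in topic_term:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        tops.append([vocab[j] for j in ranked[:n]])
    return tops


def merge_keywords(tops):
    """Merge per-topic lists into one list, dropping duplicates
    while keeping first-seen order."""
    seen, merged = set(), []
    for terms in tops:
        for term in terms:
            if term not in seen:
                seen.add(term)
                merged.append(term)
    return merged


def assign_keywords(doc_term, vocab, keywords):
    """Assumed rule (not from the paper): a keyword is assigned to a
    document if its count in that document is nonzero."""
    index = {term: j for j, term in enumerate(vocab)}
    return [[kw for kw in keywords if row[index[kw]] > 0] for row in doc_term]


# Toy data, hypothetical: 3 topics (k = 3), 5 terms, 3 documents.
vocab = ["index", "query", "ranking", "cluster", "vector"]
H = [  # topics-by-terms weights
    [0.9, 0.7, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.6, 0.0, 0.3],
    [0.0, 0.1, 0.2, 0.9, 0.5],
]
V = [  # document-term counts
    [3, 1, 0, 0, 0],
    [0, 2, 4, 0, 1],
    [0, 0, 0, 5, 2],
]

tops = top_terms_per_topic(H, vocab, n=2)    # N = 2 keywords per topic
keywords = merge_keywords(tops)              # at most N*k entries
assigned = assign_keywords(V, vocab, keywords)
```

In this toy case "query" ranks in the top two for both of the first two topics, so the merged list holds five keywords rather than N*k = 6.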